http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
http://www.fintools.com/wp-content/uploads/2012/02/StochasticStockPriceModeling.pdf
Log-normal Distributions across the Sciences: Keys and Clues

ECKHARD LIMPERT, WERNER A. STAHEL, AND MARKUS ABBT

ON THE CHARMS OF STATISTICS, AND HOW MECHANICAL MODELS RESEMBLING GAMBLING MACHINES OFFER A LINK TO A HANDY WAY TO CHARACTERIZE LOG-NORMAL DISTRIBUTIONS, WHICH CAN PROVIDE DEEPER INSIGHT INTO VARIABILITY AND PROBABILITY—NORMAL OR LOG-NORMAL: THAT IS THE QUESTION
As the need grows for conceptualization, formalization, and abstraction in biology, so too does mathematics' relevance to the field (Fagerström et al. 1996). Mathematics is particularly important for analyzing and characterizing random variation of, for example, size and weight of individuals in populations, their sensitivity to chemicals, and time-to-event cases, such as the amount of time an individual needs to recover from illness.
The frequency distribution of such data is a major factor determining the type of statistical analysis that can be validly carried out on any data set. Many widely used statistical methods, such as ANOVA (analysis of variance) and regression analysis, require that the data be normally distributed, but only rarely is the frequency distribution of data tested when these techniques are used.
The Gaussian (normal) distribution is most often assumed to describe the random variation that occurs in the data from many scientific disciplines; the well-known bell-shaped curve can easily be characterized and described by two values: the arithmetic mean x̄ and the standard deviation s, so that data sets are commonly described by the expression x̄ ± s. A historical example of a normal distribution is that of chest measurements of Scottish soldiers made by Quetelet, Belgian founder of modern social statistics (Swoboda 1974). In addition, such disparate phenomena as milk production by cows and random deviations from target values in industrial processes fit a normal distribution.
However, many measurements show a more or less skewed distribution. Skewed distributions are particularly common when mean values are low, variances large, and values cannot be negative, as is the case, for example, with species abundance, lengths of latent periods of infectious diseases, and distribution of mineral resources in the Earth's crust. Such skewed distributions often closely fit the log-normal distribution (Aitchison and Brown 1957, Crow and Shimizu 1988, Lee 1992, Johnson et al. 1994, Sachs 1997). Examples fitting the normal distribution, which is symmetrical, and the log-normal distribution, which is skewed, are given in Figure 1. Note that body height fits both distributions.
Often, biological mechanisms induce log-normal distributions (Koch 1966), as when, for instance, exponential growth is combined with further symmetrical variation: With a mean concentration of, say, 10⁶ bacteria, one cell division more—or less—will lead to 2 × 10⁶—or 5 × 10⁵—cells. Thus, the range will be asymmetrical—to be precise, multiplied or divided by 2 around the mean. The skewed size distribution may be why "exceptionally" big fruit are reported in journals year after year in autumn. Such exceptions, however, may well be the rule: Inheritance of fruit and flower size has long been known to fit the log-normal distribution (Groth 1914, Powers 1936, Sinnot 1937).
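This mechanism is easy to sketch numerically. In the illustrative Python snippet below (our own, not from the article; the mean of 20 doublings and the spread of ±1 doubling are arbitrary choices), a symmetrically varying number of cell divisions produces multiplicatively varying, right-skewed cell counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from one cell; the number of doublings varies symmetrically (normally).
divisions = rng.normal(loc=20, scale=1, size=100_000)
cells = 2.0 ** divisions

# One division more or less multiplies or divides the count by 2,
# so counts are symmetrical on the log scale but skewed on the original scale.
print(np.median(cells), np.mean(cells))   # mean exceeds median: right skew
print(np.std(np.log2(cells)))             # close to 1 doubling
```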
What is the difference between normal and log-normal
variability? Both forms of variability are based on a variety
of forces acting independently of one another. A major
difference, however, is that the effects can be additive or
multiplicative, thus leading to normal or log-normal
distributions, respectively.
Eckhard Limpert (e-mail: Eckhard.Limpert@ipw.agrl.ethz.ch) is a biologist and senior scientist in the Phytopathology Group of the Institute of Plant Sciences in Zurich, Switzerland. Werner A. Stahel (e-mail: stahel@stat.math.ethz.ch) is a mathematician and head of the Consulting Service at the Statistics Group, Swiss Federal Institute of Technology (ETH), CH-8092 Zürich, Switzerland. Markus Abbt is a mathematician and consultant at FJA Feilmeier & Junker AG, CH-8008 Zürich, Switzerland. © 2001 American Institute of Biological Sciences.
[Figure 1: histograms of frequency vs. height (a) and frequency vs. HMF concentration (b).]
Figure 1. Examples of normal and log-normal distributions. While the distribution of the heights of 1052 women (a, in inches; Snedecor and Cochran 1989) fits the normal distribution, with a goodness-of-fit p value of 0.75, that of the content of hydroxymethylfurfurol (HMF, mg·kg⁻¹) in 1573 honey samples (b; Renner 1970) fits the log-normal (p = 0.41) but not the normal (p = 0.0000). Interestingly, the distribution of the heights of women fits the log-normal distribution equally well (p = 0.74).
Some basic principles of additive and multiplicative effects can easily be demonstrated with the help of two ordinary dice with sides numbered from 1 to 6. Adding the two numbers, which is the principle of most games, leads to values from 2 to 12, with a mean of 7, and a symmetrical frequency distribution. The total range can be described as 7 plus or minus 5 (that is, 7 ± 5) where, in this case, 5 is not the standard deviation. Multiplying the two numbers, however, leads to values between 1 and 36 with a highly skewed distribution. The total variability can be described as 6 multiplied or divided by 6 (that is, 6 ×/ 6). In this case, the symmetry has moved to the multiplicative level.
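The contrast between the two dice experiments is easy to reproduce; the short Python sketch below (our own code, not part of the article) tabulates all 36 outcomes for sums and for products:

```python
import itertools
from collections import Counter

faces = range(1, 7)
sums = Counter(a + b for a, b in itertools.product(faces, repeat=2))
products = Counter(a * b for a, b in itertools.product(faces, repeat=2))

# Sums: symmetrical around 7, range 7 ± 5.
print(sorted(sums.items()))
# Products: skewed, range 1 to 36, symmetrical only multiplicatively (6 ×/ 6).
print(sorted(products.items()))
```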
Although these examples are neither normal nor log-normal distributions, they do clearly indicate that additive and multiplicative effects give rise to different distributions. Thus, we cannot describe both types of distribution in the same way. Unfortunately, however, common belief has it that quantitative variability is generally bell shaped and symmetrical. The current practice in science is to use symmetrical bars in graphs to indicate standard deviations or errors, and the sign ± to summarize data, even though the data or the underlying principles may suggest skewed distributions (Factor et al. 2000, Keesing 2000, Le Naour et al. 2000, Rhew et al. 2000). In a number of cases the variability is clearly asymmetrical because subtracting three standard deviations from the mean produces negative values, as in the example 100 ± 50. Moreover, the example of the dice shows that the established way to characterize symmetrical, additive variability with the sign ± (plus or minus) has its equivalent in the handy sign ×/ (times or divided by), which will be discussed further below.
Log-normal distributions are usually characterized in terms of the log-transformed variable, using as parameters the expected value, or mean, of its distribution, and the standard deviation. This characterization can be advantageous as, by definition, log-normal distributions are symmetrical again at the log level.
Unfortunately, the widespread aversion to statistics becomes even more pronounced as soon as logarithms are involved. This may be the major reason that log-normal distributions are so little understood in general, which leads to frequent misunderstandings and errors. Plotting the data can help, but graphs are difficult to communicate orally. In short, current ways of handling log-normal distributions are often unwieldy.
To get an idea of a sample, most people
prefer to think in terms of the original
rather than the log-transformed data. This
conception is indeed feasible and advisable
for log-normal data, too, because the familiar
properties of the normal distribution
have their analogies in the log-normal distribution.
To improve comprehension of
log-normal distributions, to encourage
their proper use, and to show their importance in life, we
present a novel physical model for generating log-normal
distributions, thus filling a 100-year-old gap. We also
demonstrate the evolution and use of parameters allowing
characterization of the data at the original scale.
Moreover, we compare log-normal distributions from a variety of branches of science to elucidate patterns of variability, thereby reemphasizing the importance of log-normal distributions in life.
A physical model demonstrating the
genesis of log-normal distributions
There was reason for Galton (1889) to complain about colleagues who were interested only in averages and ignored random variability. In his thinking, variability was even part of the "charms of statistics." Consequently, he presented a simple physical model to give a clear visualization of binomial and, finally, normal variability and its derivation.
Figure 2a shows a further development of this "Galton board," in which particles fall down a board and are deviated at decision points (the tips of the triangular obstacles) either left or right with equal probability. (Galton used simple nails instead of the isosceles triangles shown here, so his invention resembles a pinball machine or the Japanese game Pachinko.) The normal distribution created by the board reflects the cumulative additive effects of the sequence of decision points.
A particle leaving the funnel at the top meets the tip of the first obstacle and is deviated to the left or right by a distance c with equal probability. It then meets the corresponding triangle in the second row, and is again deviated in the same manner, and so forth. The deviation of the particle from one row to the next is a realization of a random variable with possible values +c and –c, and with equal probability for both of them. Finally, after passing r rows of triangles, the particle falls into one of the r + 1 receptacles at the bottom. The probabilities of ending up in these receptacles, numbered 0, 1, ..., r, follow a binomial law with parameters r and p = 0.5. When many particles have made their way through the obstacles, the height of the particles piled up in the several receptacles will be approximately proportional to the binomial probabilities.
Figure 2. Physical models demonstrating the genesis of normal and log-normal distributions. Particles fall from a funnel onto tips of triangles, where they are deviated to the left or to the right with equal probability (0.5) and finally fall into receptacles. The medians of the distributions remain below the entry points of the particles. If the tip of a triangle is at distance x from the left edge of the board, triangle tips to the right and to the left below it are placed at x + c and x – c for the normal distribution (panel a), and at x · c′ and x/c′ for the log-normal (panel b, patent pending), c and c′ being constants. The distributions are generated by many small random effects (according to the central limit theorem) that are additive for the normal distribution and multiplicative for the log-normal. We therefore suggest the alternative name multiplicative normal distribution for the latter.
For a large number of rows, the probabilities approach a normal density function according to the central limit theorem. In its simplest form, this mathematical law states that the sum of many (r) independent, identically distributed random variables is, in the limit as r → ∞, normally distributed. Therefore, a Galton board with many rows of obstacles shows normal density as the expected height of particle piles in the receptacles, and its mechanism captures the idea of a sum of r independent random variables.
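The additive board can be simulated directly; the short Python sketch below (our own code, not from the article; r, c, and the number of particles are arbitrary choices) sums r deviations of ±c for many particles and checks that the spread matches the central-limit prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, n_particles = 50, 1.0, 100_000

# Each particle accumulates r independent deviations of +c or -c.
steps = rng.choice([-c, +c], size=(n_particles, r))
positions = steps.sum(axis=1)

# The sample mean is near 0 and the standard deviation near c * sqrt(r),
# as the central limit theorem predicts for a sum of r such variables.
print(positions.mean(), positions.std(), c * np.sqrt(r))
```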
Figure 2b shows how Galton's construction was modified to describe the distribution of a product of such variables, which ultimately leads to a log-normal distribution. To this aim, scalene triangles are needed (although they appear to be isosceles in the figure), with the longer side to the right. Let the distance from the left edge of the board to the tip of the first obstacle below the funnel be x_m. The lower corners of the first triangle are at x_m · c and x_m/c (ignoring the gap necessary to allow the particles to pass between the obstacles). Therefore, the particle meets the tip of a triangle in the next row at X = x_m · c or X = x_m/c, with equal probabilities for both values. In the second and following rows, the triangles with the tip at distance x from the left edge have lower corners at x · c and x/c (up to the gap width). Thus, the horizontal position of a particle is multiplied in each row by a random variable with equal probabilities for its two possible values c and 1/c.
Once again, the probabilities of particles falling into any receptacle follow the same binomial law as in Galton's device, but because the receptacles on the right are wider than those on the left, the height of accumulated particles forms a skewed "histogram," with its peak shifted toward the left and a long tail to the right. For a large number of rows, the heights approach a log-normal distribution. This follows from the multiplicative version of the central limit theorem, which proves that the product of many independent, identically distributed, positive random variables has approximately a log-normal distribution. Computer implementations of the models shown in Figure 2 also are available at the Web site http://stat.ethz.ch/vis/log-normal (Gut et al. 2000).
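A matching sketch for the multiplicative board of Figure 2b (again our own Python, with illustrative values for r, c, and the entry point x_m): each particle's horizontal position is multiplied r times by c or 1/c, so its logarithm performs the additive walk above and the final positions are approximately log-normal.

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, x_m, n_particles = 50, 1.1, 100.0, 100_000

# Each row multiplies the position by c or 1/c with equal probability.
factors = rng.choice([c, 1.0 / c], size=(n_particles, r))
positions = x_m * factors.prod(axis=1)

# On the log scale this is the additive board again, so log(position) is
# approximately normal and the median stays at the entry point x_m.
print(np.median(positions))         # close to 100
print(np.mean(np.log(positions)))   # close to log(100)
```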
[Figure 3: density of a log-normal distribution on the original scale (a; µ* = 100, σ* = 2; marks at µ* and µ* × σ*) and on the logarithmic scale (b; µ = 2, σ = 0.301; marks at µ and µ + σ).]
Figure 3. A log-normal distribution with original scale (a) and with logarithmic scale (b). Areas under the curve, from the median to both sides, correspond to one and two standard deviation ranges of the normal distribution.
J. C. Kapteyn designed the direct predecessor of the log-normal machine (Kapteyn 1903, Aitchison and Brown 1957). For that machine, isosceles triangles were used instead of the skewed shape described here. Because the triangles' width is proportional to their horizontal position, this model also leads to a log-normal distribution. However, the isosceles triangles with increasingly wide sides to the right of the entry point have a hidden logical disadvantage: The median of the particle flow shifts to the left. In contrast, there is no such shift and the median remains below the entry point of the particles in the log-normal board presented here (which was designed by author E. L.). Moreover, the isosceles triangles in the Kapteyn board create additive effects at each decision point, in contrast to the multiplicative, log-normal effects apparent in Figure 2b. Consequently, the log-normal board presented here is a physical representation of the multiplicative central limit theorem in probability theory.
Basic properties of log-normal distributions
The basic properties of the log-normal distribution were established long ago (Weber 1834, Fechner 1860, 1897, Galton 1879, McAlister 1879, Gibrat 1931, Gaddum 1945), and it is not difficult to characterize log-normal distributions mathematically. A random variable X is said to be log-normally distributed if log(X) is normally distributed (see the box below). Only positive values are possible for the variable, and the distribution is skewed, with a long tail toward high values (Figure 3a).
Two parameters are needed to specify a log-normal distribution. Traditionally, the mean µ and the standard deviation σ (or the variance σ²) of log(X) are used (Figure 3b). However, there are clear advantages to using "back-transformed" values (the values are in terms of x, the measured data):

(1) µ* := e^µ,  σ* := e^σ.
We then use X ~ Λ(µ*, σ*) as a mathematical expression meaning that X is distributed according to the log-normal law with median µ* and multiplicative standard deviation σ*.
The median of this log-normal distribution is med(X) = µ* = e^µ, since µ is the median of log(X). Thus, the probability that the value of X is greater than µ* is 0.5, as is the probability that the value is less than µ*. The parameter σ*, which we call multiplicative standard deviation, determines the shape of the distribution. Figure 4 shows density curves for some selected values of σ*. Note that µ* is a scale parameter; hence, if X is expressed in different units (or multiplied by a constant for other reasons), then µ* changes accordingly but σ* remains the same.
Figure 4. Density functions of selected log-normal distributions compared with a normal distribution. Log-normal distributions Λ(µ*, σ*) shown for five values of the multiplicative standard deviation σ* (1.2, 1.5, 2.0, 4.0, and 8.0) are compared with the normal distribution (100 ± 20, shaded). The values cover most of the range evident in Table 2. While the median µ* is the same for all densities, the modes approach zero with increasing shape parameter σ*. A change in µ* affects the scaling in horizontal and vertical directions, but the essential shape remains the same.
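For readers who prefer to experiment, a minimal Python sketch (our own; it relies on NumPy's log-normal generator, which is parameterized by µ = ln µ* and σ = ln σ*) illustrates the median and the scale-parameter property:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma_star = 100.0, 2.0   # median and multiplicative standard deviation

x = rng.lognormal(mean=np.log(mu_star), sigma=np.log(sigma_star), size=1_000_000)

print(np.median(x))                # close to mu_star = 100
# mu_star is a scale parameter: rescaling the data rescales the median
# but leaves the multiplicative standard deviation unchanged.
y = 10.0 * x
print(np.median(y))                # close to 1000
print(np.exp(np.std(np.log(x), ddof=1)),   # both close to sigma_star = 2
      np.exp(np.std(np.log(y), ddof=1)))
```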
Distributions are commonly characterized by their expected value µ and standard deviation σ. In applications for which the log-normal distribution adequately describes the data, these parameters are usually less easy to interpret than the median µ* (McAlister 1879) and the shape parameter σ*. It is worth noting that σ* is related to the coefficient of variation by a monotonic, increasing transformation (see the box below, eq. 2).
For normally distributed data, the interval µ ± σ covers a probability of 68.3%, while µ ± 2σ covers 95.5% (Table 1). The corresponding statements for log-normal quantities are [µ*/σ*, µ* · σ*] = µ* ×/ σ* (contains 68.3%) and [µ*/(σ*)², µ* · (σ*)²] = µ* ×/ (σ*)² (contains 95.5%). This characterization shows that the operations of multiplying and dividing, which we denote with the sign ×/ (times/divide), help to determine useful intervals for log-normal distributions (Figure 3), in the same way that the operations of adding and subtracting (±, or plus/minus) do for normal distributions. Table 1 summarizes and compares some properties of normal and log-normal distributions.
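These coverage values can be checked against the exact distribution function; the sketch below (our code, using SciPy's lognorm, whose scale argument equals e^µ = µ*) is one way to do it:

```python
import numpy as np
from scipy.stats import lognorm

mu_star, sigma_star = 100.0, 2.0
sigma = np.log(sigma_star)
dist = lognorm(s=sigma, scale=mu_star)     # SciPy's scale equals exp(mu) = mu_star

for k in (1, 2, 3):
    lo, hi = mu_star / sigma_star**k, mu_star * sigma_star**k
    print(k, dist.cdf(hi) - dist.cdf(lo))  # 0.683, 0.955, 0.997
```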
The sum of several independent normal variables is itself a normal random variable. For quantities with a log-normal distribution, however, multiplication is the relevant operation for combining them in most applications; for example, the product of concentrations determines the speed of a simple chemical reaction. The product of independent log-normal quantities also follows a log-normal distribution. The median of this product is the product of the medians of its factors. The formula for σ* of the product is given in the box below (eq. 3).
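A quick numerical check of this multiplication rule and of eq. 3 in the box below (our own sketch; the medians and σ* values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent log-normal quantities with medians 50 and 2,
# and multiplicative standard deviations 1.5 and 2.0.
x1 = rng.lognormal(np.log(50.0), np.log(1.5), n)
x2 = rng.lognormal(np.log(2.0), np.log(2.0), n)
prod = x1 * x2

print(np.median(prod))                     # close to 50 * 2 = 100
s_star_expected = np.exp(np.sqrt(np.log(1.5)**2 + np.log(2.0)**2))   # eq. 3
print(np.exp(np.std(np.log(prod), ddof=1)), s_star_expected)
```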
For a log-normal distribution, the most precise (i.e., asymptotically most efficient) method for estimating the parameters µ* and σ* relies on log transformation. The mean and empirical standard deviation of the logarithms of the data are calculated and then back-transformed, as in equation 1. These estimators are called x̄* and s*, where x̄* is the geometric mean of the data (McAlister 1879; eq. 4 in the box below). More robust but less efficient estimates can be obtained from the median and the quartiles of the data, as described in the box below.
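The estimators x̄* and s* of eq. 4, and the more robust quartile-based estimate of s*, translate into a few lines of Python (our own sketch of the procedure described above, not code from the article):

```python
import numpy as np
from scipy.stats import norm

def lognormal_summary(x):
    """Geometric mean x* and multiplicative standard deviation s* of a sample."""
    logs = np.log(np.asarray(x, dtype=float))
    x_star = np.exp(logs.mean())        # back-transformed mean of the logs
    s_star = np.exp(logs.std(ddof=1))   # back-transformed SD of the logs
    return x_star, s_star

def s_star_from_quartiles(x):
    """More robust but less efficient estimate of s* from the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    c = 1.0 / (2.0 * norm.ppf(0.75))    # 1/1.349
    return (q3 / q1) ** c

rng = np.random.default_rng(0)
sample = rng.lognormal(np.log(100.0), np.log(2.0), 5_000)
print(lognormal_summary(sample))        # roughly (100, 2)
print(s_star_from_quartiles(sample))    # also roughly 2
```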
As noted previously, it is not uncommon for data with a log-normal distribution to be characterized in the literature by the arithmetic mean x̄ and the standard deviation s of a sample, but it is still possible to obtain estimates for µ* and σ* (see the box below).
Definition and properties of the log-normal distribution

A random variable X is log-normally distributed if log(X) has a normal distribution. Usually, natural logarithms are used, but other bases would lead to the same family of distributions, with rescaled parameters. The probability density function of such a random variable has the form

f(x) = 1/(√(2π) · σ · x) · exp(−(ln x − µ)² / (2σ²)),  x > 0.

A shift parameter can be included to define a three-parameter family. This may be adequate if the data cannot be smaller than a certain bound different from zero (cf. Aitchison and Brown 1957, page 14). The mean and variance are exp(µ + σ²/2) and (exp(σ²) − 1) · exp(2µ + σ²), respectively, and therefore the coefficient of variation is

(2) cv = √(exp(σ²) − 1),

which is a function of σ only.

The product of two independent log-normally distributed random variables, with shape parameters σ₁* and σ₂*, has the shape parameter

(3) σ* = exp(√((ln σ₁*)² + (ln σ₂*)²)),

since the variances of the log-transformed variables add.

Estimation: The asymptotically most efficient (maximum likelihood) estimators are

(4) x̄* = exp((1/n) Σᵢ ln xᵢ),  s* = exp(√(Σᵢ (ln xᵢ − ln x̄*)² / (n − 1))).

The quartiles q₁ and q₂ (lower and upper) lead to a more robust estimate (q₂/q₁)^c for s*, where 1/c = 1.349 = 2 · Φ⁻¹(0.75), Φ⁻¹ denoting the inverse standard normal distribution function. If the mean x̄ and the standard deviation s of a sample are available, i.e., the data are summarized in the form x̄ ± s, the parameters µ* and s* can be estimated from them by using

µ̂* = x̄ / √(cv² + 1)  and  ŝ* = exp(√(ln(cv² + 1))),

respectively, with cv = s/x̄ the coefficient of variation. Thus, this estimate of s* is determined only by the cv (eq. 2).
Table 1. A bridge between normal and log-normal distributions.

Property                             Normal distribution                      Log-normal distribution
                                     (Gaussian, or additive                   (multiplicative
                                     normal, distribution)                    normal distribution)
Effects (central limit theorem)      Additive                                 Multiplicative
Shape of distribution                Symmetrical                              Skewed
Models
  Triangle shape                     Isosceles                                Scalene
  Effects at each decision point     x ± c                                    x ×/ c
Characterization
  Mean                               x̄, arithmetic                            x̄*, geometric
  Standard deviation                 s, additive                              s*, multiplicative
  Measure of dispersion              cv = s/x̄                                 s*
Interval of confidence
  68.3%                              x̄ ± s                                    x̄* ×/ s*
  95.5%                              x̄ ± 2s                                   x̄* ×/ (s*)²
  99.7%                              x̄ ± 3s                                   x̄* ×/ (s*)³
Notes: cv = coefficient of variation; ×/ = times/divide, corresponding to plus/minus for the established sign ±.

For example, Stehmann and De Waard (1996) describe their data as log-normal, with the arithmetic mean x̄ and standard deviation s as 4.1 ± 3.7. Taking the log-normal nature of the distribution into account, the probability of the corresponding x̄ ± s interval (0.4 to 7.8) turns out to be 88.4% instead of 68.3%. Moreover, 65% of the population are below the mean and almost exclusively within only one standard deviation. In contrast, the proposed characterization, which uses the geometric mean x̄* and the multiplicative standard deviation s*, reads 3.0 ×/ 2.2 (1.36 to 6.6). This interval covers approximately 68% of the data and thus is more appropriate than the other interval for the skewed data.
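The calculation behind this example can be reproduced from the summary 4.1 ± 3.7 alone, using the conversion formulas in the box above (our Python sketch; SciPy's lognorm is used for the exact probabilities):

```python
import numpy as np
from scipy.stats import lognorm

x_bar, s = 4.1, 3.7                           # arithmetic mean ± standard deviation
cv = s / x_bar

mu_star = x_bar / np.sqrt(cv**2 + 1)          # estimated median (geometric mean), about 3.0
s_star = np.exp(np.sqrt(np.log(cv**2 + 1)))   # multiplicative standard deviation, about 2.2
print(mu_star, s_star)

dist = lognorm(s=np.log(s_star), scale=mu_star)
print(dist.cdf(x_bar + s) - dist.cdf(x_bar - s))                  # about 0.88, not 0.683
print(dist.cdf(mu_star * s_star) - dist.cdf(mu_star / s_star))    # about 0.683
```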
Comparing log-normal distributions
across the sciences
Examples of log-normal distributions from various branches of science reveal interesting patterns (Table 2). In general, values of s* vary between 1.1 and 33, with most in the range of approximately 1.4 to 3. The shapes of such distributions are apparent by comparison with selected instances shown in Figure 4.
Geology and mining.
In the Earth's crust, the concentration of elements and their radioactivity usually follow a log-normal distribution. In geology, values of s* in 27 examples varied from 1.17 to 5.6 (Razumovsky 1940, Ahrens 1954, Malanca et al. 1996); nine other examples are given in Table 2. A closer look at extensive data from different reefs (Krige 1966) indicates that values of s* for gold and uranium increase in concert with the size of the region considered.
Human medicine.
A variety of examples from medicine fit the log-normal distribution. Latent periods (time from infection to first symptoms) of infectious diseases have often been shown to be log-normally distributed (Sartwell 1950, 1952, 1966, Kondo 1977); approximately 70% of 86 examples reviewed by Kondo (1977) appear to be log-normal. Sartwell (1950, 1952, 1966) documents 37 cases fitting the log-normal distribution. A particularly impressive one is that of 5914 soldiers inoculated on the same day with the same batch of faulty vaccine, 1005 of whom developed serum hepatitis.
Interestingly, despite considerable differences in the median x̄* of latency periods of various diseases (ranging from 2.3 hours to several months; Table 2), the majority of s* values were close to 1.5. It might be worth trying to account for the similarities and dissimilarities in s*. For instance, the small s* value of 1.24 in the example of the Scottish soldiers may be due to limited variability within this rather homogeneous group of people. Survival time after diagnosis of four types of cancer is, compared with latent periods of infectious diseases, much more variable, with s* values between 2.5 and 3.2 (Boag 1949, Feinleib and McMahon 1960). It would be interesting to see whether x̄* and s* values have changed in accord with the changes in diagnosis and treatment of cancer in the last half century. The age of onset of Alzheimer's disease can be characterized with the geometric mean x̄* of 60 years and s* of 1.16 (Horner 1987).
Environment.
The distribution of particles, chemicals,
and organisms in the environment is often log-normal. For
example, the amounts of rain falling from seeded and unseeded
clouds differed significantly (Biondini 1976), and
again s* values were similar (seeding itself accounts for the
greater variation with seeded clouds). The parameters for
the content of hydroxymethylfurfurol in honey (see Figure 1b)
show that the distribution of the chemical in 1573 samples can
be described adequately with just the two values. Ott (1978)
presented data on the Pollutant Standard Index, a measure of
air quality. Data were collected for eight US cities; the extremes
of x̄* and s* were found in Los Angeles, Houston, and Seattle,
allowing interesting comparisons.
Atmospheric sciences and aerobiology.
Another component of air quality is its content of microorganisms, which was—not surprisingly—much higher and less variable in the air of Marseille than in that of an island (Di Giorgio et al. 1996). The atm