Manifold primer

arg min and arg max

For a real-valued function $f$ with domain $X$, $\arg\min_{x \in X} f(x)$ is the set of elements in $X$ that achieve the global minimum of $f$ in $X$,
$$\arg\min_{x \in X} f(x) = \{x \in X : f(x) \le f(y)\ \text{for all } y \in X\},$$
and $\arg\max$ is defined analogously for the global maximum.
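As a small illustration of the definition (my own, not part of the quoted source): in R, which.min() returns the index of a minimizer over a finite set of values, and optimize() searches for the minimizer of a function over an interval.

# arg min over a finite domain: index of an element achieving the minimum
f.values <- c(3.2, 1.5, 4.7, 1.5)
which.min(f.values)                 # 2, the first position attaining the minimum

# arg min of f(x) = (x - 2)^2 over the interval [0, 5]
optimize(function(x) (x - 2)^2, interval = c(0, 5))$minimum   # approximately 2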
Jean-Luc Starck, Fionn Murtagh, Emmanuel J. Candès, and David L. Donoho. Gray and color image contrast enhancement by the curvelet transform. IEEE Transactions on Image Processing, 12(6):706–717, 2003.
Keywords: wavelet coefficients, wavelet compression.
A wavelet is
a wave-like oscillation with an amplitude that begins at zero, increases, and
then decreases back to zero. It can typically be visualized as a "brief
oscillation" like one might see recorded by a seismograph or heart monitor.
MULTISCALE REPRESENTATIONS FOR MANIFOLD-VALUED DATA
Inam Ur Rahman, Iddo Drori, Victoria C. Stodden, David L. Donoho, Peter Schröder
Abstract. We describe multiscale representations for data observed on equispaced grids and taking values in manifolds such as: the sphere S², the special orthogonal group SO(3), the positive definite matrices SPD(n), and the Grassmann manifolds G(n, k). The representations are based on the deployment of Deslauriers–Dubuc and average-interpolating pyramids 'in the tangent plane' of such manifolds, using the Exp and Log maps of those manifolds. The representations provide 'wavelet coefficients' which can be thresholded, quantized, and scaled much as traditional wavelet coefficients. Tasks such as compression, noise removal, contrast enhancement, and stochastic simulation are facilitated by this representation. The approach applies to general manifolds, but is particularly suited to the manifolds we consider, i.e. Riemannian symmetric spaces such as Sⁿ⁻¹, SO(n), G(n, k), where the Exp and Log maps are effectively computable. Applications to manifold-valued data sources of a geometric nature (motion, orientation, diffusion) seem particularly immediate. A software toolbox, SymmLab, can reproduce the results discussed in this paper.
爱: "广义相对性原理 (wyle, 杨进一步发展, (微观)局域规范变换不变規範場)对自然界定律作了一些广泛而具明确性的限制"
super
simply speaking: 广义相对性原理 in a manifold such as GR lorentz manifold, means some kind of gauge
field exists, but for a local observer, all he can observe is using a locally
transformed gauge and and there for his observations have to be modified to be
"right", in the context of this "gauge field"
now, this manifold has
find its way into machine learning, AI and everything else, as it should be.
My near term goal is
to have your lab boss (or somebody else) start to appreciate my work and hire
me as a contractor/consultant, as a start.
Part I. manifold in a
nutshell
1. General concepts
In differential geometry, Riemannian geometry studies smooth manifolds equipped with a Riemannian metric, that is, with a choice of quadratic form on each tangent space of the manifold. It is particularly concerned with angles, arc lengths, and volumes, obtained by adding up each infinitesimal piece to get the global quantity.
The systematic study of quadratic forms began in the 18th century. It originated in the problem of classifying conic sections and quadric surfaces: transform the equation of the curve or surface and choose the principal-axis directions as coordinate axes so as to simplify its form. This problem was introduced in the 18th century.
Gauss independently discovered the general form of the binomial theorem, the law of quadratic reciprocity in number theory, the prime number theorem, and the arithmetic–geometric mean.
Cauchy concluded in his work that, once the equation is put in standard form, quadric surfaces can be classified by the signature of the quadratic form.
Loosely speaking, Riemannian geometry, on a Riemann manifold, studies the metric given by a quadratic form on a "high-dimensional curved space"; if the quadratic-form metric does not live on a "curved high-dimensional space", then we are in a Hilbert space, etc.
I have managed to arrive at, and earn, this beautiful characterization of the Riemann manifold at the concept level after years of hard study of physics and math, covering almost all of their major areas.
The following is a Google translation:
Differential geometry, or Riemannian geometry, the geometry of a manifold with a Riemann metric, is all about the "quadratic form". It is particularly concerned with angle, arc length and volume in a high-dimensional and curved space, such as the Lorentz manifold of general relativity.
Studies of the "quadratic form" started at the beginning of the 18th century.
Gauss independently discovered the general form of the binomial theorem, the "quadratic reciprocity law" in number theory, the prime number theorem, and the arithmetic–geometric mean. Then there is the famous Cauchy–Schwarz inequality, etc.
Hilbert space studies the "quadratic form" in a high-dimensional but flat (not curved) space; however, biology and many social systems all reside in high-dimensional and curved spaces, and we have to figure out how to model them.
2. How statistics got
started: modeling and measuring of linear systems experiencing disturbances and
stresses
(https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/20203/1/Thesis-2016-Oltean.pdf)
"· unsuccessful
predictions of stock prices made by sir Isaac Newton and, consequently, his
terrible loss in 1720 of 20000 pounds in South Sea speculation bubble. · in 1738, Daniel
Bernoulli introduced the idea of utility in order to describe preferences of
people and consumer satisfaction. · successful management of the fund for the
widows of Goettingen professors, performed by Carl Friedrich Gauss. · Giovanni Ceva published
an essay “A mathematical approach of money” in 1711. · Laplace in his work
“Essai philosophique sur les probabilites” (1812) showed that what apparently
might seem random and unpredictable (such as number of letters in the Paris
dead-letter office) is predictable and obeys a simple law. · Adolphe Quetelet (a
former student of Fourier) studied the existence of patterns in data sets
ranging from the frequency of different methods for committing murder to the
chest size of Scottish men. It was him who coined the term “social physics” in
1835. 2 · explanation of the Brownian random walk and the formulation of
the ChapmanKolmogorov condition for Markovian processes by Louis Bachelier in
his PhD thesis on the theory of speculation. This was done 5 years before the
work of Smoluchowski and Einstein on diffusion, based on the observations of
price changes at Paris stockmarket. · Italian physicist Ettore Majorana wrote in 1936
a paper based on analogies between statistical physics laws and the ones from
social sciences "
Boltzmann was very
explicit: “The molecules are like individuals, ... and the properties of gases
only remain unaltered, because the number of these molecules, which on average
has a given state, is constant. In “Populäre Schriften” , Boltzmann said “This
opens a broad perspective, if we do not only think of mechanical objects. Let
us consider the application of this method to the statistics of living beings,
society, sociology and so forth.”
Regression in general
https://www-sop.inria.fr/asclepios/events/MFCA11/Proceedings/MFCA11_3_1
2 Multiple Linear Regression. Before formulating geodesic regression on general manifolds, we begin by reviewing multiple linear regression in $\mathbb{R}^n$. Here we are interested in the relationship between a non-random independent variable $X \in \mathbb{R}$ and a random dependent variable $Y$ taking values in $\mathbb{R}^n$. A multiple linear model of this relationship is given by
$$Y = \alpha + X\beta + \epsilon, \qquad (1)$$
where $\alpha \in \mathbb{R}^n$ is an unobservable intercept parameter, $\beta \in \mathbb{R}^n$ is an unobservable slope parameter, and $\epsilon$ is an $\mathbb{R}^n$-valued, unobservable random variable representing the error. Geometrically, this is the equation of a one-dimensional line through $\mathbb{R}^n$ (plus noise), parameterized by the scalar variable $X$. For the purposes of generalizing to the manifold case, it is useful to think of $\alpha$ as the starting point of the line and $\beta$ as a velocity vector. Given realizations of the above model, i.e., data $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}^n$, for $i = 1, \ldots, N$, the least squares estimates $\hat\alpha, \hat\beta$ for the intercept and slope are computed by solving the minimization problem
$$(\hat\alpha, \hat\beta) = \arg\min_{(\alpha,\beta)} \sum_{i=1}^{N} \| y_i - \alpha - x_i \beta \|^2. \qquad (2)$$
This equation can be solved analytically, yielding
$$\hat\beta = \frac{\frac{1}{N}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{N}\sum x_i^2 - \bar{x}^2}, \qquad \hat\alpha = \bar{y} - \bar{x}\hat\beta,$$
where $\bar{x}$ and $\bar{y}$ are the sample means of the $x_i$ and $y_i$, respectively. If the errors in the model are drawn from distributions with zero mean and finite variance, then these estimators are unbiased and consistent.
[Fig. 1. Schematic of the geodesic regression model: $f(x) = \mathrm{Exp}(p, xv)$ on a manifold $M$, with base point $p$ and velocity $v$.]
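To make the closed-form least-squares estimates above concrete, here is a minimal R sketch (my own illustration on simulated data, not code from the paper; the one-dimensional response and all parameter values are arbitrary assumptions) that computes the estimates by the formula and cross-checks them against R's built-in lm().

# Simulated data: scalar x, scalar response y, true intercept 2 and slope 0.5
set.seed(1)
N <- 100
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N, sd = 1)

# Closed-form least-squares estimates from the minimization problem (2)
beta.hat  <- (mean(x * y) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
alpha.hat <- mean(y) - mean(x) * beta.hat
c(alpha.hat, beta.hat)

# Cross-check against R's built-in linear model fit
coef(lm(y ~ x))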
3. It is still all
about "quadratic form": sigmoid function as the backbone of logistic regression
The probability distribution used is the cumulative logistic distribution, which is applied to cumulated income, expenditure, or wealth on the one hand and also to cumulated probabilities on the other hand. The logistic function or sigmoid function is defined as
$$f(x) = \frac{L}{1 + \exp(-k(x - x_0))}, \qquad (2.1)$$
where $L$ is the curve's maximum value, $x_0$ is the $x$-value of the sigmoid's midpoint, and $k$ is the steepness of the curve [85]. The logistic map, which is the basis for the logistic function, is used to show how complex, chaotic behaviour can arise from very simple non-linear dynamical equations [86]. We use the logistic cumulative probability distribution $C(x)$, which is defined as the integral
$$C(x) = \int_{-\infty}^{x} P(t)\, dt. \qquad (2.2)$$
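A quick R sketch of the logistic (sigmoid) function in equation (2.1); the parameter values below are arbitrary illustrations, not taken from the quoted source.

# Logistic / sigmoid function: L is the maximum, x0 the midpoint, k the steepness
logistic <- function(x, L = 1, x0 = 0, k = 1) {
  L / (1 + exp(-k * (x - x0)))
}

x <- seq(-6, 6, by = 0.1)
plot(x, logistic(x), type = "l",
     main = "Logistic (sigmoid) function", ylab = "f(x)")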
Related links:
· www.iro.umontreal.ca/~lisa/publications2/.../205 (Université de Montréal, Apr 3, 2009): "quadratic units was strongest in conjunction with sparse and ... Equation 1 looks sigmoidal as a function of E, but the sharpness of the ..."
· radar.oreilly.com/.../its-not-exponential-its-sigmoi.ht... (O'Reilly Media, Inc., Nov 26, 2007): "Exponential vs. linear or quadratic curves. ... In fact, one of the most important sigmoidal functions is the logistic function, originally developed to ..."
4. algorithms of machine learning and AI: putting it all together: statistics on manifolds
"As is common in applications, we use the Karcher mean. In practice, the Karcher mean can be efficiently computed using an iterative algorithm [113]. Let $\mu$ denote the intrinsic mean. The value the (sample) Fréchet function attains at $\mu$,
$$\frac{1}{N}\sum_{i=1}^{N} d(\mu, p_i)^2, \qquad (2.82)$$
is called the geodesic variance. The (sample) covariance is defined through the Euclidean (sample) covariance of the data as expressed in $T_\mu M$:
$$\mathrm{Cov}(\{p_i\}_{i=1}^{N}) \stackrel{\mathrm{def}}{=} \frac{1}{N-1}\sum_{i=1}^{N} \mathrm{Log}_\mu(p_i)\,\mathrm{Log}_\mu(p_i)^T. \qquad (2.83)$$
Note that the point of tangency is the intrinsic mean, $\mu$. This echoes (and in fact, generalizes) the construction of the Euclidean (sample) covariance in a Euclidean space, which is built from summing outer products of vectors following the subtraction of the Euclidean (sample) mean:
$$\mathrm{Cov}(\{p_i\}_{i=1}^{N}) \stackrel{\mathrm{def}}{=} \frac{1}{N-1}\sum_{i=1}^{N} \Big(p_i - \frac{1}{N}\sum_{j=1}^{N} p_j\Big)\Big(p_i - \frac{1}{N}\sum_{j=1}^{N} p_j\Big)^T."$$
I have more or less completed reading "dam.brown.edu/people/freifeld//phd/ThesisOrenFreifeld.pdf", which I started last Friday; I went through most of it on Saturday at the Oakland airport.
The paragraph I quoted really rings the bell: basically, geometrical "distance" is the backbone of the "statistical mean", and all kinds of statistical variances can be expressed as products of vectors, with the inner product of vectors defining your "distance" measure in a scalar field or classical vector field; the outer product of vectors becomes more challenging in a vector field such as Maxwell's electromagnetic field.
And, as I wrote before, when modeling a system residing in a high-dimensional and curved space, one has to advance the study of scalar and vector fields to the Riemann manifold, and advance statistics as we know it to the Riemann manifold, and this is exactly what is starting to happen now, with still very little literature available.
One of the difficulties is, as I wrote before, that the Riemann differentiable manifold (or one smooth enough that we can perform calculus and statistics on it) is not mathematics only; it is a multi-disciplinary blend of advanced mathematics and physics, such as general relativity and gauge field theory, which is still challenging to most physicists from linear physics disciplines, not to mention all other scientists outside of physics.
The world of science is actually very interesting, with many scientists not really knowing what they are doing.
5. topological approach to manifold
http://stat.fsu.edu/~anuj/pdf/papers/Y2009/TuragaChapterVideoManifolds.pdf
Boothby, W.M.: An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press Inc (1975)
Spivak, M.: A Comprehensive Introduction to Differential Geometry, Volume 1. Publish or Perish, Inc (1970)
3 Introduction to Manifolds. We shall first start with the topological definition of a manifold in terms of charts and atlases. Using them, we will show that $\mathbb{R}^n$ is indeed a differentiable manifold. Then, we state a theorem that defines a sub-manifold of a manifold as a solution of an equation. This shall be specialized to the case of manifolds that are actually sub-manifolds of $\mathbb{R}^n$, arising as solutions of an equation in $\mathbb{R}^n$ with some conditions. Furthermore, we will establish the notions of tangent vectors and tangent spaces on non-Euclidean manifolds. This will then allow the use of classical statistical methods on the tangent planes via the exponential map and its inverse. We shall provide specific examples to illustrate these notions.
3.1 General Background from Differential Geometry. We start by considering the definition of a general differentiable manifold. The material provided here is brief and by no means comprehensive. We refer the interested readers to two excellent books [9][38] for a more detailed introduction to differential geometry and manifold analysis. A topological space $M$ is called a differentiable manifold if, amongst other properties, it is locally Euclidean. This means that for each $p \in M$, there exists an open neighborhood $U$ of $p$ and a mapping $\phi : U \to \mathbb{R}^n$ such that $\phi(U)$ is open in $\mathbb{R}^n$ and $\phi : U \to \phi(U)$ is a diffeomorphism. The pair $(U, \phi)$ is called a coordinate chart for the points that fall in $U$; for any point $y \in U$, one can view the Euclidean coordinates $\phi(y) = (\phi_1(y), \phi_2(y), \ldots, \phi_n(y))$ as the coordinates of $y$. The dimension of the manifold $M$ is $n$. This is a way of flattening the manifold locally. Using $\phi$ and $\phi^{-1}$, one can move between the sets $U$ and $\phi(U)$ and perform calculations in the more convenient Euclidean space. If there exist multiple such charts, then they are compatible, i.e. their compositions are smooth. We look at some simple manifolds as examples.
Example 1. ($\mathbb{R}^n$ is a manifold) 1. The Euclidean space $\mathbb{R}^n$ is an $n$-dimensional differentiable manifold which can be covered by the single chart $(\mathbb{R}^n, \phi)$, $\phi(x) = x$. 2. Any open subset of a differentiable manifold is itself a differentiable manifold. A well-known example of this idea comes from linear algebra. Let $M(n)$ be the set of all $n \times n$ matrices; $M(n)$ can be identified with the set $\mathbb{R}^{n \times n}$ and is, therefore, a differentiable manifold. Define the subset $GL(n)$ as the set of non-singular matrices, i.e. $GL(n) = \{A \in M(n) \mid \det(A) \neq 0\}$, where $\det(\cdot)$ denotes the determinant of a matrix. Since $GL(n)$ is an open subset of $M(n)$, it is also a differentiable manifold.
[Fig. 1. Figure illustrating the notions of tangent spaces, tangent vectors, and geodesics.]
In order to perform differential calculus, i.e. to compute gradients, directional derivatives, critical points, etc., of functions on manifolds, one needs to understand the tangent structure of those manifolds. Although there are several ways to define tangent spaces, one intuitive approach is to consider differentiable curves on the manifold passing through the point of interest, and to study the velocity vectors of these curves at that point. To help visualize these ideas, we illustrate the notions of tangent planes and geodesics in figure 1. More formally, let $M$ be an $n$-dimensional manifold and, for a point $p \in M$, consider a differentiable curve $\gamma : (-\epsilon, \epsilon) \to M$ such that $\gamma(0) = p$. The velocity $\dot\gamma(0)$ denotes the velocity of $\gamma$ at $p$. This vector has the same dimension as the manifold $M$ itself and is an example of a tangent vector to $M$ at $p$. The set of all such tangent vectors is called the tangent space to $M$ at $p$. Even though the manifold $M$ may be nonlinear, the tangent space $T_p(M)$ is always linear and one can impose probability models on it using more traditional approaches. Example 2. 1. In case of the Euclid…
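To make the "velocity of a curve" picture concrete, here is a small R sketch of my own (not from the chapter): it takes a curve on the unit sphere S², differentiates it numerically at t = 0, and checks that the resulting velocity vector lies in the tangent plane, i.e. is orthogonal to the base point p.

# A differentiable curve on the unit sphere S^2 with gamma(0) = p = (0, 0, 1)
gamma <- function(t) c(sin(t) * cos(t), sin(t)^2, cos(t))

p <- gamma(0)

# Numerical velocity (central difference) at t = 0: a tangent vector at p
h <- 1e-6
v <- (gamma(h) - gamma(-h)) / (2 * h)

v                 # approximately (1, 0, 0)
sum(v * p)        # approximately 0: v is orthogonal to p, so v lies in T_p(S^2)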
6. algebra approach to manifold
3.1 Manifolds. "Everyone knows what a curve is, until he has studied enough mathematics to become confused through the countless number of possible exceptions." —Felix Klein
In this section we introduce one more actor in multivariable calculus. So far, our mappings have been first linear, then nonlinear with good linear approximations. But the domain and codomain of our mappings have been "flat" open subsets of $\mathbb{R}^n$. Now we want to allow "nonlinear $\mathbb{R}^n$'s," called smooth manifolds. These familiar objects are by no means simple: already, the theory of soap bubbles is a difficult topic, with a complicated partial differential equation controlling the shape of the film. Manifolds are a generalization of the familiar curves and surfaces of everyday experience. A one-dimensional manifold is a smooth curve; a two-dimensional manifold is a smooth surface. Smooth curves are idealizations of things like telephone wires or a tangled garden hose. Particularly beautiful smooth surfaces are produced when you blow soap bubbles that wobble and slowly vibrate as they drift through the air. Other examples are shown in figure 3.1.2.
[Figure 3.1.1. Felix Klein (1849–1925). Klein's work in geometry "has become so much a part of our present mathematical thinking that it is hard for us to realise the novelty of his results." —From a biographical sketch by J. O'Connor and E. F. Robertson. Klein was also instrumental in developing Mathematische Annalen into one of the most prestigious mathematical journals.]
[Figure 3.1.2. Four surfaces in $\mathbb{R}^3$. The top two are graphs of functions. The bottom two are locally graphs of functions. All four qualify as smooth surfaces (two-dimensional manifolds) under definition 3.1.2.]
We will define smooth manifolds mathematically, excluding some objects that we might think of as smooth: a figure eight, for example. We will see how to use the implicit function theorem to tell whether the locus defined by an equation is a smooth manifold. Finally, we will compare knowing a manifold by equations, and knowing it by a parametrization.
Smooth manifolds in $\mathbb{R}^n$. When is a subset $X \subset \mathbb{R}^n$ a smooth manifold? Our definition is based on the notion of graph. Remember from the discussion of set theory notation (section 0.3) that $A \times B$ is the set of pairs $(a, b)$ with $a \in A$ and $b \in B$. Here $x$ is a point in $\mathbb{R}^n$ and $y$ is a point in $\mathbb{R}^m$; the graph of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ consists of points $(x, f(x))$ in $\mathbb{R}^{n+m}$. (This is the simplest way to think of it, with the $n$ active variables coming first, followed by the $m$ passive variables.) A manifold $M$ embedded in $\mathbb{R}^n$, denoted $M \subset \mathbb{R}^n$, is sometimes called a submanifold of $\mathbb{R}^n$. Strictly speaking, it should not be referred to simply as a "manifold," which can mean an abstract manifold, not embedded in any space. The manifolds in this book are all submanifolds of $\mathbb{R}^n$. With this definition, which depends on chosen coordinates, it isn't obvious that if you rotate a smooth manifold it is still smooth. We will see in theorem 3.1.16 that it is.
Definition 3.1.1 (Graph). The graph $\Gamma(f)$ of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is the set of pairs $(x, y) \in (\mathbb{R}^n \times \mathbb{R}^m)$ such that $f(x) = y$.
You are familiar with graphs of functions $f : \mathbb{R} \to \mathbb{R}$; most often we graph such functions with the horizontal $x$-axis corresponding to the input, and the vertical axis corresponding to values of $f$ at different $x$. Note that the graph of such a function is a subset of $\mathbb{R}^2$. For example, the graph of $f(x) = x^2$ consists of the points $(x, f(x)) \in \mathbb{R}^2$, i.e., the points $(x, x^2)$. The top two surfaces shown in figure 3.1.2 are graphs of functions from $\mathbb{R}^2$ to $\mathbb{R}$: the surface on the left is the graph of $f(x, y) = x^3 - 2xy^2$; that on the right is the graph of $f(x, y) = x^2 + y^4$. Although we depict these graphs on a flat piece of paper, they are actually subsets of $\mathbb{R}^3$. The first consists of the points $(x, y, x^3 - 2xy^2)$, the second of the points $(x, y, x^2 + y^4)$. More generally, the graph of a function $f$ lives in a space whose dimension is the sum of the dimensions of the domain and codomain of $f$: the graph of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is a subset of $\mathbb{R}^n \times \mathbb{R}^m = \mathbb{R}^{n+m}$. Definition 3.1.2 says that if such a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is $C^1$, then its graph is a smooth $n$-dimensional manifold in $\mathbb{R}^{n+m}$. Thus the top two graphs shown in figure 3.1.2 are two-dimensional manifolds in $\mathbb{R}^3$. But the torus and helicoid shown in figure 3.1.2 are also two-dimensional manifolds. Neither one is the graph of a single function expressing one variable in terms of the other two. But both are locally graphs of functions.
Definition 3.1.2 (Smooth manifold in $\mathbb{R}^n$). A subset $M \subset \mathbb{R}^n$ is a smooth $k$-dimensional manifold if locally it is the graph of a $C^1$ mapping expressing $n - k$ variables as functions of the other $k$ variables.
Generally, "smooth" means "as many times differentiable as is relevant to the problem at hand." In this and the next section, it means "of class $C^1$." (Some authors use "smooth" to mean $C^\infty$: "infinitely many times differentiable." For our purposes this is overkill.) When speaking of smooth manifolds, we often omit the word smooth.
Especially in higher dimensions, making some kind of global sense of a patchwork of graphs of functions can be quite challenging indeed; a mathematician trying to picture a manifold is rather like a blindfolded person who has never met or seen a picture of an elephant seeking to identify one by patting first an ear, then the trunk or a leg. It is a subject full of open questions, some fully as interesting and demanding as, for example, Fermat's last theorem, whose solution after more than three centuries aroused such passionate interest. Three-dimensional and four-dimensional manifolds are of particular interest, in part because of applications in representing spacetime.
"Locally" means that every point $x \in M$ has a neighborhood $U \subset \mathbb{R}^n$ such that $M \cap U$ (the part of $M$ in $U$) is the graph of a mapping expressing $n - k$ of the coordinates of each point in $M \cap U$ in terms of the other $k$. This may sound like an unwelcome complication, but if we omitted the word "locally" then we would exclude from our definition most interesting manifolds. We already saw that neither the torus nor the helicoid of figure 3.1.2 is the graph of a single function expressing one variable as a function of the other two. Even such a simple curve as the unit circle is not the graph of a single function expressing one variable in terms of the other. In figure 3.1.3 we show another smooth curve that would not qualify as a manifold if we required it to be the graph of a single function expressing one variable in terms of the other; the caption justifies our claim that this curve is a smooth curve.
[Figure 3.1.3. Above, $I$ and $I_1$ are intervals on the $x$-axis; $J$ and $J_1$ are intervals on the $y$-axis. The darkened part of the curve in the shaded rectangle $I \times J$ is the graph of a function expressing $x \in I$ as a function of $y \in J$, and the darkened part of the curve in $I_1 \times J_1$ is the graph of a function expressing $y \in J_1$ as a function of $x \in I_1$. (By decreasing the size of $J_1$ a bit, we could also think of the part of the curve in $I_1 \times J_1$ as the graph of a function expressing $x \in I_1$ as a function of $y \in J_1$.) But we cannot think of the darkened part of the curve in $I \times J$ as the graph of a function expressing $y \in J$ as a function of $x \in I$; there are values of $x$ that give two different values of $y$, and others that give none, so such a "function" is not well defined.]
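A small illustration of the "locally a graph" idea in definition 3.1.2 (my own sketch, not from the book): the unit circle $x^2 + y^2 = 1$ is not globally the graph of one function, but near any point with $y \neq 0$ it is locally the graph $y = \pm\sqrt{1 - x^2}$. The R snippet below recovers $y$ as a function of $x$ near a chosen point by solving the defining equation numerically with uniroot(); the chosen point and search interval are arbitrary.

# Implicit curve F(x, y) = x^2 + y^2 - 1 = 0 (the unit circle)
F <- function(x, y) x^2 + y^2 - 1

# Near the point (0.6, 0.8), where dF/dy = 2y != 0, solve F(x, y) = 0 for y
local.graph <- function(x) {
  uniroot(function(y) F(x, y), interval = c(0.1, 1.5))$root
}

xs <- seq(0.5, 0.7, by = 0.05)
ys <- sapply(xs, local.graph)
cbind(xs, ys, check = sqrt(1 - xs^2))   # the numerical solution matches sqrt(1 - x^2)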
Topic 14
Unbiased Estimation
14.1 Introduction
In creating a parameter estimator, a fundamental question is whether or not the estimator differs from the parameter in a systematic manner. Let's examine this by looking at the computation of the mean and the variance of 16 flips of a fair coin.
Give this task to 10 individuals and ask them to report the number of heads. We can simulate this in R as follows
> (x<-rbinom(10,16,0.5))
[1] 8 5 9 7 7 9 7 8 8 10
Our estimate is obtained by taking these 10 answers and averaging them. Intuitively we anticipate an answer
around 8. For these 10 observations, we find, in this case, that
> sum(x)/10
[1] 7.8
The result is a bit below 8. Is this systematic? To assess this, we appeal to the ideas behind Monte Carlo to perform 1000 simulations of the example above.
> meanx<-rep(0,1000)
> for (i in 1:1000){meanx[i]<-mean(rbinom(10,16,0.5))}
> mean(meanx)
[1] 8.0049
From this, we surmise that the estimate of the sample mean $\bar{x}$ neither systematically overestimates nor underestimates the distributional mean. From our knowledge of the binomial distribution, we know that the mean $\mu = np = 16 \cdot 0.5 = 8$. In addition, the sample mean $\bar{X}$ also has mean
$$E\bar{X} = \frac{1}{10}(8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8) = \frac{80}{10} = 8,$$
verifying that we have no systematic error.
The phrase that we use is that the sample mean $\bar{X}$ is an unbiased estimator of the distributional mean $\mu$. Here is the precise definition.
Definition 14.1. For observations $X = (X_1, X_2, \ldots, X_n)$ based on a distribution having parameter value $\theta$, and for $d(X)$ an estimator for $h(\theta)$, the bias is the mean of the difference $d(X) - h(\theta)$, i.e.,
$$b_d(\theta) = E_\theta d(X) - h(\theta). \qquad (14.1)$$
If $b_d(\theta) = 0$ for all values of the parameter, then $d(X)$ is called an unbiased estimator. Any estimator that is not unbiased is called biased.
Example 14.2. Let $X_1, X_2, \ldots, X_n$ be Bernoulli trials with success parameter $p$ and set the estimator for $p$ to be $d(X) = \bar{X}$, the sample mean. Then,
$$E_p\bar{X} = \frac{1}{n}(EX_1 + EX_2 + \cdots + EX_n) = \frac{1}{n}(p + p + \cdots + p) = p.$$
Thus, $\bar{X}$ is an unbiased estimator for $p$. In this circumstance, we generally write $\hat{p}$ instead of $\bar{X}$. In addition, we can use the fact that for independent random variables, the variance of the sum is the sum of the variances to see that
$$\mathrm{Var}(\hat{p}) = \frac{1}{n^2}(\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)) = \frac{1}{n^2}(p(1-p) + p(1-p) + \cdots + p(1-p)) = \frac{1}{n}p(1-p).$$
Example 14.3. If $X_1, \ldots, X_n$ form a simple random sample with unknown finite mean $\mu$, then $\bar{X}$ is an unbiased estimator of $\mu$. If the $X_i$ have variance $\sigma^2$, then
$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}. \qquad (14.2)$$
We can assess the quality of an estimator by computing its mean square error, defined by
$$E_\theta[(d(X) - h(\theta))^2]. \qquad (14.3)$$
Estimators with smaller mean square error are generally preferred to those with larger. Next we derive a simple relationship between mean square error and variance. We begin by substituting (14.1) into (14.3), rearranging terms, and expanding the square.
$$E_\theta[(d(X) - h(\theta))^2] = E_\theta[(d(X) - (E_\theta d(X) - b_d(\theta)))^2] = E_\theta[((d(X) - E_\theta d(X)) + b_d(\theta))^2]$$
$$= E_\theta[(d(X) - E_\theta d(X))^2] + 2 b_d(\theta) E_\theta[d(X) - E_\theta d(X)] + b_d(\theta)^2 = \mathrm{Var}_\theta(d(X)) + b_d(\theta)^2.$$
Thus, the representation of the mean square error as equal to the variance of the estimator plus the square of the bias is called the bias-variance decomposition. In particular:
• The mean square error for an unbiased estimator is its variance.
• Bias always increases the mean square error.
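A small R simulation of the bias-variance decomposition above (my own sketch, with an arbitrary number of replications), using the divide-by-n sample variance, the biased estimator examined in the next section, for 10 observations of 16 tosses of a fair coin: the simulated mean square error agrees with variance plus bias squared.

# Monte Carlo check of MSE = Var + bias^2 for the divide-by-n variance estimator
set.seed(3)
n <- 10
sigma2 <- 16 * 0.5 * 0.5              # true variance of Bin(16, 0.5) is 4
S2 <- replicate(10000, {
  x <- rbinom(n, 16, 0.5)
  mean((x - mean(x))^2)               # sample variance dividing by n
})
mse  <- mean((S2 - sigma2)^2)          # simulated mean square error
bias <- mean(S2) - sigma2              # simulated bias
c(mse = mse, var.plus.bias2 = var(S2) + bias^2)   # the two agree closely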
14.2 Computing Bias
For the variance $\sigma^2$, we have been presented with two choices:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad \text{and} \qquad \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \qquad (14.4)$$
Using bias as our criterion, we can now resolve between the two choices for the estimators of the variance $\sigma^2$. Again, we use simulations to make a conjecture; we then follow up with a computation to verify our guess. For 16 tosses of a fair coin, we know that the variance is $np(1-p) = 16 \cdot 1/2 \cdot 1/2 = 4$.
For the example above, we begin by simulating the coin tosses and computing the sum of squares $\sum_{i=1}^{10}(x_i - \bar{x})^2$:
> ssx<-rep(0,1000)
> for (i in 1:1000){x<-rbinom(10,16,0.5);ssx[i]<-sum((x-mean(x))^2)}
> mean(ssx)
[1] 35.8511
[Figure 14.1: Histogram of ssx, the sum of squares about $\bar{x}$, for 1000 simulations.]
The choice is to divide either by 10, for the first
choice, or 9, for the second.
> mean(ssx)/10;mean(ssx)/9
[1] 3.58511
[1] 3.983456
Exercise 14.4. Repeat the simulation above, computing the sum of squares $\sum_{i=1}^{10}(x_i - 8)^2$. Show that these simulations support dividing by 10 rather than 9. Verify that $\sum_{i=1}^{n}(X_i - \mu)^2/n$ is an unbiased estimator of $\sigma^2$ for independent random variables $X_1, \ldots, X_n$ whose common distribution has mean $\mu$ and variance $\sigma^2$.
In this case, we know all the aspects of the simulation, and thus we know that the answer ought to be near 4. Consequently, division by 9 appears to be the appropriate choice. Let's check this out, beginning with what seems to be the inappropriate choice to see what goes wrong.
Example 14.5. If a simple random sample $X_1, X_2, \ldots$ has unknown finite variance $\sigma^2$, then we can consider the sample variance
$$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$
To find the mean of $S^2$, we divide the difference between an observation $X_i$ and the distributional mean into two steps: the first from $X_i$ to the sample mean $\bar{X}$ and then from the sample mean to the distributional mean, i.e.,
$$X_i - \mu = (X_i - \bar{X}) + (\bar{X} - \mu).$$
We shall soon see that the lack of knowledge of $\mu$ is the source of the bias. Make this substitution and expand the square to obtain
$$\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}\big((X_i - \bar{X}) + (\bar{X} - \mu)\big)^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + 2\sum_{i=1}^{n}(X_i - \bar{X})(\bar{X} - \mu) + \sum_{i=1}^{n}(\bar{X} - \mu)^2$$
$$= \sum_{i=1}^{n}(X_i - \bar{X})^2 + 2(\bar{X} - \mu)\sum_{i=1}^{n}(X_i - \bar{X}) + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + n(\bar{X} - \mu)^2.$$
(Check for yourself that the middle term in the third line equals 0.) Subtract the term $n(\bar{X} - \mu)^2$ from both sides and divide by $n$ to obtain the identity
$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - (\bar{X} - \mu)^2.$$
Using the identity above and the linearity property of expectation we find that
$$ES^2 = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - (\bar{X} - \mu)^2\right]$$
$$= \frac{1}{n}\sum_{i=1}^{n}E[(X_i - \mu)^2] - E[(\bar{X} - \mu)^2] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}(X_i) - \mathrm{Var}(\bar{X}) = \frac{1}{n}\,n\sigma^2 - \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2 \neq \sigma^2.$$
The last line uses (14.2). This shows that $S^2$ is a biased estimator for $\sigma^2$. Using the definition in (14.1), we can see that it is biased downwards:
$$b(\sigma^2) = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{1}{n}\sigma^2.$$
Note that the bias is equal to $-\mathrm{Var}(\bar{X})$. In addition, because
$$E\left[\frac{n}{n-1}S^2\right] = \frac{n}{n-1}E[S^2] = \frac{n}{n-1}\left(\frac{n-1}{n}\sigma^2\right) = \sigma^2,$$
$$S_u^2 = \frac{n}{n-1}S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
is an unbiased estimator for $\sigma^2$. As we shall learn in the next section, because the square root is concave downward, $S_u = \sqrt{S_u^2}$ as an estimator for $\sigma$ is downwardly biased.
Example 14.6. We have seen, in the case of $n$ Bernoulli trials having $x$ successes, that $\hat{p} = x/n$ is an unbiased estimator for the parameter $p$. This is the case, for example, in taking a simple random sample of genetic markers at a particular biallelic locus. Let one allele denote the wildtype and the second a variant. If the circumstances are such that the variant is recessive, then an individual expresses the variant phenotype only in the case that both chromosomes contain this marker. In the case of independent alleles from each parent, the probability of the variant phenotype is $p^2$. Naïvely, we could use the estimator $\hat{p}^2$. (Later, we will see that this is the maximum likelihood estimator.) To determine the bias of this estimator, note that
$$E\hat{p}^2 = (E\hat{p})^2 + \mathrm{Var}(\hat{p}) = p^2 + \frac{1}{n}p(1-p). \qquad (14.5)$$
Thus, the bias $b(p) = p(1-p)/n$ and the estimator $\hat{p}^2$ is biased upward.
Exercise 14.7. For Bernoulli trials $X_1, \ldots, X_n$,
$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{p})^2 = \hat{p}(1 - \hat{p}).$$
Based on this exercise, and the computation above yielding an unbiased estimator, $S_u^2$, for the variance,
$$E\left[\frac{1}{n-1}\hat{p}(1-\hat{p})\right] = \frac{1}{n}E\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{p})^2\right] = \frac{1}{n}E[S_u^2] = \frac{1}{n}\mathrm{Var}(X_1) = \frac{1}{n}p(1-p).$$
In other words,
$$\frac{1}{n-1}\hat{p}(1-\hat{p})$$
is an unbiased estimator of $p(1-p)/n$. Returning to (14.5),
$$E\left[\hat{p}^2 - \frac{1}{n-1}\hat{p}(1-\hat{p})\right] = \left(p^2 + \frac{1}{n}p(1-p)\right) - \frac{1}{n}p(1-p) = p^2.$$
Thus,
$$\widehat{p^2_u} = \hat{p}^2 - \frac{1}{n-1}\hat{p}(1-\hat{p})$$
is an unbiased estimator of $p^2$.
To compare the two estimators for $p^2$, assume that we find 13 variant alleles in a sample of 30; then $\hat{p} = 13/30 = 0.4333$,
$$\hat{p}^2 = \left(\frac{13}{30}\right)^2 = 0.1878, \qquad \text{and} \qquad \widehat{p^2_u} = \left(\frac{13}{30}\right)^2 - \frac{1}{29}\left(\frac{13}{30}\right)\left(\frac{17}{30}\right) = 0.1878 - 0.0085 = 0.1793.$$
The bias for the estimate $\hat{p}^2$, in this case 0.0085, is subtracted to give the unbiased estimate $\widehat{p^2_u}$.
The heterozygosity of a biallelic locus is $h = 2p(1-p)$. From the discussion above, we see that $h$ has the unbiased estimator
$$\hat{h} = \frac{2n}{n-1}\hat{p}(1-\hat{p}) = \frac{2n}{n-1}\left(\frac{x}{n}\right)\left(\frac{n-x}{n}\right) = \frac{2x(n-x)}{n(n-1)}.$$
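A quick R check of the worked example (my own sketch): with x = 13 variant alleles in a sample of n = 30, it computes the naive estimator of p², the bias-corrected estimator, and the unbiased heterozygosity estimator.

# Unbiased estimation of p^2 and of the heterozygosity h = 2p(1-p)
x <- 13; n <- 30
p.hat <- x / n

p2.naive    <- p.hat^2                                   # biased upward
p2.unbiased <- p.hat^2 - p.hat * (1 - p.hat) / (n - 1)   # bias-corrected
h.unbiased  <- 2 * x * (n - x) / (n * (n - 1))           # unbiased heterozygosity

round(c(p.hat = p.hat, p2.naive = p2.naive,
        p2.unbiased = p2.unbiased, h.unbiased = h.unbiased), 4)
# p2.naive = 0.1878 and p2.unbiased = 0.1793, matching the text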
14.3 Compensating for Bias
In the methods of moments estimation, we have used $g(\bar{X})$ as an estimator for $g(\mu)$. If $g$ is a convex function, we can say something about the bias of this estimator. In Figure 14.2, we see the method of moments estimator $g(\bar{X})$ for a parameter $\beta$ in the Pareto distribution. The choice of $\beta = 3$ corresponds to a mean of $\mu = 3/2$ for the Pareto random variables. The central limit theorem states that the sample mean $\bar{X}$ is nearly normally distributed with mean $3/2$. Thus, the distribution of $\bar{X}$ is nearly symmetric around $3/2$. From the figure, we can see that the interval from 1.4 to 1.5 under the function $g$ maps into a longer interval above $\beta = 3$ than the interval from 1.5 to 1.6 maps below $\beta = 3$. Thus, the function $g$ spreads the values of $\bar{X}$ above $\beta = 3$ more than below. Consequently, we anticipate that the estimator $\hat\beta$ will be upwardly biased.
To address this phenomenon in more general terms, we use the characterization of a convex function as a differentiable function whose graph lies above any tangent line. If we look at the value $\mu$ for the convex function $g$, then this statement becomes
$$g(x) - g(\mu) \ge g'(\mu)(x - \mu).$$
Now replace $x$ with the random variable $\bar{X}$ and take expectations.
$$E_\mu[g(\bar{X}) - g(\mu)] \ge E_\mu[g'(\mu)(\bar{X} - \mu)] = g'(\mu)E_\mu[\bar{X} - \mu] = 0.$$
Consequently,
$$E_\mu g(\bar{X}) \ge g(\mu) \qquad (14.6)$$
and $g(\bar{X})$ is biased upwards. The expression in (14.6) is known as Jensen's inequality.
Exercise 14.8. Show that the estimator $S_u$ is a downwardly biased estimator for $\sigma$.
To estimate the size of the bias, we look at a quadratic approximation for $g$ centered at the value $\mu$:
$$g(x) - g(\mu) \approx g'(\mu)(x - \mu) + \frac{1}{2}g''(\mu)(x - \mu)^2.$$
[Figure 14.2: Graph of a convex function $g(x) = x/(x-1)$ and the tangent line $y = g(\mu) + g'(\mu)(x - \mu)$. Note that the tangent line is below the graph of $g$. Here we show the case in which $\mu = 1.5$ and $\beta = g(\mu) = 3$. Notice that the interval from $x = 1.4$ to $x = 1.5$ has a longer range than the interval from $x = 1.5$ to $x = 1.6$. Because $g$ spreads the values of $\bar{X}$ above $\beta = 3$ more than below, the estimator $\hat\beta$ for $\beta$ is biased upward. We can use a second order Taylor series expansion to correct most of this bias.]
Again, replace $x$ in this expression with the random variable $\bar{X}$ and then take expectations. Then, the bias
$$b_g(\mu) = E_\mu[g(\bar{X})] - g(\mu) \approx E_\mu[g'(\mu)(\bar{X} - \mu)] + \frac{1}{2}E[g''(\mu)(\bar{X} - \mu)^2] = \frac{1}{2}g''(\mu)\mathrm{Var}(\bar{X}) = \frac{1}{2}g''(\mu)\frac{\sigma^2}{n}. \qquad (14.7)$$
(Remember that $E_\mu[g'(\mu)(\bar{X} - \mu)] = 0$.) Thus, the bias has the intuitive properties of being
• large for strongly convex functions, i.e., ones with a large value for the second derivative evaluated at the mean $\mu$,
• large for observations having high variance $\sigma^2$, and
• small when the number of observations $n$ is large.
Exercise 14.9. Use (14.7) to estimate the bias in using $\hat{p}^2$ as an estimate of $p^2$ in a sequence of $n$ Bernoulli trials and note that it matches the value (14.5).
Example 14.10. For the method of moments estimator for the Pareto random variable, we determined that
$$g(\mu) = \frac{\mu}{\mu - 1},$$
and that $\bar{X}$ has mean $\mu = \frac{\beta}{\beta-1}$ and variance $\frac{\sigma^2}{n} = \frac{\beta}{n(\beta-1)^2(\beta-2)}$.
By taking the second derivative, we see that $g''(\mu) = 2(\mu-1)^{-3} > 0$ and, because $\mu > 1$, $g$ is a convex function. Next, we have
$$g''\!\left(\frac{\beta}{\beta-1}\right) = \frac{2}{\left(\frac{\beta}{\beta-1} - 1\right)^3} = 2(\beta - 1)^3.$$
Thus, the bias
$$b_g(\beta) \approx \frac{1}{2}g''(\mu)\frac{\sigma^2}{n} = \frac{1}{2}\cdot 2(\beta-1)^3\cdot\frac{\beta}{n(\beta-1)^2(\beta-2)} = \frac{\beta(\beta-1)}{n(\beta-2)}.$$
So, for $\beta = 3$ and $n = 100$, the bias is approximately 0.06. Compare this to the estimated value of 0.053 from the simulation in the previous section.
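A minimal R simulation of my own (the book's earlier simulation is not reproduced here, and the number of replications is arbitrary) of the method of moments estimator $\hat\beta = \bar{X}/(\bar{X}-1)$ for Pareto($\beta = 3$) samples of size $n = 100$, comparing the simulated bias with the Taylor approximation $\beta(\beta-1)/(n(\beta-2)) = 0.06$.

# Method of moments estimator for the Pareto shape parameter beta
set.seed(4)
beta <- 3; n <- 100
beta.hat <- replicate(10000, {
  x <- runif(n)^(-1 / beta)          # Pareto(beta) on (1, Inf) via inverse CDF
  mean(x) / (mean(x) - 1)            # solve mu = beta/(beta - 1) for beta
})
mean(beta.hat) - beta                 # simulated bias, close to ...
beta * (beta - 1) / (n * (beta - 2))  # ... the Taylor approximation 0.06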
Example 14.11. For estimating the population in mark and recapture, we used the estimate
$$N = g(\mu) = \frac{kt}{\mu}$$
for the total population. Here $\mu$ is the mean number recaptured, $k$ is the number captured in the second capture event and $t$ is the number tagged. The second derivative
$$g''(\mu) = \frac{2kt}{\mu^3} > 0$$
and hence the method of moments estimate is biased upwards. In this situation, $n = 1$ and the number recaptured is a hypergeometric random variable. Hence its variance
$$\sigma^2 = \frac{kt}{N}\,\frac{(N-t)(N-k)}{N(N-1)}.$$
Thus, the bias
$$b_g(N) = \frac{1}{2}\,\frac{2kt}{\mu^3}\,\frac{kt}{N}\,\frac{(N-t)(N-k)}{N(N-1)} = \frac{(N-t)(N-k)}{\mu(N-1)} = \frac{(kt/\mu - t)(kt/\mu - k)}{\mu(kt/\mu - 1)} = \frac{kt(k-\mu)(t-\mu)}{\mu^2(kt-\mu)}.$$
In the simulation example, $N = 2000$, $t = 200$, $k = 400$ and $\mu = 40$. This gives an estimate for the bias of 36.02. We can compare this to the bias of $2031.03 - 2000 = 31.03$ based on the simulation in Example 13.2.
This suggests a new estimator by taking the method of moments estimator and subtracting the approximation of the bias:
$$\hat{N} = \frac{kt}{r} - \frac{kt(k-r)(t-r)}{r^2(kt-r)} = \frac{kt}{r}\left(1 - \frac{(k-r)(t-r)}{r(kt-r)}\right).$$
The delta method gives us that the standard deviation of the estimator is $|g'(\mu)|\sigma/\sqrt{n}$. Thus the ratio of the bias of an estimator to its standard deviation as determined by the delta method is approximately
$$\frac{g''(\mu)\sigma^2/(2n)}{|g'(\mu)|\sigma/\sqrt{n}} = \frac{1}{2}\,\frac{g''(\mu)}{|g'(\mu)|}\,\frac{\sigma}{\sqrt{n}}.$$
If this ratio is $\ll 1$, then the bias correction is not very important. In the case of the example above, this ratio is
$$\frac{36.02}{268.40} = 0.134$$
and its usefulness in correcting bias is small.
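A short R sketch of my own evaluating the bias-corrected mark-and-recapture estimator $\hat{N}$ above at the values used in the example (k = 400 captured, t = 200 tagged, r = 40 recaptured):

# Bias-corrected mark-and-recapture estimate of the population size
k <- 400; t <- 200; r <- 40

N.mm  <- k * t / r                                         # method of moments: 2000
bias  <- k * t * (k - r) * (t - r) / (r^2 * (k * t - r))   # approximately 36.02
N.hat <- N.mm - bias                                       # bias-corrected estimate
c(N.mm = N.mm, bias = round(bias, 2), N.hat = round(N.hat, 2))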
14.4 Consistency
Despite the desirability of using an unbiased estimator, sometimes such an estimator is hard to find and at other times impossible. However, note that in the examples above both the size of the bias and the variance in the estimator decrease inversely proportionally to $n$, the number of observations. Thus, these estimators improve, under both of these criteria, with more observations. A concept that describes properties such as these is called consistency.
Definition 14.12. Given data $X_1, X_2, \ldots$ and a real valued function $h$ of the parameter space, a sequence of estimators $d_n$, based on the first $n$ observations, is called consistent if for every choice of $\theta$
$$\lim_{n\to\infty} d_n(X_1, X_2, \ldots, X_n) = h(\theta)$$
whenever $\theta$ is the true state of nature.
Thus, the bias of the estimator disappears in the limit of a large number of observations. In addition, the distribution of the estimators $d_n(X_1, X_2, \ldots, X_n)$ becomes more and more concentrated near $h(\theta)$.
For the next example, we need to recall the sequence definition of continuity: A function $g$ is continuous at a real number $x$ provided that for every sequence $\{x_n;\ n \ge 1\}$ with $x_n \to x$, we have that $g(x_n) \to g(x)$. A function is called continuous if it is continuous at every value of $x$ in the domain of $g$. Thus, we can write the expression above more succinctly by saying that for every convergent sequence $\{x_n;\ n \ge 1\}$,
$$\lim_{n\to\infty} g(x_n) = g\!\left(\lim_{n\to\infty} x_n\right).$$
Example 14.13. For a method of moments estimator, let's focus on the case of a single parameter ($d = 1$). For independent observations $X_1, X_2, \ldots$ having mean $\mu = k(\theta)$, we have that
$$E\bar{X}_n = \mu,$$
i.e. $\bar{X}_n$, the sample mean for the first $n$ observations, is an unbiased estimator for $\mu = k(\theta)$. Also, by the law of large numbers, we have that
$$\lim_{n\to\infty} \bar{X}_n = \mu.$$
Assume that $k$ has a continuous inverse $g = k^{-1}$. In particular, because $\mu = k(\theta)$, we have that $g(\mu) = \theta$. Next, using the methods of moments procedure, define, for $n$ observations, the estimators
$$\hat\theta_n(X_1, X_2, \ldots, X_n) = g\!\left(\frac{1}{n}(X_1 + \cdots + X_n)\right) = g(\bar{X}_n)$$
for the parameter $\theta$. Using the continuity of $g$, we find that
$$\lim_{n\to\infty} \hat\theta_n(X_1, X_2, \ldots, X_n) = \lim_{n\to\infty} g(\bar{X}_n) = g\!\left(\lim_{n\to\infty} \bar{X}_n\right) = g(\mu) = \theta,$$
and so we have that $g(\bar{X}_n)$ is a consistent sequence of estimators for $\theta$.
14.5 Cramér–Rao Bound
This topic is somewhat more advanced and can be skipped on a first reading. This section gives us an introduction to the log-likelihood and its derivative, the score function. We shall encounter these functions again when we introduce maximum likelihood estimation. In addition, the Cramér–Rao bound, which is based on the variance of the score function, known as the Fisher information, gives a lower bound for the variance of an unbiased estimator. These concepts will be necessary to describe the variance for maximum likelihood estimators.
Among unbiased estimators, one important goal is to find an estimator that has as small a variance as possible. A more precise goal would be to find an unbiased estimator $d$ that has uniform minimum variance. In other words, $d(X)$ has a smaller variance than any other unbiased estimator $\tilde{d}$ for every value $\theta$ of the parameter:
$$\mathrm{Var}_\theta\, d(X) \le \mathrm{Var}_\theta\, \tilde{d}(X) \quad \text{for all } \theta \in \Theta.$$
The efficiency $e(\tilde{d})$ of an unbiased estimator $\tilde{d}$ is the minimum value of the ratio
$$\frac{\mathrm{Var}_\theta\, d(X)}{\mathrm{Var}_\theta\, \tilde{d}(X)}$$
over all values of $\theta$. Thus, the efficiency is between 0 and 1 with a goal of finding estimators with efficiency as near to one as possible.
For unbiased estimators, the Cramér–Rao bound tells us how small a variance is ever possible. The formula is a bit mysterious at first. However, we shall soon learn that this bound is a consequence of the bound on correlation that we have previously learned.
Recall that for two random variables $Y$ and $Z$, the correlation
$$\rho(Y, Z) = \frac{\mathrm{Cov}(Y, Z)}{\sqrt{\mathrm{Var}(Y)\mathrm{Var}(Z)}} \qquad (14.8)$$
takes values between $-1$ and $1$. Thus, $\rho(Y, Z)^2 \le 1$ and so
$$\mathrm{Cov}(Y, Z)^2 \le \mathrm{Var}(Y)\mathrm{Var}(Z). \qquad (14.9)$$
Exercise 14.14. If $EZ = 0$, then $\mathrm{Cov}(Y, Z) = EYZ$.
We begin with data $X = (X_1, \ldots, X_n)$ drawn from an unknown probability $P_\theta$. The parameter space is $\Theta \subset \mathbb{R}$. Denote the joint density of these random variables by
$$f(x|\theta), \quad \text{where } x = (x_1, \ldots, x_n).$$
In the case that the data come from a simple random sample, the joint density is the product of the marginal densities:
$$f(x|\theta) = f(x_1|\theta)\cdots f(x_n|\theta). \qquad (14.10)$$
For continuous random variables, the two basic properties of the density are that $f(x|\theta) \ge 0$ for all $x$ and that
$$1 = \int_{\mathbb{R}^n} f(x|\theta)\, dx. \qquad (14.11)$$
Now, let $d$ be the unbiased estimator of $h(\theta)$; then by the basic formula for computing expectation, we have for continuous random variables
$$h(\theta) = E_\theta d(X) = \int_{\mathbb{R}^n} d(x) f(x|\theta)\, dx. \qquad (14.12)$$
If the functions in (14.11) and (14.12) are differentiable with respect to the parameter $\theta$ and we can pass the derivative through the integral, then we first differentiate both sides of equation (14.11), and then use the logarithm function to write this derivative as the expectation of a random variable:
$$0 = \int_{\mathbb{R}^n} \frac{\partial f(x|\theta)}{\partial\theta}\, dx = \int_{\mathbb{R}^n} \frac{\partial f(x|\theta)/\partial\theta}{f(x|\theta)}\, f(x|\theta)\, dx = \int_{\mathbb{R}^n} \frac{\partial \ln f(x|\theta)}{\partial\theta}\, f(x|\theta)\, dx = E_\theta\!\left[\frac{\partial \ln f(X|\theta)}{\partial\theta}\right]. \qquad (14.13)$$
From a similar calculation using (14.12),
$$h'(\theta) = E_\theta\!\left[d(X)\,\frac{\partial \ln f(X|\theta)}{\partial\theta}\right]. \qquad (14.14)$$
Now, return to the review on correlation with $Y = d(X)$, the unbiased estimator for $h(\theta)$, and the score function $Z = \partial \ln f(X|\theta)/\partial\theta$. From equations (14.14) and then (14.9), we find that
$$h'(\theta)^2 = E_\theta\!\left[d(X)\,\frac{\partial \ln f(X|\theta)}{\partial\theta}\right]^2 = \mathrm{Cov}_\theta\!\left(d(X), \frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2 \le \mathrm{Var}_\theta(d(X))\,\mathrm{Var}_\theta\!\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right),$$
or,
$$\mathrm{Var}_\theta(d(X)) \ge \frac{h'(\theta)^2}{I(\theta)}, \qquad (14.15)$$
where
$$I(\theta) = \mathrm{Var}_\theta\!\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right) = E_\theta\!\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right]$$
is called the Fisher information. For the equality, recall that the variance $\mathrm{Var}(Z) = EZ^2 - (EZ)^2$ and recall from equation (14.13) that the random variable $Z = \partial \ln f(X|\theta)/\partial\theta$ has mean $EZ = 0$.
Equation (14.15), called the Cramér–Rao lower bound or the information inequality, states that the lower bound for the variance of an unbiased estimator is the reciprocal of the Fisher information. In other words, the higher the information, the lower is the possible value of the variance of an unbiased estimator.
If we return to the case of a simple random sample, then take the logarithm of both sides of equation (14.10),
$$\ln f(x|\theta) = \ln f(x_1|\theta) + \cdots + \ln f(x_n|\theta),$$
and then differentiate with respect to the parameter $\theta$:
$$\frac{\partial \ln f(x|\theta)}{\partial\theta} = \frac{\partial \ln f(x_1|\theta)}{\partial\theta} + \cdots + \frac{\partial \ln f(x_n|\theta)}{\partial\theta}.$$
The random variables $\{\partial \ln f(X_k|\theta)/\partial\theta;\ 1 \le k \le n\}$ are independent and have the same distribution. Using the fact that the variance of the sum is the sum of the variances for independent random variables, we see that $I_n$, the Fisher information for $n$ observations, is $n$ times the Fisher information of a single observation:
$$I_n(\theta) = \mathrm{Var}\!\left(\frac{\partial \ln f(X_1|\theta)}{\partial\theta} + \cdots + \frac{\partial \ln f(X_n|\theta)}{\partial\theta}\right) = n\,\mathrm{Var}\!\left(\frac{\partial \ln f(X_1|\theta)}{\partial\theta}\right) = n\,E\!\left[\left(\frac{\partial \ln f(X_1|\theta)}{\partial\theta}\right)^2\right].$$
Notice the correspondence. Information is linearly proportional to the number of observations. If our estimator is a sample mean or a function of the sample mean, then the variance is inversely proportional to the number of observations.
Example 14.15. For independent Bernoulli random variables with unknown success probability $\theta$, the density is
$$f(x|\theta) = \theta^x(1-\theta)^{1-x}.$$
The mean is $\theta$ and the variance is $\theta(1-\theta)$. Taking logarithms, we find that
$$\ln f(x|\theta) = x\ln\theta + (1-x)\ln(1-\theta),$$
$$\frac{\partial}{\partial\theta}\ln f(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta} = \frac{x-\theta}{\theta(1-\theta)}.$$
The Fisher information associated to a single observation is
$$I(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln f(X|\theta)\right)^2\right] = \frac{1}{\theta^2(1-\theta)^2}E[(X-\theta)^2] = \frac{1}{\theta^2(1-\theta)^2}\mathrm{Var}(X) = \frac{1}{\theta^2(1-\theta)^2}\,\theta(1-\theta) = \frac{1}{\theta(1-\theta)}.$$
Thus, the information for $n$ observations is $I_n(\theta) = n/(\theta(1-\theta))$. Thus, by the Cramér–Rao lower bound, any unbiased estimator of $\theta$ based on $n$ observations must have variance at least $\theta(1-\theta)/n$. Now, notice that if we take $d(x) = \bar{x}$, then
$$E_\theta\bar{X} = \theta, \qquad \text{and} \qquad \mathrm{Var}_\theta\, d(X) = \mathrm{Var}(\bar{X}) = \frac{\theta(1-\theta)}{n}.$$
These two equations show that $\bar{X}$ is an unbiased estimator having uniformly minimum variance.
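A small R simulation of my own (with arbitrary parameter values and replication count) comparing the variance of $\hat{p} = \bar{x}$ for Bernoulli($\theta$) samples with the Cramér–Rao lower bound $\theta(1-\theta)/n$:

# Variance of the sample mean of Bernoulli trials vs. the Cramer-Rao bound
set.seed(5)
theta <- 0.3; n <- 25
p.hat <- replicate(20000, mean(rbinom(n, 1, theta)))
c(simulated.var = var(p.hat),
  cramer.rao    = theta * (1 - theta) / n)   # the two are essentially equal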
Exercise 14.16. For independent normal random variables with known variance $\sigma_0^2$ and unknown mean $\mu$, $\bar{X}$ is a uniformly minimum variance unbiased estimator.
Exercise 14.17. Take two derivatives of $\ln f(x|\theta)$ to show that
$$I(\theta) = E_\theta\!\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right] = -E_\theta\!\left[\frac{\partial^2 \ln f(X|\theta)}{\partial\theta^2}\right]. \qquad (14.16)$$
This identity is often a useful alternative for computing the Fisher information.
Example 14.18. For an exponential random variable,
$$\ln f(x|\lambda) = \ln\lambda - \lambda x, \qquad \frac{\partial^2}{\partial\lambda^2}\ln f(x|\lambda) = -\frac{1}{\lambda^2}.$$
Thus, by (14.16),
$$I(\lambda) = \frac{1}{\lambda^2}.$$
Now, $\bar{X}$ is an unbiased estimator for $h(\lambda) = 1/\lambda$ with variance
$$\frac{1}{n\lambda^2}.$$
By the Cramér–Rao lower bound, we have that
$$\frac{h'(\lambda)^2}{nI(\lambda)} = \frac{1/\lambda^4}{n/\lambda^2} = \frac{1}{n\lambda^2}.$$
Because $\bar{X}$ has this variance, it is a uniformly minimum variance unbiased estimator.
Example 14.19. To give an estimator that does not achieve the Cramér–Rao bound, let $X_1, X_2, \ldots, X_n$ be a simple random sample of Pareto random variables with density
$$f_X(x|\beta) = \frac{\beta}{x^{\beta+1}}, \qquad x > 1.$$
The mean and the variance are
$$\mu = \frac{\beta}{\beta-1}, \qquad \sigma^2 = \frac{\beta}{(\beta-1)^2(\beta-2)}.$$
Thus, $\bar{X}$ is an unbiased estimator of $\mu = \beta/(\beta-1)$ with
$$\mathrm{Var}(\bar{X}) = \frac{\beta}{n(\beta-1)^2(\beta-2)}.$$
To compute the Fisher information, note that
$$\ln f(x|\beta) = \ln\beta - (\beta+1)\ln x \qquad \text{and thus} \qquad \frac{\partial^2 \ln f(x|\beta)}{\partial\beta^2} = -\frac{1}{\beta^2}.$$
Using (14.16), we have that
$$I(\beta) = \frac{1}{\beta^2}.$$
Next, for
$$\mu = g(\beta) = \frac{\beta}{\beta-1}, \qquad g'(\beta) = -\frac{1}{(\beta-1)^2}, \qquad \text{and} \qquad g'(\beta)^2 = \frac{1}{(\beta-1)^4}.$$
Thus, the Cramér–Rao bound for the estimator is
$$\frac{g'(\beta)^2}{I_n(\beta)} = \frac{\beta^2}{n(\beta-1)^4},$$
and the efficiency compared to the Cramér–Rao bound is
$$\frac{g'(\beta)^2/I_n(\beta)}{\mathrm{Var}(\bar{X})} = \frac{\beta^2}{n(\beta-1)^4}\cdot\frac{n(\beta-1)^2(\beta-2)}{\beta} = \frac{\beta(\beta-2)}{(\beta-1)^2} = 1 - \frac{1}{(\beta-1)^2}.$$
The Pareto distribution does not have a variance unless $\beta > 2$. For $\beta$ just above 2, the efficiency compared to its Cramér–Rao bound is low but improves with larger $\beta$.
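A one-line R illustration of my own of the efficiency formula above for a few values of $\beta$:

# Efficiency of xbar relative to the Cramer-Rao bound for Pareto(beta), beta > 2
beta <- c(2.1, 2.5, 3, 5, 10)
round(1 - 1 / (beta - 1)^2, 3)   # low near beta = 2, approaches 1 as beta grows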
14.6 A Note on Efficient Estimators
For an efficient estimator, we need to find the cases that lead to equality in the correlation inequality (14.8). Recall that equality occurs precisely when the correlation is $\pm 1$. This occurs when the estimator $d(X)$ and the score function $\partial \ln f_X(X|\theta)/\partial\theta$ are linearly related with probability 1:
$$\frac{\partial}{\partial\theta}\ln f_X(X|\theta) = a(\theta)d(X) + b(\theta).$$
After integrating, we obtain
$$\ln f_X(X|\theta) = \int a(\theta)\,d\theta\; d(X) + \int b(\theta)\,d\theta + j(X) = \pi(\theta)d(X) + B(\theta) + j(X).$$
Note that the constant of integration is a function of $X$. Now exponentiate both sides of this equation:
$$f_X(X|\theta) = c(\theta)h(X)\exp(\pi(\theta)d(X)). \qquad (14.17)$$
Here $c(\theta) = \exp B(\theta)$ and $h(X) = \exp j(X)$.
We shall call density functions satisfying equation (14.17) an exponential family with natural parameter $\pi(\theta)$. Thus, if we have independent random variables $X_1, X_2, \ldots, X_n$, then the joint density is the product of the densities, namely,
$$f(X|\theta) = c(\theta)^n h(X_1)\cdots h(X_n)\exp(\pi(\theta)(d(X_1) + \cdots + d(X_n))). \qquad (14.18)$$
In addition, as a consequence of this linear relation in (14.18),
$$d(X) = \frac{1}{n}(d(X_1) + \cdots + d(X_n))$$
is an efficient estimator for $h(\theta)$.
Example 14.20 (Poisson random variables).
$$f(x|\lambda) = \frac{\lambda^x}{x!}e^{-\lambda} = e^{-\lambda}\frac{1}{x!}\exp(x\ln\lambda).$$
Thus, Poisson random variables are an exponential family with $c(\lambda) = \exp(-\lambda)$, $h(x) = 1/x!$, and natural parameter $\pi(\lambda) = \ln\lambda$. Because
$$\lambda = E_\lambda X,$$
$\bar{X}$ is an unbiased estimator of the parameter $\lambda$.
The score function is
$$\frac{\partial}{\partial\lambda}\ln f(x|\lambda) = \frac{\partial}{\partial\lambda}(x\ln\lambda - \ln x! - \lambda) = \frac{x}{\lambda} - 1.$$
The Fisher information for one observation is
$$I(\lambda) = E_\lambda\!\left[\left(\frac{X}{\lambda} - 1\right)^2\right] = \frac{1}{\lambda^2}E_\lambda[(X-\lambda)^2] = \frac{1}{\lambda}.$$
Thus, $I_n(\lambda) = n/\lambda$ is the Fisher information for $n$ observations. In addition,
$$\mathrm{Var}_\lambda(\bar{X}) = \frac{\lambda}{n}$$
and $d(x) = \bar{x}$ has efficiency
$$\frac{\mathrm{Var}(\bar{X})}{1/I_n(\lambda)} = 1.$$
This could have been predicted. The density of $n$ independent observations is
$$f(x|\lambda) = \frac{e^{-\lambda}}{x_1!}\lambda^{x_1}\cdots\frac{e^{-\lambda}}{x_n!}\lambda^{x_n} = \frac{e^{-n\lambda}\lambda^{x_1+\cdots+x_n}}{x_1!\cdots x_n!} = \frac{e^{-n\lambda}\lambda^{n\bar{x}}}{x_1!\cdots x_n!}$$
and so the score function
$$\frac{\partial}{\partial\lambda}\ln f(x|\lambda) = \frac{\partial}{\partial\lambda}(-n\lambda + n\bar{x}\ln\lambda) = -n + \frac{n\bar{x}}{\lambda},$$
showing that the estimate $\bar{x}$ and the score function are linearly related.
Exercise 14.21. Show that a Bernoulli random variable with parameter $p$ is an exponential family.
Exercise 14.22. Show that a normal random variable with known variance $\sigma_0^2$ and unknown mean $\mu$ is an exponential family.
14.7 Answers to Selected Exercises
14.4. Repeat the simulation, replacing mean(x) by 8.
> ssx<-rep(0,1000)
> for (i in 1:1000){x<-rbinom(10,16,0.5);ssx[i]<-sum((x-8)^2)}
> mean(ssx)/10;mean(ssx)/9
[1] 3.9918
[1] 4.435333
Note that division by 10 gives an answer very close to the correct value of 4. To verify that the estimator is unbiased, we write
$$E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n}E[(X_i - \mu)^2] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n}\sum_{i=1}^{n}\sigma^2 = \sigma^2.$$
14.7. For a Bernoulli trial note that $X_i^2 = X_i$. Expand the square to obtain
$$\sum_{i=1}^{n}(X_i - \hat{p})^2 = \sum_{i=1}^{n}X_i^2 - 2\hat{p}\sum_{i=1}^{n}X_i + n\hat{p}^2 = n\hat{p} - 2n\hat{p}^2 + n\hat{p}^2 = n(\hat{p} - \hat{p}^2) = n\hat{p}(1-\hat{p}).$$
Divide by $n$ to obtain the result.
14.8. Recall that $ES_u^2 = \sigma^2$. Check the second derivative to see that $g(t) = \sqrt{t}$ is concave down for all $t$. For concave down functions, the direction of the inequality in Jensen's inequality is reversed. Setting $t = S_u^2$, we have that
$$ES_u = Eg(S_u^2) \le g(ES_u^2) = g(\sigma^2) = \sigma,$$
and $S_u$ is a downwardly biased estimator of $\sigma$.
14.9. Set $g(p) = p^2$. Then, $g''(p) = 2$. Recall that the variance of a Bernoulli random variable is $\sigma^2 = p(1-p)$ and the bias
$$b_g(p) \approx \frac{1}{2}g''(p)\frac{\sigma^2}{n} = \frac{1}{2}\cdot 2\cdot\frac{p(1-p)}{n} = \frac{p(1-p)}{n}.$$
14.14. $\mathrm{Cov}(Y, Z) = EYZ - EY\cdot EZ = EYZ$ whenever $EZ = 0$.
14.16. For independent normal random variables with known variance $\sigma_0^2$ and unknown mean $\mu$, the density is
$$f(x|\mu) = \frac{1}{\sigma_0\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma_0^2}\right),$$
$$\ln f(x|\mu) = -\ln(\sigma_0\sqrt{2\pi}) - \frac{(x-\mu)^2}{2\sigma_0^2}.$$
Thus, the score function is
$$\frac{\partial}{\partial\mu}\ln f(x|\mu) = \frac{1}{\sigma_0^2}(x - \mu)$$
and the Fisher information associated to a single observation is
$$I(\mu) = E\!\left[\left(\frac{\partial}{\partial\mu}\ln f(X|\mu)\right)^2\right] = \frac{1}{\sigma_0^4}E[(X-\mu)^2] = \frac{1}{\sigma_0^4}\mathrm{Var}(X) = \frac{1}{\sigma_0^2}.$$
Again, the information is the reciprocal of the variance. Thus, by the Cramér–Rao lower bound, any unbiased estimator based on $n$ observations must have variance at least $\sigma_0^2/n$. However, if we take $d(x) = \bar{x}$, then
$$\mathrm{Var}_\mu\, d(X) = \frac{\sigma_0^2}{n},$$
and $\bar{x}$ is a uniformly minimum variance unbiased estimator.
14.17. First, we take two derivatives of $\ln f(x|\theta)$:
$$\frac{\partial \ln f(x|\theta)}{\partial\theta} = \frac{\partial f(x|\theta)/\partial\theta}{f(x|\theta)} \qquad (14.19)$$
and
$$\frac{\partial^2 \ln f(x|\theta)}{\partial\theta^2} = \frac{\partial^2 f(x|\theta)/\partial\theta^2}{f(x|\theta)} - \frac{(\partial f(x|\theta)/\partial\theta)^2}{f(x|\theta)^2} = \frac{\partial^2 f(x|\theta)/\partial\theta^2}{f(x|\theta)} - \left(\frac{\partial f(x|\theta)/\partial\theta}{f(x|\theta)}\right)^2 = \frac{\partial^2 f(x|\theta)/\partial\theta^2}{f(x|\theta)} - \left(\frac{\partial \ln f(x|\theta)}{\partial\theta}\right)^2$$
upon substitution from identity (14.19). Thus, the expected values satisfy
$$E_\theta\!\left[\frac{\partial^2 \ln f(X|\theta)}{\partial\theta^2}\right] = E_\theta\!\left[\frac{\partial^2 f(X|\theta)/\partial\theta^2}{f(X|\theta)}\right] - E_\theta\!\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right].$$
Consequently, the exercise is complete if we show that $E_\theta\!\left[\frac{\partial^2 f(X|\theta)/\partial\theta^2}{f(X|\theta)}\right] = 0$. However, for a continuous random variable,
$$E_\theta\!\left[\frac{\partial^2 f(X|\theta)/\partial\theta^2}{f(X|\theta)}\right] = \int \frac{\partial^2 f(x|\theta)/\partial\theta^2}{f(x|\theta)}\, f(x|\theta)\, dx = \int \frac{\partial^2 f(x|\theta)}{\partial\theta^2}\, dx = \frac{\partial^2}{\partial\theta^2}\int f(x|\theta)\, dx = \frac{\partial^2}{\partial\theta^2}\,1 = 0.$$
Note that the computation requires that we be able to pass two derivatives with respect to $\theta$ through the integral sign.
14.21. The Bernoulli density
$$f(x|p) = p^x(1-p)^{1-x} = (1-p)\left(\frac{p}{1-p}\right)^x = (1-p)\exp\!\left(x\ln\!\left(\frac{p}{1-p}\right)\right).$$
Thus, $c(p) = 1-p$, $h(x) = 1$ and the natural parameter is $\pi(p) = \ln\!\left(\frac{p}{1-p}\right)$, the log-odds.
14.22. The normal density
$$f(x|\mu) = \frac{1}{\sigma_0\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma_0^2}\right) = \frac{1}{\sigma_0\sqrt{2\pi}}\,e^{-\mu^2/2\sigma_0^2}\,e^{-x^2/2\sigma_0^2}\exp\!\left(\frac{x\mu}{\sigma_0^2}\right).$$
Thus, $c(\mu) = \frac{1}{\sigma_0\sqrt{2\pi}}e^{-\mu^2/2\sigma_0^2}$, $h(x) = e^{-x^2/2\sigma_0^2}$ and the natural parameter is $\pi(\mu) = \mu/\sigma_0^2$.