4  Properties of Estimators

Now that we’ve developed the tools for deriving estimators of unknown parameters, we can start thinking about different metrics for determining how “good” our estimators actually are. In general, we would like our estimators to have properties like unbiasedness, low variance (efficiency), and sufficiency, which are the focus of this chapter.

The Bias-Variance Trade-off

If you are familiar with machine learning techniques or models for prediction purposes more generally (as opposed to inference), you may have stumbled upon the phrase “bias-variance trade-off.” In scenarios where we want to make good predictions for new observations using a statistical model, one way to measure how “well” our model is predicting new observations is through minimizing mean squared error. Intuitively, this is something we should want to minimize: “errors” (the difference between a predicted value and an observed value) are bad, we square them because the direction of the error (above or below) shouldn’t matter too much, and average over them because we need a summary measure of all our errors combined, and an average seems reasonable. In statistical terms, mean squared error has a very specific definition (see below) as the expected value of what is sometimes called a loss function (where in this case, loss is defined as squared error loss). We’ll return to this in the decision theory chapter of our course notes.

It just so happens that we can decompose mean squared error into a sum of two terms: the variance of our estimator plus the bias of our estimator (squared). What this means for us is that two estimators may have the exact same MSE but very different variances and biases. In general, if we hold MSE constant and imagine increasing the variance of our estimator, the bias would need to decrease accordingly to maintain the same MSE. This is where the “trade-off” comes from. MSE is a very commonly used metric for assessing prediction models, but as we will see, it doesn’t necessarily paint a full picture of how “good” an estimator is. Smaller MSE does not automatically imply “better estimator,” just as smaller bias (in some cases) does not automatically imply “better estimator.”
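To make the decomposition concrete, here is a minimal simulation sketch (mine, not part of the original notes) that estimates the MSE, variance, and bias of the “divide by \(n\)” variance estimator for Normal data; the two sides of the decomposition agree up to rounding.

```python
import numpy as np

# A minimal simulation sketch (illustrative only): check that
# MSE(theta_hat) = Var(theta_hat) + Bias(theta_hat)^2 for the
# "divide by n" variance estimator applied to Normal(0, sigma2) samples.
rng = np.random.default_rng(2024)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
est = samples.var(axis=1, ddof=0)        # MLE of sigma^2 (divides by n)

mse  = np.mean((est - sigma2) ** 2)      # E[(theta_hat - theta)^2]
bias = np.mean(est) - sigma2             # E[theta_hat] - theta
var  = np.var(est)                       # Var(theta_hat)

print(f"MSE          : {mse:.4f}")
print(f"Var + Bias^2 : {var + bias**2:.4f}")   # matches the MSE above
```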

Sufficiency

Another property we like to have in an estimator (sometimes) is called sufficiency. I like to think about sufficiency in terms of minimizing the amount of information we need to retain in order to get a “complete picture” of what’s going on. Suppose, for example, someone is allergic to tomatoes. Rather than listing every food that contains tomatoes and saying that they’re allergic to each of them individually, they could just say that they’re allergic to tomatoes and call it a day. Stating “tomatoes” is sufficient information in this case for us to get the whole picture of their allergies!

A similar concept applies to estimators. Recall from the MLE chapter of the notes that the MLE of a population proportion (for Bernoulli data) is the sample proportion, \(\bar{X}\). If I want someone to be able to obtain this MLE, I then have a few options. I could give them:

  • Every observation I know, \(x_1, \dots,x_n\)
  • Just one number, the sample mean, \(\frac{1}{n}\sum_{i = 1}^n x_i\)
  • All my observations plus some extra information, just for fun!

It should hopefully be obvious that you don’t need extra information for fun, but we also don’t need to know the value of each individual observation. The sample mean is sufficient! Formal definitions and a relevant theorem to follow.
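Before those formalities, here is a tiny numerical sketch of my own showing that, for Bernoulli data, the likelihood depends on the observations only through their sum, so the sample mean (equivalently, the sum) really does carry all of the relevant information about \(p\).

```python
import numpy as np

# Illustrative sketch: the Bernoulli likelihood
#   L(p | x) = p^sum(x) * (1 - p)^(n - sum(x))
# depends on the data only through sum(x), so two samples with the same
# total produce identical likelihood functions for p.
def bernoulli_likelihood(x, p):
    x = np.asarray(x)
    return p ** x.sum() * (1 - p) ** (len(x) - x.sum())

x1 = np.array([1, 0, 1, 1, 0])   # sum = 3
x2 = np.array([0, 1, 1, 0, 1])   # sum = 3, different ordering

p_grid = np.linspace(0.05, 0.95, 5)
print(bernoulli_likelihood(x1, p_grid))
print(bernoulli_likelihood(x2, p_grid))   # identical to the line above
```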

4.1 Learning Objectives

By the end of this chapter, you should be able to…

  • Calculate bias and variance of various estimators for unknown parameters

  • Explain the distinction between bias and variance colloquially in terms of precision and accuracy, and why these properties are important

  • Compare estimators in terms of their relative efficiency

  • Justify why there exists a bias-variance trade-off, and explain what consequences this may have when comparing estimators

4.2 Concept Questions

  1. Intuitively, what is the difference between bias and precision?

  2. What are the typical steps to checking if an estimator is unbiased?

  3. How can we construct unbiased estimators?

  4. If an estimator is unbiased, is it also asymptotically unbiased? If an estimator is asymptotically unbiased, is it necessarily unbiased?

  5. When comparing estimators, how can we determine which estimator is more efficient?

  6. Why might we care about sufficiency, particularly when thinking about the variance of unbiased estimators?

  7. Describe, in your own words, what the Cramér-Rao inequality tells us.

  8. What is the difference between a UMVUE and an efficient estimator? Does one imply the other?

4.3 Definitions

You are expected to know the following definitions:

Unbiased

An estimator \(\hat{\theta} = g(X_1, \dots, X_n)\) is an unbiased estimator for \(\theta\) if \(E[\hat{\theta}] = \theta\), for all \(\theta\).

Asymptotically Unbiased

An estimator \(\hat{\theta} = g(X_1, \dots, X_n)\) is an asymptotically unbiased estimator for \(\theta\) if \(\underset{n \to \infty}{\text{lim}} E[\hat{\theta}] = \theta\).

Precision

The precision of a random variable \(X\) is given by \(\frac{1}{Var(X)}\).

Mean Squared Error (MSE)

The mean squared error of an estimator is given by

\[ MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + Bias(\hat{\theta})^2 \]
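One way to see the second equality (a standard derivation, included here for completeness) is to add and subtract \(E[\hat{\theta}]\) inside the square, where \(Bias(\hat{\theta}) = E[\hat{\theta}] - \theta\):

\[\begin{align*} E[(\hat{\theta} - \theta)^2] & = E\left[ \left( (\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta) \right)^2 \right] \\ & = E\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right] + 2(E[\hat{\theta}] - \theta)\underbrace{E\left[ \hat{\theta} - E[\hat{\theta}] \right]}_{= \, 0} + (E[\hat{\theta}] - \theta)^2 \\ & = Var(\hat{\theta}) + Bias(\hat{\theta})^2 \end{align*}\]

where the cross term vanishes because \(E[\hat{\theta}] - \theta\) is a constant and \(E[\hat{\theta} - E[\hat{\theta}]] = 0\).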

Sufficient

For some function \(T\), \(T(X)\) is a sufficient statistic for an unknown parameter \(\theta\) if the conditional distribution of \(X\) given \(T(X)\) does not depend on \(\theta\). A “looser” definition is that the distribution of \(X\) must depend on \(\theta\) only through \(T(X)\).

Minimal Sufficiency

For some function \(T^*\), \(T^*(X)\) is a minimal sufficient statistic for an unknown parameter \(\theta\) if \(T^*(X)\) is sufficient and, for every other sufficient statistic \(T(X)\), \(T^*(X) = f(T(X))\) for some function \(f\).

Complete

A statistic \(T(X)\) is complete for an unknown parameter \(\theta\) if \[ E[g(T(X))] \text{ is } \theta-\text{free} \implies g(T(X)) \text{ is constant, almost everywhere} \] for a nice function \(g\).

Importantly, it is equivalent to say that \(T(X)\) is complete for an unknown parameter \(\theta\) if

\[ E[g(T(X))] = 0 \text{ for all } \theta \implies g(T(X)) = 0 \quad\text{ almost everywhere} \]

Relative Efficiency

The relative efficiency of an estimator \(\hat{\theta}_1\) with respect to an estimator \(\hat{\theta}_2\) is the ratio \(Var(\hat{\theta}_2)/Var(\hat{\theta}_1)\).

Uniformly Minimum-Variance Unbiased Estimator (UMVUE)

An unbiased estimator \(\hat{\theta}^*\) is the UMVUE if, for all estimators \(\hat{\theta}\) in the class of unbiased estimators \(\Theta\),

\[ Var(\hat{\theta}^*) \leq Var(\hat{\theta}) \]

Score

The score is defined as the first partial derivative with respect to \(\theta\) of the log-likelihood function, given by

\[ \frac{\partial}{\partial \theta} \log L(\theta \mid x) \]

Information Matrix

The information matrix* \(I(\theta)\) for a collection of iid random variables \(X_1, \dots, X_n\) is the variance of the score, given by

\[ I(\theta) = E \left[ \left( \frac{\partial}{\partial \theta} \log L(\theta \mid x) \right)^2\right] = -E\left[ \frac{\partial^2}{\partial \theta^2} \log L(\theta \mid x)\right] \]

Note that the above formula is in fact the variance of the score, since we can show that the expectation of the score is 0 (under some regularity conditions). This is shown as part of the proof of the C-R lower bound in the Theorems section of this chapter.

The information matrix is sometimes written in terms of a pdf for a single random variable as opposed to a likelihood (this is what our textbook does, for example). In this case, we have \(I(\theta) = n I_1(\theta)\), where the \(I_1(\theta)\) on the right-hand side is defined as \(E \left[ \left( \frac{\partial}{\partial \theta} \log f_X(x \mid \theta) \right)^2\right]\). Sometimes (as in the textbook) \(I_1(\theta)\) is written without the subscript \(1\), which is a slight abuse of notation that is endlessly confusing (to me, at least). For this set of course notes, we’ll always specify the information matrix in terms of a pdf for a single random variable with the subscript \(1\), for clarity.
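As a quick example of this notation (my own illustration, not from the textbook): for a single \(X \sim Bernoulli(p)\), with \(f_X(x \mid p) = p^x(1-p)^{1-x}\),

\[\begin{align*} \log f_X(x \mid p) & = x \log p + (1 - x)\log(1-p) \\ \frac{\partial^2}{\partial p^2} \log f_X(x \mid p) & = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \\ I_1(p) & = -E\left[ \frac{\partial^2}{\partial p^2} \log f_X(X \mid p) \right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)} \end{align*}\]

so for an iid sample of size \(n\), \(I(p) = n I_1(p) = \frac{n}{p(1-p)}\).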

*The information matrix is often referred to as the Fisher Information matrix, as it was developed by Sir Ronald Fisher. Fisher developed much of the core, statistical theory that we use today. He was also the founding chairman of the University of Cambridge Eugenics Society, and contributed to a large body of scientific work and public policy that promoted racist and classist ideals.

4.4 Theorems

Covariance Inequality (based on the Cauchy-Schwarz inequality)

Let \(X\) and \(Y\) be random variables. Then,

\[ Var(X) \geq \frac{Cov(X, Y)^2}{Var(Y)} \]

The proof is quite clear on Wikipedia.

The Factorization Criterion for sufficiency

Consider a pdf for a random variable \(X\) that depends on an unknown parameter \(\theta\), given by \(\pi(x \mid \theta)\). The statistic \(T(x)\) is sufficient for \(\theta\) if and only if \(\pi(x \mid \theta)\) factors as:

\[ \pi(x \mid \theta) = g(T(x) \mid \theta) h(x) \] where \(g(T(x) \mid \theta)\) depends on \(x\) only through \(T(x)\), and \(h(x)\) does not depend on \(\theta\).

Note that in the statistics literature this criterion is sometimes referred to as the Fisher-Neyman Factorization Criterion.

Two proofs available on Wikipedia. The one for the discrete-only case is more intuitive, if you’d like to look through one of them.
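As an example of the criterion in action (my own illustration): suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\), the model used in the Worked Examples below. The joint pdf factors as

\[ \pi(x \mid \theta) = \prod_{i = 1}^n \frac{1}{\theta} e^{-x_i/\theta} = \underbrace{\theta^{-n} e^{-\sum_{i = 1}^n x_i/\theta}}_{g(T(x) \mid \theta)} \cdot \underbrace{1}_{h(x)} \]

with \(T(x) = \sum_{i = 1}^n x_i\), and so the sum of the observations (equivalently, the sample mean) is sufficient for \(\theta\).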

Lehmann-Scheffé Theorem

Suppose that a random variable \(X\) has pdf given by \(f(x \mid \theta)\), and that \(T^*(X)\) is such that for every* pair of points \((x,y)\), the ratio of pdfs

\[ \frac{f(y \mid \theta)}{f(x \mid \theta)} \] does not depend on \(\theta\) if and only if \(T^*(x) = T^*(y)\). Then \(T^*(X)\) is a minimal sufficient statistic for \(\theta\).

*every pair of points \((x, y)\) in the support of \(X\).

Proof.

We’ll utilize something called a likelihood ratio (literally a ratio of likelihoods) to prove this theorem. We’ll also come back to likelihood ratios later in the Hypothesis Testing chapter!

Let \(\theta_1\) and \(\theta_2\) be two possible values of our unknown parameter \(\theta\). Then a likelihood ratio comparing densities evaluated at these two values is defined as

\[ L_{\theta_1, \theta_2}(x) \equiv \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} \] Our proof will proceed as follows:

  1. We’ll show that if \(T(X)\) is sufficient, then \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\).

  2. We’ll show the converse: If \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\), then \(T(X)\) is sufficient. This combined with (1) will show that \(L_{\theta_1, \theta_2}(X)\) is a minimal sufficient statistic.

  3. We’ll use the above two statements to prove the theorem!

First, suppose that \(T(X)\) is sufficient for \(\theta\). Then, by definition we can write

\[ L_{\theta_1, \theta_2}(x) = \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} = \frac{g(T(x) \mid \theta_2)h(x)}{g(T(x) \mid \theta_1)h(x)} = \frac{g(T(x) \mid \theta_2)}{g(T(x) \mid \theta_1)} \] and so \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\).

Second, assume WLOG that \(\theta_1\) is fixed, and denote our unknown parameter \(\theta_2 = \theta\). We can rearrange our likelihood ratio as

\[\begin{align*} L_{\theta_1, \theta}(x) & = \frac{f(x \mid \theta)}{f(x \mid \theta_1)} \\ f(x \mid \theta) & = L_{\theta_1, \theta}(x) f(x \mid \theta_1) \end{align*}\]

and note that \(L_{\theta_1, \theta}(x)\) is a function of \(T(X)\) by assumption, and \(f(x \mid \theta_1)\) is a function of \(x\) that does not depend on our unknown parameter \(\theta\). Then \(T(X)\) satisfies the factorization criterion, and is therefore sufficient.

Let \(T^{**}(X) \equiv L_{\theta_1, \theta_2}(X)\). Then the first two statements we have shown give us that

\[ T(X) \text{ is sufficient } \iff T^{**}(X) \text{ is a function of } T(X) \]

Taking \(T(X) = T^{**}(X)\) in the reverse direction shows that \(T^{**}(X)\) is itself sufficient (it is trivially a function of itself), while the forward direction shows that \(T^{**}(X)\) is a function of every other sufficient statistic. Therefore \(T^{**}(X)\) is a minimal sufficient statistic, by definition.

We’ll now (officially) prove our theorem. By hypothesis of the theorem,

\[\begin{align*} T^*(x) = T^*(y) & \iff \frac{f(y \mid \theta)}{f(x \mid \theta)} \text{ is } \theta-free \\ & \iff \frac{f(y \mid \theta_1)}{f(x \mid \theta_1)} = \frac{f(y \mid \theta_2)}{f(x \mid \theta_2)} \quad \forall \theta_1, \theta_2 \\ & \iff \frac{f(y \mid \theta_2)}{f(y \mid \theta_1)} = \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} \quad \forall \theta_1, \theta_2 \\ & \iff L_{\theta_1, \theta_2}(y) = L_{\theta_1, \theta_2} (x) \quad \forall \theta_1, \theta_2 \\ & \iff T^{**}(y) = T^{**}(x) \end{align*}\]

Therefore \(T^*(X)\) and \(T^{**}(X)\) are equivalent. Since \(T^{**}(X)\) is a minimal sufficient statistic, \(T^*(X)\) is therefore also minimal sufficient.
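To see the theorem in action (my own illustration, using the exponential model that appears in the Worked Examples below): if \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\) with joint pdf \(f(x \mid \theta) = \theta^{-n} e^{-\sum_{i=1}^n x_i/\theta}\), then

\[ \frac{f(y \mid \theta)}{f(x \mid \theta)} = \frac{\theta^{-n} e^{-\sum_{i=1}^n y_i/\theta}}{\theta^{-n} e^{-\sum_{i=1}^n x_i/\theta}} = e^{-\left( \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \right)/\theta} \]

which is \(\theta\)-free if and only if \(\sum_{i=1}^n y_i = \sum_{i=1}^n x_i\). By the theorem, \(T^*(X) = \sum_{i=1}^n X_i\) (equivalently, \(\bar{X}\)) is minimal sufficient for \(\theta\).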

Complete, Sufficient, Minimal

If \(T(X)\) is complete and sufficient, then \(T(X)\) is minimal sufficient.

Proof.

Just kidding! Prove it on your own and show it to me, if you want bonus points in my heart :)

Rao-Blackwell-Lehmann-Scheffé (RBLS)

Let \(T(X)\) be a complete and sufficient statistic for unknown parameter \(\theta\), and let \(\tau(\theta)\) be some function of \(\theta\). If there exists at least one unbiased estimator \(\tilde{\tau}(X)\) for \(\tau(\theta)\), then there exists a unique UMVUE \(\hat{\tau}(T(X))\) for \(\tau(\theta)\) given by

\[ \hat{\tau}(T(X)) = E[\tilde{\tau}(X) \mid T(X)] \]

Why do we care? An important consequence of the RBLS Theorem is that if \(T(X)\) is a complete and sufficient statistic for \(\theta\), then any function \(\phi(T(X))\) is the UMVUE of its expectation \(E[\phi(T(X))]\) (so long as the expectation is finite for all \(\theta\)). This Theorem is therefore a very convenient way to find UMVUEs: (1) Find a complete and sufficient statistic for an unknown parameter, and (2) functions of that statistic are then the UMVUE for their expectation!

Proof.

To prove RBLS, we first must prove an Improvement Lemma and a Uniqueness Lemma.

Improvement Lemma. Suppose that \(T(X)\) is a sufficient statistic for \(\theta\). If \(\tilde{\tau}(X)\) is an unbiased estimator of \(\tau(\theta)\), then \(E[\tilde{\tau}(X) \mid T(X)]\) does not depend on \(\theta\) (by sufficiency) and is also an unbiased estimator of \(\tau(\theta)\), which (importantly) has variance no greater than that of \(\tilde{\tau}(X)\).

Proof of Lemma. First, note that \(E[\tilde{\tau}(X) \mid T(X)]\) is an unbiased estimator for \(\tau(\theta)\), since \[\begin{align*} E[E[\tilde{\tau}(X) \mid T(X)]] & = E[\tilde{\tau}(X)] \quad \quad \text{(Law of Iterated Expectation)} \\ & = \tau(\theta) \quad \quad (\tilde{\tau}(X) \text{ is unbiased}) \end{align*}\] Then, by the Law of Total Variance, \[\begin{align*} Var(\tilde{\tau}(X)) & = E[Var(\tilde{\tau}(X) \mid T(X))] + Var(E[\tilde{\tau}(X) \mid T(X)]) \\ & \geq Var(E[\tilde{\tau}(X) \mid T(X)]) \end{align*}\] and we’re done! \(E[\tilde{\tau}(X) \mid T(X)]\) has variance no greater than that of \(\tilde{\tau}(X)\). Since both are unbiased, this is considered an “improvement” (hence the name of the Lemma).

Uniqueness Lemma. If \(T(X)\) is complete, then for an unknown parameter \(\theta\) and a function \(\tau(\theta)\) of it, \(\tau(\theta)\) has at most one unbiased estimator \(\hat{\tau}(T(X))\) that is a function of \(T(X)\).

Proof of Lemma. Suppose, toward contradiction, that \(\tau(\theta)\) has more than one unbiased estimator that depends on \(T(X)\), given by \(\tilde{\tau}(T(X))\) and \(\hat{\tau}(T(X))\), \(\tilde{\tau}(T(X)) \neq \hat{\tau}(T(X))\). Then

\[ E[\tilde{\tau}(T(X)) - \hat{\tau}(T(X))] = \tau(\theta) - \tau(\theta) = 0 \quad \forall \theta \] Let \(g(T(X)) = \tilde{\tau}(T(X)) - \hat{\tau}(T(X))\). Since \(T(X)\) is complete, and \(E[g(T(X))] = 0\), this implies \(\tilde{\tau}(T(X)) - \hat{\tau}(T(X)) = 0\), which means \(\tilde{\tau}(T(X)) = \hat{\tau}(T(X))\). Contradiction.

Back to the proof of RBLS.

We’ve shown previously that \(\hat{\tau}(T(X))\) is an unbiased estimator for \(\tau(\theta)\) (law of iterated expectation). Let \(\tau_1(X)\) be any other unbiased estimator for \(\tau(\theta)\), and let \(\tau_2(T(X)) = E[\tau_1(X) \mid T(X)]\). Then \(\tau_2(T(X))\) is also unbiased for \(\tau(\theta)\) (again, iterated expectation), and by the Uniqueness Lemma (since \(T\) is complete by supposition), \(\hat{\tau}(T(X)) = \tau_2(T(X))\). But,

\[\begin{align*} Var(\hat{\tau}(T(X))) & = Var(\tau_2(T(X))) \quad \quad (\hat{\tau} = \tau_2) \\ & \leq Var(\tau_1(X)) \quad \quad \text{(Improvement Lemma)} \end{align*}\] so \(\hat{\tau}(T(X))\) is the UMVUE for \(\tau(\theta)\), as desired.
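As an example of the “convenient way to find UMVUEs” described above (my own illustration, taking as given the standard exponential-family fact, not proven in these notes, that \(T(X) = \sum_{i = 1}^n X_i\) is complete and sufficient here): suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Then \(\bar{X} = T(X)/n\) is a function of the complete and sufficient statistic with \(E[\bar{X}] = \theta\) (see Problem 1 below), and so

\[ E[\bar{X} \mid T(X)] = \bar{X} \]

is the unique UMVUE for \(\theta\).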

Cramér-Rao Lower Bound

Let \(f_Y(y \mid \theta)\) be a pdf with nice* conditions, and let \(Y_1, \dots, Y_n\) be a random sample from \(f_Y(y \mid \theta)\). Let \(\hat{\theta}\) be any unbiased estimator of \(\theta\). Then

\[\begin{align*} Var(\hat{\theta}) & \geq \left\{ E\left[ \left( \frac{\partial \log( L(\theta \mid y))}{\partial \theta}\right)^2\right]\right\}^{-1} \\ & = -\left\{ E\left[ \frac{\partial^2 \log( L(\theta \mid y))}{\partial \theta^2} \right] \right\}^{-1} \\ & = \frac{1}{I(\theta)} \end{align*}\]

*our nice conditions that we need are that \(f_Y(y \mid \theta)\) has continuous first- and second-order derivatives, which we would quickly discover we need by looking at the form of the C-R lower bound, and that the set of values \(y\) where \(f_Y(y \mid \theta) \neq 0\) does not depend on \(\theta\). If you are familiar with the concept of the “support” of a function, that is where this second condition comes from. The key here is that this condition allows us to interchange derivatives and integrals, in particular, \(\frac{\partial}{\partial \theta} \int f(x) dx = \int \frac{\partial}{\partial \theta} f(x)dx\), which we’ll need to complete the proof.

Proof.

Let \(X = \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta}\). By the Covariance Inequality,

\[ Var(\hat{\theta}) \geq \frac{Cov(\hat{\theta},X)^2}{Var(X)} \]

and so if we can show

\[\begin{align*} \frac{Cov(\hat{\theta},X)^2}{Var(X)} & = \left\{ E\left[ \left( \frac{\partial \log( L(\theta \mid \textbf{y}))}{\partial \theta}\right)^2\right]\right\}^{-1} \\ & = \frac{1}{I(\theta)} \end{align*}\]

then we’re done, as this is the C-R lower bound. Note first that

\[\begin{align*} E[X] & = \int x f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \left( \frac{\partial \log f_Y(\textbf{y} \mid \theta)}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \frac{\frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta)}{ f_Y(\textbf{y} \mid \theta)} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \frac{\partial}{\partial \theta} f_Y (\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} \int f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} 1 \\ & = 0 \end{align*}\]

This means that

\[\begin{align*} Var[X] & = E[X^2] - E[X]^2 \\ & = E[X^2] \\ & = E \left[ \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right)^2\right ] \end{align*}\]

and

\[\begin{align*} Cov(\hat{\theta}, X) & = E[\hat{\theta} X] - E[\hat{\theta}] E[X] \\ & = E[\hat{\theta}X] \\ & = \int \hat{\theta} x f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \left( \frac{\partial \log f_Y(\textbf{y} \mid \theta)}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \frac{\frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta)}{ f_Y(\textbf{y} \mid \theta)} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} \int \hat{\theta} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} E[\hat{\theta}] \\ & = \frac{\partial}{\partial \theta} \theta \\ & = 1 \end{align*}\]

where \(E[\hat{\theta}] = \theta\) since our estimator is unbiased. Putting this all together, we have

\[\begin{align*} Var[\hat{\theta}] & \geq \frac{Cov(\hat{\theta},X)^2}{Var(X)} \\ & = \frac{1^2}{E \left[ \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right)^2\right ]} \\ & = \frac{1}{I(\theta)} \end{align*}\]

as desired.

Comment: Note that what the Cramér-Rao lower bound tells us is that, if the variance of an unbiased estimator is equal to the Cramér-Rao lower bound, then that estimator has the minimum possible variance among all unbiased estimators there could possibly be. This gives us one way to prove that an unbiased estimator is the UMVUE: if an unbiased estimator’s variance achieves the C-R lower bound, then it is optimal according to the UMVUE criterion. (The converse need not hold, however; a UMVUE does not necessarily attain the bound.)
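As a quick illustration (my own, tying this bound to the Worked Examples below): for \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\),

\[\begin{align*} \log f_X(x \mid \theta) & = -\log\theta - \frac{x}{\theta} \\ I_1(\theta) & = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X \mid \theta) \right] = -E\left[ \frac{1}{\theta^2} - \frac{2X}{\theta^3} \right] = -\frac{1}{\theta^2} + \frac{2\theta}{\theta^3} = \frac{1}{\theta^2} \end{align*}\]

so the C-R lower bound is \(\frac{1}{I(\theta)} = \frac{1}{n I_1(\theta)} = \frac{\theta^2}{n}\). Since \(\bar{X}\) is unbiased for \(\theta\) with \(Var(\bar{X}) = \theta^2/n\) (Problems 1 and 3 below), \(\bar{X}\) attains the bound and is therefore the UMVUE for \(\theta\).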

4.5 Worked Examples

Problem 1: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Compute the MLE of \(\theta\), and show that it is an unbiased estimator of \(\theta\).

Solution:

Note that we can write

\[\begin{align*} L(\theta) & = \prod_{i = 1}^n \frac{1}{\theta} e^{-x_i / \theta} \\ \log L(\theta) & = \sum_{i = 1}^n \log(\frac{1}{\theta} e^{-x_i / \theta}) \\ & = \sum_{i = 1}^n \log(\frac{1}{\theta}) - \sum_{i = 1}^n x_i / \theta \\ & = -n \log(\theta) - \frac{1}{\theta} \sum_{i = 1}^n x_i \\ \frac{\partial}{\partial \theta} \log L(\theta) & = \frac{\partial}{\partial \theta} \left( -n \log(\theta) - \frac{1}{\theta} \sum_{i = 1}^n x_i \right) \\ & = -\frac{n}{\theta} + \frac{\sum_{i = 1}^n x_i }{\theta^2} \end{align*}\]

Setting this equal to \(0\) and solving for \(\theta\) we obtain

\[\begin{align*} 0 & \equiv -\frac{n}{\theta} + \frac{\sum_{i = 1}^n x_i }{\theta^2} \\ \frac{n}{\theta} & = \frac{\sum_{i = 1}^n x_i }{\theta^2} \\ n & = \frac{\sum_{i = 1}^n x_i }{\theta} \\ \theta & = \frac{1}{n} \sum_{i = 1}^n x_i \end{align*}\]

and so the MLE for \(\theta\) is the sample mean. To show that the MLE is unbiased, we note that

\[\begin{align*} E \left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] & = \frac{1}{n} \sum_{i = 1}^n E[X_i] = \frac{1}{n} \sum_{i = 1}^n \theta = \theta \end{align*}\]

as desired.

Problem 2: Suppose again that \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Let \(\hat{\theta}_2 = X_1\), and \(\hat{\theta}_3 = nX_{(1)}\), where \(X_{(1)} = \min_i X_i\). Show that \(\hat{\theta}_2\) and \(\hat{\theta}_3\) are unbiased estimators of \(\theta\). Hint: use the fact that \(X_{(1)} \sim Exponential(n/\theta)\).

Solution:

Note that the mean of an \(Exponential(\lambda)\) random variable is given by \(1/\lambda\). Then we can write

\[ E[\hat{\theta}_2] = E[X_1] = \frac{1}{1/\theta} = \theta \]

and

\[ E[\hat{\theta}_3] = E[nX_{(1)}] = \frac{n}{n/\theta} = \theta \] as desired.

Problem 3: Compare the variance of the estimators from Problems 1 and 2. Which is most efficient?

Solution:

Recall that the variance of an \(Exponential(\lambda)\) random variable is given by \(1/\lambda^2\). Let the MLE from Problem 1 be denoted \(\hat{\theta}_1 = \bar{X}\). Then we can write

\[ Var\left[\hat{\theta}_1\right] = Var\left[\frac{1}{n} \sum_{i = 1}^n X_i\right] = \frac{1}{n^2} \sum_{i = 1}^n Var[X_i] = \frac{1}{n^2} \left( \frac{n}{(1/\theta)^2} \right) = \frac{\theta^2}{n} \]

and

\[ Var\left[\hat{\theta}_2\right] = Var[X_1] = \frac{1}{(1/\theta)^2} = \theta^2 \]

and

\[ Var\left[\hat{\theta}_3\right] = Var[nX_{(1)}] = n^2 Var[X_{(1)}] = \frac{n^2}{(n/\theta)^2} = \theta^2 \]

Thus the MLE \(\hat{\theta}_1\) is the most efficient of the three estimators: its variance is \(n\) times smaller than that of both \(\hat{\theta}_2\) and \(\hat{\theta}_3\).
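A quick simulation sketch (my own, not part of the original problem) confirms these calculations empirically; with \(\theta = 2\) and \(n = 10\), we expect all three estimators to have mean near \(2\) and variances near \(\theta^2/n = 0.4\), \(\theta^2 = 4\), and \(\theta^2 = 4\), respectively.

```python
import numpy as np

# Simulation sketch: compare the three unbiased estimators of theta from
# Problems 1-3, where X_i ~ Exponential(1/theta), i.e. mean theta.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 100_000

x = rng.exponential(scale=theta, size=(reps, n))   # scale parameter = mean = theta

theta1 = x.mean(axis=1)          # MLE: the sample mean
theta2 = x[:, 0]                 # the first observation
theta3 = n * x.min(axis=1)       # n times the sample minimum

for name, est in [("theta1 (MLE)", theta1), ("theta2", theta2), ("theta3", theta3)]:
    print(f"{name:12s}  mean = {est.mean():.3f}   var = {est.var():.3f}")
# Expected output: means near 2; variances near 0.4, 4, and 4.
```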

Problem 4: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)\). Show that the estimator \(\hat{\mu} = \frac{1}{n} \sum_{i = 1}^n X_i\) and the estimator \(\hat{\mu}_w = \sum_{i = 1}^n w_i X_i\) are both unbiased estimators of \(\mu\), where \(\sum_{i = 1}^n w_i = 1\).

Solution:

We can write

\[ E[\hat{\mu}] = E\left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] = \frac{1}{n}\sum_{i = 1}^n E[X_i] = \frac{1}{n}\sum_{i = 1}^n \mu = \mu \]

and

\[ E[\hat{\mu}_w] = E \left[ \sum_{i = 1}^n w_i X_i \right] = \sum_{i = 1}^n w_i E \left[ X_i \right] = \sum_{i = 1}^n w_i \mu = \mu \sum_{i = 1}^n w_i = \mu \]

as desired.

Problem 5: Determine whether the estimator \(\hat{\mu}\) or \(\hat{\mu}_w\) from Problem 4 is more efficient, if we additionally impose the constraint \(w_i \geq 0\) \(\forall i\). (Note that this is a more “general” example based on Example 5.4.5 in the course textbook.) (Hint: use the Cauchy-Schwarz inequality.)

Solution:

To determine relative efficiency, we must compute the variance of each estimator. We have

\[ Var[\hat{\mu}] = Var \left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] = \frac{1}{n^2} \sum_{i = 1}^n Var[X_i] = \frac{1}{n^2} \sum_{i = 1}^n \sigma^2 = \sigma^2 / n \]

and

\[\begin{align*} Var[\hat{\mu}_w] & = Var \left[ \sum_{i = 1}^n w_i X_i \right] \\ & = \sum_{i = 1}^n Var[w_i X_i] \\ & = \sum_{i = 1}^n w_i^2 Var[X_i] \\ & = \sum_{i = 1}^n w_i^2 \sigma^2 \\ & = \sigma^2 \sum_{i = 1}^n w_i^2 \end{align*}\]

And so to determine which estimator is more efficient, we need to determine if \(\frac{1}{n}\) is less than \(\sum_{i = 1}^n w_i^2\) (or not). The Cauchy-Schwarz inequality tells us that

\[\begin{align*} \left( \sum_{i = 1}^n w_i \cdot 1\right)^2 & \leq \left( \sum_{i = 1}^n w_i^2 \right) \left( \sum_{i = 1}^n 1^2 \right) \\ \left( \sum_{i = 1}^n w_i \right)^2 & \leq \left( \sum_{i = 1}^n w_i^2 \right) n \\ 1 & \leq \left( \sum_{i = 1}^n w_i^2 \right) n \\ \frac{1}{n} & \leq \sum_{i = 1}^n w_i^2 \end{align*}\]

and therefore, \(\hat{\mu}\) is more efficient than \(\hat{\mu}_w\).

Problem 6: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)\). Show that the MLE for \(\sigma^2\) is biased, and suggest a modified variance estimator for \(\sigma^2\) that is unbiased. (Note that this is example 5.4.4 in our course textbook)

Solution:

Recall that the MLE for \(\sigma^2\) is given by \(\frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\). Then

\[\begin{align*} E\left[ \frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\right] & = \frac{1}{n} \sum_{i = 1}^n E\left[ (X_i - \bar{X})^2\right] \\ & = \frac{1}{n} \sum_{i = 1}^n E\left[ X_i^2 - 2X_i \bar{X} + \bar{X}^2\right] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \frac{1}{n} \sum_{i = 1}^n X_i \bar{X} \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \bar{X} \frac{1}{n} \sum_{i = 1}^n X_i \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \bar{X}^2 \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - E\left[ \bar{X}^2 \right] \end{align*}\]

Recall that since \(X_i \overset{iid}{\sim} N(\mu, \sigma^2)\), \(\bar{X} \sim N(\mu, \sigma^2/n)\), and that we can write \(Var[Y] + E[Y]^2 = E[Y^2]\) (definition of variance). Then we can write

\[\begin{align*} E\left[ \frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2 \right] & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - E\left[ \bar{X}^2 \right] \\ & = \frac{1}{n} \sum_{i = 1}^n \left( \sigma^2 + \mu^2 \right) - \left( \frac{\sigma^2}{n} + \mu^2 \right) \\ & = \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\ & = \sigma^2 - \frac{\sigma^2}{n} \\ & = \sigma^2 \left( 1 - \frac{1}{n} \right) \\ & = \sigma^2 \left( \frac{n-1}{n} \right) \end{align*}\]

Therefore, since \(E[\hat{\sigma}^2_{MLE}] \neq \sigma^2\), the MLE is biased. Note that

\[\begin{align*} E\left[ \left( \frac{n}{n-1} \right)\frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\right] & = \left( \frac{n}{n-1} \right) \left( \frac{n-1}{n} \right) \sigma^2 \\ & = \sigma^2 \end{align*}\]

and so the estimator \(\frac{1}{n-1} \sum_{i = 1}^n (X_i - \bar{X})^2\) is an unbiased estimator for \(\sigma^2\). This estimator is often called the “sample variance”, and is denoted by \(S^2\).
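A short simulation sketch (my own) shows this bias numerically; numpy’s ddof argument switches between the two divisors.

```python
import numpy as np

# Simulation sketch: the MLE of sigma^2 (divide by n) is biased downward by
# the factor (n - 1)/n, while the sample variance S^2 (divide by n - 1) is
# unbiased. With sigma^2 = 9 and n = 5, E[MLE] should be about 7.2.
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 3.0, 9.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mle = x.var(axis=1, ddof=0)    # divides by n
s2  = x.var(axis=1, ddof=1)    # divides by n - 1

print(f"E[MLE] ~ {mle.mean():.3f}   (theory: {(n - 1) / n * sigma2:.3f})")
print(f"E[S^2] ~ {s2.mean():.3f}   (theory: {sigma2:.3f})")
```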