4  Properties of Estimators

Now that we’ve developed the tools for deriving estimators of unknown parameters, we can start thinking about different metrics for determining how “good” our estimators actually are. In general, we would like our estimators to have properties like unbiasedness, low variance (efficiency), and sufficiency, which are the focus of this chapter.

The Bias-Variance Trade-off

If you are familiar with machine learning techniques or models for prediction purposes more generally (as opposed to inference), you may have stumbled upon the phrase “bias-variance trade-off.” In scenarios where we want to make good predictions for new observations using a statistical model, one way to measure how “well” our model is predicting new observations is through minimizing mean squared error. Intuitively, this is something we should want to minimize: “errors” (the difference between a predicted value and an observed value) are bad, we square them because the direction of the error (above or below) shouldn’t matter too much, and average over them because we need a summary measure of all our errors combined, and an average seems reasonable. In statistical terms, mean squared error has a very specific definition (see below) as the expected value of what is sometimes called a loss function (where in this case, loss is defined as squared error loss). We’ll return to this in the decision theory chapter of our course notes.

It just so happens that we can decompose mean squared error into a sum of two terms: the variance of our estimator plus the bias of our estimator (squared). What this means for us is that two estimators may have the exact same MSE but very different variances and biases. In general, if we hold MSE constant and imagine increasing the variance of our estimator, the bias would need to decrease accordingly to maintain the same MSE. This is where the “trade-off” comes from. MSE is a very commonly used metric for assessing prediction models, but as we will see, it doesn’t necessarily paint a full picture of how “good” an estimator is. Smaller MSE does not automatically imply “better estimator,” just as smaller bias (in some cases) does not automatically imply “better estimator.”
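To make the decomposition concrete, here is a minimal simulation sketch (mine, not part of the original notes) that estimates the MSE, variance, and bias of the “divide by \(n\)” variance estimator for Normal data; the two sides of the decomposition agree up to rounding.

```python
import numpy as np

# A minimal simulation sketch (illustrative only): check that
# MSE(theta_hat) = Var(theta_hat) + Bias(theta_hat)^2 for the
# "divide by n" variance estimator applied to Normal(0, sigma2) samples.
rng = np.random.default_rng(2024)
sigma2, n, reps = 4.0, 10, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
est = samples.var(axis=1, ddof=0)        # MLE of sigma^2 (divides by n)

mse  = np.mean((est - sigma2) ** 2)      # E[(theta_hat - theta)^2]
bias = np.mean(est) - sigma2             # E[theta_hat] - theta
var  = np.var(est)                       # Var(theta_hat)

print(f"MSE          : {mse:.4f}")
print(f"Var + Bias^2 : {var + bias**2:.4f}")   # matches the MSE above
```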

Sufficiency

Another property we like to have in an estimator (sometimes) is called sufficiency. I like to think about sufficiency in terms of minimizing the amount of information we need to retain in order to get a “complete picture” of what’s going on. Suppose, for example, someone is allergic to tomatoes. Rather than listing every food that contains tomatoes and saying that they’re allergic to each of them individually, they could just say that they’re allergic to tomatoes and call it a day. Stating “tomatoes” is sufficient information in this case for us to get the whole picture of their allergies!

A similar concept applies to estimators. Recall from the MLE chapter of the notes that the MLE of a population proportion (for Bernoulli data) is the sample proportion, \(\bar{X}\). If I want someone to be able to obtain this MLE, I then have a few options. I could give them:

  • Every observation I know, \(x_1, \dots,x_n\)
  • Just one number, the sample mean, \(\frac{1}{n}\sum_{i = 1}^n x_i\)
  • All my observations plus some extra information, just for fun!

It should hopefully be obvious that you don’t need extra information for fun, but we also don’t need to know the value of each individual observation. The sample mean is sufficient! Formal definitions and a relevant theorem to follow.
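Before those formalities, here is a tiny numerical sketch of my own showing that, for Bernoulli data, the likelihood depends on the observations only through their sum, so the sample mean (equivalently, the sum) really does carry all of the relevant information about \(p\).

```python
import numpy as np

# Illustrative sketch: the Bernoulli likelihood
#   L(p | x) = p^sum(x) * (1 - p)^(n - sum(x))
# depends on the data only through sum(x), so two samples with the same
# total produce identical likelihood functions for p.
def bernoulli_likelihood(x, p):
    x = np.asarray(x)
    return p ** x.sum() * (1 - p) ** (len(x) - x.sum())

x1 = np.array([1, 0, 1, 1, 0])   # sum = 3
x2 = np.array([0, 1, 1, 0, 1])   # sum = 3, different ordering

p_grid = np.linspace(0.05, 0.95, 5)
print(bernoulli_likelihood(x1, p_grid))
print(bernoulli_likelihood(x2, p_grid))   # identical to the line above
```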

4.1 Learning Objectives

By the end of this chapter, you should be able to…

  • Calculate bias and variance of various estimators for unknown parameters

  • Explain the distinction between bias and variance colloquially in terms of precision and accuracy, and why these properties are important

  • Compare estimators in terms of their relative efficiency

  • Justify why there exists a bias-variance trade-off, and explain what consequences this may have when comparing estimators

4.2 Concept Questions

  1. Intuitively, what is the difference between bias and precision?

  2. What are the typical steps to checking if an estimator is unbiased?

  3. How can we construct unbiased estimators?

  4. If an estimator is unbiased, is it also asymptotically unbiased? If an estimator is asymptotically unbiased, is it necessarily unbiased?

  5. When comparing estimators, how can we determine which estimator is more efficient?

  6. Why might we care about sufficiency, particularly when thinking about the variance of unbiased estimators?

  7. Describe, in your own words, what the Cramér-Rao inequality tells us.

  8. What is the difference between a UMVUE and an efficient estimator? Does one imply the other?

4.3 Definitions

You are expected to know the following definitions:

Unbiased

An estimator \(\hat{\theta} = g(X_1, \dots, X_n)\) is an unbiased estimator for \(\theta\) if \(E[\hat{\theta}] = \theta\), for all \(\theta\).

Asymptotically Unbiased

An estimator \(\hat{\theta} = g(X_1, \dots, X_n)\) is an asymptotically unbiased estimator for \(\theta\) if \(\underset{n \to \infty}{\text{lim}} E[\hat{\theta}] = \theta\).

Precision

The precision of a random variable \(X\) is given by \(\frac{1}{Var(X)}\).

Mean Squared Error (MSE)

The mean squared error of an estimator is given by

\[ MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + Bias(\hat{\theta})^2 \]
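One way to see the second equality (a standard derivation, included here for completeness) is to add and subtract \(E[\hat{\theta}]\) inside the square, where \(Bias(\hat{\theta}) = E[\hat{\theta}] - \theta\):

\[\begin{align*} E[(\hat{\theta} - \theta)^2] & = E\left[ \left( (\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta) \right)^2 \right] \\ & = E\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right] + 2(E[\hat{\theta}] - \theta)\underbrace{E\left[ \hat{\theta} - E[\hat{\theta}] \right]}_{= \, 0} + (E[\hat{\theta}] - \theta)^2 \\ & = Var(\hat{\theta}) + Bias(\hat{\theta})^2 \end{align*}\]

where the cross term vanishes because \(E[\hat{\theta}] - \theta\) is a constant and \(E[\hat{\theta} - E[\hat{\theta}]] = 0\).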

Sufficient

For some function \(T\), \(T(X)\) is a sufficient statistic for an unknown parameter \(\theta\) if the conditional distribution of \(X\) given \(T(X)\) does not depend on \(\theta\). A “looser” definition is that the distribution of \(X\) must depend on \(\theta\) only through \(T(X)\).

Minimal Sufficiency

For some function \(T^*\), \(T^*(X)\) is a minimal sufficient statistic for an unknown parameter \(\theta\) if \(T^*(X)\) is sufficient and, for every other sufficient statistic \(T(X)\), \(T^*(X) = f(T(X))\) for some function \(f\).

Complete

A statistic \(T(X)\) is complete for an unknown parameter \(\theta\) if \[ E[g(T(X))] \text{ is } \theta-\text{free} \implies g(T(X)) \text{ is constant, almost everywhere} \] for a nice function \(g\).

Importantly, it is equivalent to say that \(T(X)\) is complete for an unknown parameter \(\theta\) if

\[ E[g(T(X))] = 0 \text{ for all } \theta \implies g(T(X)) = 0 \quad\text{ almost everywhere} \]

Relative Efficiency

The relative efficiency of an estimator \(\hat{\theta}_1\) with respect to an estimator \(\hat{\theta}_2\) is the ratio \(Var(\hat{\theta}_2)/Var(\hat{\theta}_1)\).

Uniformly Minimum-Variance Unbiased Estimator (UMVUE)

An unbiased estimator \(\hat{\theta}^*\) is the UMVUE if, for all estimators \(\hat{\theta}\) in the class of unbiased estimators \(\Theta\),

\[ Var(\hat{\theta}^*) \leq Var(\hat{\theta}) \]

Score

The score is defined as the first partial derivative with respect to \(\theta\) of the log-likelihood function, given by

\[ \frac{\partial}{\partial \theta} \log L(\theta \mid x) \]

Information Matrix

The information matrix* \(I(\theta)\) for a collection of iid random variables \(X_1, \dots, X_n\) is the variance of the score, given by

\[ I(\theta) = E \left[ \left( \frac{\partial}{\partial \theta} \log L(\theta \mid x) \right)^2\right] = -E\left[ \frac{\partial^2}{\partial \theta^2} \log L(\theta \mid x)\right] \]

Note that the above formula is in fact the variance of the score, since we can show that the expectation of the score is 0 (under some regularity conditions). This is shown as part of the proof of the C-R lower bound in the Theorems section of this chapter.

The information matrix is sometimes written in terms of a pdf for a single random variable as opposed to a likelihood (this is what our textbook does, for example). In this case, we have \(I(\theta) = n I_1(\theta)\), where the \(I_1(\theta)\) on the right-hand side is defined as \(E \left[ \left( \frac{\partial}{\partial \theta} \log f_X(x \mid \theta) \right)^2\right]\). Sometimes (as in the textbook) \(I_1(\theta)\) is written without the subscript \(1\), which is a slight abuse of notation that is endlessly confusing (to me, at least). For this set of course notes, we’ll always specify the information matrix in terms of a pdf for a single random variable with the subscript \(1\), for clarity.
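As a quick example of this notation (my own illustration, not from the textbook): for a single \(X \sim Bernoulli(p)\), with \(f_X(x \mid p) = p^x(1-p)^{1-x}\),

\[\begin{align*} \log f_X(x \mid p) & = x \log p + (1 - x)\log(1-p) \\ \frac{\partial^2}{\partial p^2} \log f_X(x \mid p) & = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \\ I_1(p) & = -E\left[ \frac{\partial^2}{\partial p^2} \log f_X(X \mid p) \right] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)} \end{align*}\]

so for an iid sample of size \(n\), \(I(p) = n I_1(p) = \frac{n}{p(1-p)}\).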

*The information matrix is often referred to as the Fisher Information matrix, as it was developed by Sir Ronald Fisher. Fisher developed much of the core, statistical theory that we use today. He was also the founding chairman of the University of Cambridge Eugenics Society, and contributed to a large body of scientific work and public policy that promoted racist and classist ideals.

4.4 Theorems

Covariance Inequality (based on the Cauchy-Schwarz inequality)

Let \(X\) and \(Y\) be random variables. Then,

\[ Var(X) \geq \frac{Cov(X, Y)^2}{Var(Y)} \]

The proof is quite clear on Wikipedia.

The Factorization Criterion for sufficiency

Consider a pdf for a random variable \(X\) that depends on an unknown parameter \(\theta\), given by \(\pi(x \mid \theta)\). The statistic \(T(x)\) is sufficient for \(\theta\) if and only if \(\pi(x \mid \theta)\) factors as:

\[ \pi(x \mid \theta) = g(T(x) \mid \theta) h(x) \] where \(g(T(x) \mid \theta)\) depends on \(x\) only through \(T(x)\), and \(h(x)\) does not depend on \(\theta\).

Note that in the statistics literature this criterion is sometimes referred to as the Fisher-Neyman Factorization Criterion.

Two proofs available on Wikipedia. The one for the discrete-only case is more intuitive, if you’d like to look through one of them.
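As an example of the criterion in action (my own illustration): suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\), the model used in the Worked Examples below. The joint pdf factors as

\[ \pi(x \mid \theta) = \prod_{i = 1}^n \frac{1}{\theta} e^{-x_i/\theta} = \underbrace{\theta^{-n} e^{-\sum_{i = 1}^n x_i/\theta}}_{g(T(x) \mid \theta)} \cdot \underbrace{1}_{h(x)} \]

with \(T(x) = \sum_{i = 1}^n x_i\), and so the sum of the observations (equivalently, the sample mean) is sufficient for \(\theta\).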

Lehmann-Scheffé Theorem

Suppose that a random variable \(X\) has pdf given by \(f(x \mid \theta)\), and that \(T^*(X)\) is such that for every* pair of points \((x,y)\), the ratio of pdfs

\[ \frac{f(y \mid \theta)}{f(x \mid \theta)} \] does not depend on \(\theta\) if and only if \(T^*(x) = T^*(y)\). Then \(T^*(X)\) is a minimal sufficient statistic for \(\theta\).

*every pair of points \((x, y)\) in the support of \(X\).

Proof.

We’ll utilize something called a likelihood ratio (literally a ratio of likelihoods) to prove this theorem. We’ll also come back to likelihood ratios later in the Hypothesis Testing chapter!

Let \(\theta_1\) and \(\theta_2\) be two possible values of our unknown parameter \(\theta\). Then a likelihood ratio comparing densities evaluated at these two values is defined as

\[ L_{\theta_1, \theta_2}(x) \equiv \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} \] Our proof will proceed as follows:

  1. We’ll show that if \(T(X)\) is sufficient, then \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\).

  2. We’ll show the converse: If \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\), then \(T(X)\) is sufficient. This combined with (1) will show that \(L_{\theta_1, \theta_2}(X)\) is a minimal sufficient statistic.

  3. We’ll use the above two statements to prove the theorem!

First, suppose that \(T(X)\) is sufficient for \(\theta\). Then, by definition we can write

\[ L_{\theta_1, \theta_2}(x) = \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} = \frac{g(T(x) \mid \theta_2)h(x)}{g(T(x) \mid \theta_1)h(x)} = \frac{g(T(x) \mid \theta_2)}{g(T(x) \mid \theta_1)} \] and so \(L_{\theta_1, \theta_2}(X)\) is a function of \(T(X)\) \(\forall\) \(\theta_1, \theta_2\).

Second, assume WLOG that \(\theta_1\) is fixed, and denote our unknown parameter \(\theta_2 = \theta\). We can rearrange our likelihood ratio as

\[\begin{align*} L_{\theta_1, \theta}(x) & = \frac{f(x \mid \theta)}{f(x \mid \theta_1)} \\ f(x \mid \theta) & = L_{\theta_1, \theta}(x) f(x \mid \theta_1) \end{align*}\]

and note that \(L_{\theta_1, \theta}(x)\) is a function of \(T(X)\) by assumption, and \(f(x \mid \theta_1)\) is a function of \(x\) that does not depend on our unknown parameter \(\theta\). Then \(T(X)\) satisfies the factorization criterion, and is therefore sufficient.

Let \(T^{**}(X) \equiv L_{\theta_1, \theta_2}(X)\). Then the first two statements we have shown give us that

\[ T(X) \text{ is sufficient } \iff T^{**}(X) \text{ is a function of } T(X) \]

Taking \(T(X) = T^{**}(X)\) in the reverse direction shows that \(T^{**}(X)\) is itself sufficient (it is trivially a function of itself), while the forward direction shows that \(T^{**}(X)\) is a function of every other sufficient statistic. Therefore \(T^{**}(X)\) is a minimal sufficient statistic, by definition.

We’ll now (officially) prove our theorem. By hypothesis of the theorem,

\[\begin{align*} T^*(x) = T^*(y) & \iff \frac{f(y \mid \theta)}{f(x \mid \theta)} \text{ is } \theta-free \\ & \iff \frac{f(y \mid \theta_1)}{f(x \mid \theta_1)} = \frac{f(y \mid \theta_2)}{f(x \mid \theta_2)} \quad \forall \theta_1, \theta_2 \\ & \iff \frac{f(y \mid \theta_2)}{f(y \mid \theta_1)} = \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} \quad \forall \theta_1, \theta_2 \\ & \iff L_{\theta_1, \theta_2}(y) = L_{\theta_1, \theta_2} (x) \quad \forall \theta_1, \theta_2 \\ & \iff T^{**}(y) = T^{**}(x) \end{align*}\]

Therefore \(T^*(X)\) and \(T^{**}(X)\) are equivalent. Since \(T^{**}(X)\) is a minimal sufficient statistic, \(T^*(X)\) is therefore also minimal sufficient.
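To see the theorem in action (my own illustration, using the exponential model that appears in the Worked Examples below): if \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\) with joint pdf \(f(x \mid \theta) = \theta^{-n} e^{-\sum_{i=1}^n x_i/\theta}\), then

\[ \frac{f(y \mid \theta)}{f(x \mid \theta)} = \frac{\theta^{-n} e^{-\sum_{i=1}^n y_i/\theta}}{\theta^{-n} e^{-\sum_{i=1}^n x_i/\theta}} = e^{-\left( \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \right)/\theta} \]

which is \(\theta\)-free if and only if \(\sum_{i=1}^n y_i = \sum_{i=1}^n x_i\). By the theorem, \(T^*(X) = \sum_{i=1}^n X_i\) (equivalently, \(\bar{X}\)) is minimal sufficient for \(\theta\).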

Complete, Sufficient, Minimal

If \(T(X)\) is complete and sufficient, then \(T(X)\) is minimal sufficient.

Proof.

Just kidding! Prove it on your own and show it to me, if you want bonus points in my heart :)

Rao-Blackwell-Lehmann-Scheffé (RBLS)

Let \(T(X)\) be a complete and sufficient statistic for unknown parameter \(\theta\), and let \(\tau(\theta)\) be some function of \(\theta\). If there exists at least one unbiased estimator \(\tilde{\tau}(X)\) for \(\tau(\theta)\), then there exists a unique UMVUE \(\hat{\tau}(T(X))\) for \(\tau(\theta)\) given by

\[ \hat{\tau}(T(X)) = E[\tilde{\tau}(X) \mid T(X)] \]

Why do we care? An important consequence of the RBLS Theorem is that if \(T(X)\) is a complete and sufficient statistic for \(\theta\), then any function \(\phi(T(X))\) is the UMVUE of its expectation \(E[\phi(T(X))]\) (so long as the expectation is finite for all \(\theta\)). This Theorem is therefore a very convenient way to find UMVUEs: (1) Find a complete and sufficient statistic for an unknown parameter, and (2) functions of that statistic are then the UMVUE for their expectation!

Proof.

To prove RBLS, we first must prove an Improvement Lemma and a Uniqueness Lemma.

Improvement Lemma. Suppose that \(T(X)\) is a sufficient statistic for \(\theta\). If \(\tilde{\tau}(X)\) is an unbiased estimator of \(\tau(\theta)\), then \(E[\tilde{\tau}(X) \mid T(X)]\) does not depend on \(\theta\) (by sufficiency) and is also an unbiased estimator of \(\tau(\theta)\), which (importantly) has variance no greater than that of \(\tilde{\tau}(X)\).

Proof of Lemma. First, note that \(E[\tilde{\tau}(X) \mid T(X)]\) is an unbiased estimator for \(\tau(\theta)\), since \[\begin{align*} E[E[\tilde{\tau}(X) \mid T(X)]] & = E[\tilde{\tau}(X)] \quad \quad \text{(Law of Iterated Expectation)} \\ & = \tau(\theta) \quad \quad (\tilde{\tau}(X) \text{ is unbiased}) \end{align*}\] Then, by the Law of Total Variance, \[\begin{align*} Var(\tilde{\tau}(X)) & = E[Var(\tilde{\tau}(X) \mid T(X))] + Var(E[\tilde{\tau}(X) \mid T(X)]) \\ & \geq Var(E[\tilde{\tau}(X) \mid T(X)]) \end{align*}\] and we’re done! \(E[\tilde{\tau}(X) \mid T(X)]\) has variance no greater than that of \(\tilde{\tau}(X)\). Since both are unbiased, this is considered an “improvement” (hence the name of the Lemma).

Uniqueness Lemma. If \(T(X)\) is complete, then for an unknown parameter \(\theta\) and a function \(\tau(\theta)\) of it, \(\tau(\theta)\) has at most one unbiased estimator \(\hat{\tau}(T(X))\) that is a function of \(T(X)\).

Proof of Lemma. Suppose, toward contradiction, that \(\tau(\theta)\) has more than one unbiased estimator that depends on \(T(X)\), given by \(\tilde{\tau}(T(X))\) and \(\hat{\tau}(T(X))\), \(\tilde{\tau}(T(X)) \neq \hat{\tau}(T(X))\). Then

\[ E[\tilde{\tau}(T(X)) - \hat{\tau}(T(X))] = \tau(\theta) - \tau(\theta) = 0 \quad \forall \theta \] Let \(g(T(X)) = \tilde{\tau}(T(X)) - \hat{\tau}(T(X))\). Since \(T(X)\) is complete, and \(E[g(T(X))] = 0\), this implies \(\tilde{\tau}(T(X)) - \hat{\tau}(T(X)) = 0\), which means \(\tilde{\tau}(T(X)) = \hat{\tau}(T(X))\). Contradiction.

Back to the proof of RBLS.

We’ve shown previously that \(\hat{\tau}(T(X))\) is an unbiased estimator for \(\tau(\theta)\) (law of iterated expectation). Let \(\tau_1(X)\) be any other unbiased estimator for \(\tau(\theta)\), and let \(\tau_2(T(X)) = E[\tau_1(X) \mid T(X)]\). Then \(\tau_2(T(X))\) is also unbiased for \(\tau(\theta)\) (again, iterated expectation), and by the Uniqueness Lemma (since \(T\) is complete by supposition), \(\hat{\tau}(T(X)) = \tau_2(T(X))\). But,

\[\begin{align*} Var(\hat{\tau}(T(X))) & = Var(\tau_2(T(X))) \quad \quad (\hat{\tau} = \tau_2) \\ & \leq Var(\tau_1(X)) \quad \quad \text{(Improvement Lemma)} \end{align*}\] so \(\hat{\tau}(T(X))\) is the UMVUE for \(\tau(\theta)\), as desired.
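As an example of the “convenient way to find UMVUEs” described above (my own illustration, taking as given the standard exponential-family fact, not proven in these notes, that \(T(X) = \sum_{i = 1}^n X_i\) is complete and sufficient here): suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Then \(\bar{X} = T(X)/n\) is a function of the complete and sufficient statistic with \(E[\bar{X}] = \theta\) (see Problem 1 below), and so

\[ E[\bar{X} \mid T(X)] = \bar{X} \]

is the unique UMVUE for \(\theta\).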

Cramér-Rao Lower Bound

Let \(f_Y(y \mid \theta)\) be a pdf with nice* conditions, and let \(Y_1, \dots, Y_n\) be a random sample from \(f_Y(y \mid \theta)\). Let \(\hat{\theta}\) be any unbiased estimator of \(\theta\). Then

\[\begin{align*} Var(\hat{\theta}) & \geq \left\{ E\left[ \left( \frac{\partial \log( L(\theta \mid y))}{\partial \theta}\right)^2\right]\right\}^{-1} \\ & = -\left\{ E\left[ \frac{\partial^2 \log( L(\theta \mid y))}{\partial \theta^2} \right] \right\}^{-1} \\ & = \frac{1}{I(\theta)} \end{align*}\]

*our nice conditions that we need are that \(f_Y(y \mid \theta)\) has continuous first- and second-order derivatives, which we would quickly discover we need by looking at the form of the C-R lower bound, and that the set of values \(y\) where \(f_Y(y \mid \theta) \neq 0\) does not depend on \(\theta\). If you are familiar with the concept of the “support” of a function, that is where this second condition comes from. The key here is that this condition allows us to interchange derivatives and integrals, in particular, \(\frac{\partial}{\partial \theta} \int f(x) dx = \int \frac{\partial}{\partial \theta} f(x)dx\), which we’ll need to complete the proof.

Proof.

Let \(X = \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta}\). By the Covariance Inequality,

\[ Var(\hat{\theta}) \geq \frac{Cov(\hat{\theta},X)^2}{Var(X)} \]

and so if we can show

\[\begin{align*} \frac{Cov(\hat{\theta},X)^2}{Var(X)} & = \left\{ E\left[ \left( \frac{\partial \log( L(\theta \mid \textbf{y}))}{\partial \theta}\right)^2\right]\right\}^{-1} \\ & = \frac{1}{I(\theta)} \end{align*}\]

then we’re done, as this is the C-R lower bound. Note first that

\[\begin{align*} E[X] & = \int x f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \left( \frac{\partial \log f_Y(\textbf{y} \mid \theta)}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \frac{\frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta)}{ f_Y(\textbf{y} \mid \theta)} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \frac{\partial}{\partial \theta} f_Y (\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} \int f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} 1 \\ & = 0 \end{align*}\]

This means that

\[\begin{align*} Var[X] & = E[X^2] - E[X]^2 \\ & = E[X^2] \\ & = E \left[ \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right)^2\right ] \end{align*}\]

and

\[\begin{align*} Cov(\hat{\theta}, X) & = E[\hat{\theta} X] - E[\hat{\theta}] E[X] \\ & = E[\hat{\theta}X] \\ & = \int \hat{\theta} x f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \left( \frac{\partial \log f_Y(\textbf{y} \mid \theta)}{\partial \theta} \right) f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \frac{\frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta)}{ f_Y(\textbf{y} \mid \theta)} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \int \hat{\theta} \frac{\partial}{\partial \theta} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} \int \hat{\theta} f_Y(\textbf{y} \mid \theta) d\textbf{y} \\ & = \frac{\partial}{\partial \theta} E[\hat{\theta}] \\ & = \frac{\partial}{\partial \theta} \theta \\ & = 1 \end{align*}\]

where \(E[\hat{\theta}] = \theta\) since our estimator is unbiased. Putting this all together, we have

\[\begin{align*} Var[\hat{\theta}] & \geq \frac{Cov(\hat{\theta},X)^2}{Var(X)} \\ & = \frac{1^2}{E \left[ \left( \frac{\partial \log L(\theta \mid \textbf{y})}{\partial \theta} \right)^2\right ]} \\ & = \frac{1}{I(\theta)} \end{align*}\]

as desired.

Comment: Note that what the Cramér-Rao lower bound tells us is that, if the variance of an unbiased estimator is equal to the Cramér-Rao lower bound, then that estimator has the minimum possible variance among all unbiased estimators there could possibly be. This gives us one way to prove that an unbiased estimator is the UMVUE: if an unbiased estimator’s variance achieves the C-R lower bound, then it is optimal according to the UMVUE criterion. (The converse need not hold, however; a UMVUE does not necessarily attain the bound.)
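As a quick illustration (my own, tying this bound to the Worked Examples below): for \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\),

\[\begin{align*} \log f_X(x \mid \theta) & = -\log\theta - \frac{x}{\theta} \\ I_1(\theta) & = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f_X(X \mid \theta) \right] = -E\left[ \frac{1}{\theta^2} - \frac{2X}{\theta^3} \right] = -\frac{1}{\theta^2} + \frac{2\theta}{\theta^3} = \frac{1}{\theta^2} \end{align*}\]

so the C-R lower bound is \(\frac{1}{I(\theta)} = \frac{1}{n I_1(\theta)} = \frac{\theta^2}{n}\). Since \(\bar{X}\) is unbiased for \(\theta\) with \(Var(\bar{X}) = \theta^2/n\) (Problems 1 and 3 below), \(\bar{X}\) attains the bound and is therefore the UMVUE for \(\theta\).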

4.5 Worked Examples

Problem 1: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Compute the MLE of \(\theta\), and show that it is an unbiased estimator of \(\theta\).

Solution:

Note that we can write

\[\begin{align*} L(\theta) & = \prod_{i = 1}^n \frac{1}{\theta} e^{-x_i / \theta} \\ \log L(\theta) & = \sum_{i = 1}^n \log(\frac{1}{\theta} e^{-x_i / \theta}) \\ & = \sum_{i = 1}^n \log(\frac{1}{\theta}) - \sum_{i = 1}^n x_i / \theta \\ & = -n \log(\theta) - \frac{1}{\theta} \sum_{i = 1}^n x_i \\ \frac{\partial}{\partial \theta} \log L(\theta) & = \frac{\partial}{\partial \theta} \left( -n \log(\theta) - \frac{1}{\theta} \sum_{i = 1}^n x_i \right) \\ & = -\frac{n}{\theta} + \frac{\sum_{i = 1}^n x_i }{\theta^2} \end{align*}\]

Setting this equal to \(0\) and solving for \(\theta\) we obtain

\[\begin{align*} 0 & \equiv -\frac{n}{\theta} + \frac{\sum_{i = 1}^n x_i }{\theta^2} \\ \frac{n}{\theta} & = \frac{\sum_{i = 1}^n x_i }{\theta^2} \\ n & = \frac{\sum_{i = 1}^n x_i }{\theta} \\ \theta & = \frac{1}{n} \sum_{i = 1}^n x_i \end{align*}\]

and so the MLE for \(\theta\) is the sample mean. To show that the MLE is unbiased, we note that

\[\begin{align*} E \left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] & = \frac{1}{n} \sum_{i = 1}^n E[X_i] = \frac{1}{n} \sum_{i = 1}^n \theta = \theta \end{align*}\]

as desired.

Problem 2: Suppose again that \(X_1, \dots, X_n \overset{iid}{\sim} Exponential(1/\theta)\). Let \(\hat{\theta}_2 = X_1\), and \(\hat{\theta}_3 = nX_{(1)}\), where \(X_{(1)} = \min_i X_i\). Show that \(\hat{\theta}_2\) and \(\hat{\theta}_3\) are unbiased estimators of \(\theta\). Hint: use the fact that \(X_{(1)} \sim Exponential(n/\theta)\).

Solution:

Note that the mean of an \(Exponential(\lambda)\) random variable is given by \(1/\lambda\). Then we can write

\[ E[\hat{\theta}_2] = E[X_1] = \frac{1}{1/\theta} = \theta \]

and

\[ E[\hat{\theta}_3] = E[nX_{(1)}] = \frac{n}{n/\theta} = \theta \] as desired.

Problem 3: Compare the variance of the estimators from Problems 1 and 2. Which is most efficient?

Solution:

Recall that the variance of an \(Exponential(\lambda)\) random variable is given by \(1/\lambda^2\). Let the MLE from Problem 1 be denoted \(\hat{\theta}_1 = \bar{X}\). Then we can write

\[ Var\left[\hat{\theta}_1\right] = Var\left[\frac{1}{n} \sum_{i = 1}^n X_i\right] = \frac{1}{n^2} \sum_{i = 1}^n Var[X_i] = \frac{1}{n^2} \left( \frac{n}{(1/\theta)^2} \right) = \frac{\theta^2}{n} \]

and

\[ Var\left[\hat{\theta}_2\right] = Var[X_1] = \frac{1}{(1/\theta)^2} = \theta^2 \]

and

\[ Var\left[\hat{\theta}_3\right] = Var[nX_{(1)}] = n^2 Var[X_{(1)}] = \frac{n^2}{(n/\theta)^2} = \theta^2 \]

Thus the MLE \(\hat{\theta}_1\) is the most efficient of the three estimators: its variance is \(n\) times smaller than that of both \(\hat{\theta}_2\) and \(\hat{\theta}_3\).
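A quick simulation sketch (my own, not part of the original problem) confirms these calculations empirically; with \(\theta = 2\) and \(n = 10\), we expect all three estimators to have mean near \(2\) and variances near \(\theta^2/n = 0.4\), \(\theta^2 = 4\), and \(\theta^2 = 4\), respectively.

```python
import numpy as np

# Simulation sketch: compare the three unbiased estimators of theta from
# Problems 1-3, where X_i ~ Exponential(1/theta), i.e. mean theta.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 100_000

x = rng.exponential(scale=theta, size=(reps, n))   # scale parameter = mean = theta

theta1 = x.mean(axis=1)          # MLE: the sample mean
theta2 = x[:, 0]                 # the first observation
theta3 = n * x.min(axis=1)       # n times the sample minimum

for name, est in [("theta1 (MLE)", theta1), ("theta2", theta2), ("theta3", theta3)]:
    print(f"{name:12s}  mean = {est.mean():.3f}   var = {est.var():.3f}")
# Expected output: means near 2; variances near 0.4, 4, and 4.
```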

Problem 4: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)\). Show that the estimator \(\hat{\mu} = \frac{1}{n} \sum_{i = 1}^n X_i\) and the estimator \(\hat{\mu}_w = \sum_{i = 1}^n w_i X_i\) are both unbiased estimators of \(\mu\), where \(\sum_{i = 1}^n w_i = 1\).

Solution:

We can write

\[ E[\hat{\mu}] = E\left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] = \frac{1}{n}\sum_{i = 1}^n E[X_i] = \frac{1}{n}\sum_{i = 1}^n \mu = \mu \]

and

\[ E[\hat{\mu}_w] = E \left[ \sum_{i = 1}^n w_i X_i \right] = \sum_{i = 1}^n w_i E \left[ X_i \right] = \sum_{i = 1}^n w_i \mu = \mu \sum_{i = 1}^n w_i = \mu \]

as desired.

Problem 5: Determine whether the estimator \(\hat{\mu}\) or \(\hat{\mu}_w\) from Problem 4 is more efficient, if we additionally impose the constraint \(w_i \geq 0\) \(\forall i\). (Note that this is a more “general” example based on Example 5.4.5 in the course textbook.) (Hint: use the Cauchy-Schwarz inequality.)

Solution:

To determine relative efficiency, we must compute the variance of each estimator. We have

\[ Var[\hat{\mu}] = Var \left[ \frac{1}{n} \sum_{i = 1}^n X_i \right] = \frac{1}{n^2} \sum_{i = 1}^n Var[X_i] = \frac{1}{n^2} \sum_{i = 1}^n \sigma^2 = \sigma^2 / n \]

and

\[\begin{align*} Var[\hat{\mu}_w] & = Var \left[ \sum_{i = 1}^n w_i X_i \right] \\ & = \sum_{i = 1}^n Var[w_i X_i] \\ & = \sum_{i = 1}^n w_i^2 Var[X_i] \\ & = \sum_{i = 1}^n w_i^2 \sigma^2 \\ & = \sigma^2 \sum_{i = 1}^n w_i^2 \end{align*}\]

And so to determine which estimator is more efficient, we need to determine if \(\frac{1}{n}\) is less than \(\sum_{i = 1}^n w_i^2\) (or not). The Cauchy-Schwarz inequality tells us that

\[\begin{align*} \left( \sum_{i = 1}^n w_i \cdot 1\right)^2 & \leq \left( \sum_{i = 1}^n w_i^2 \right) \left( \sum_{i = 1}^n 1^2 \right) \\ \left( \sum_{i = 1}^n w_i \right)^2 & \leq \left( \sum_{i = 1}^n w_i^2 \right) n \\ 1 & \leq \left( \sum_{i = 1}^n w_i^2 \right) n \\ \frac{1}{n} & \leq \sum_{i = 1}^n w_i^2 \end{align*}\]

and therefore, \(\hat{\mu}\) is more efficient than \(\hat{\mu}_w\).

Problem 6: Suppose \(X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)\). Show that the MLE for \(\sigma^2\) is biased, and suggest a modified variance estimator for \(\sigma^2\) that is unbiased. (Note that this is example 5.4.4 in our course textbook)

Solution:

Recall that the MLE for \(\sigma^2\) is given by \(\frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\). Then

\[\begin{align*} E\left[ \frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\right] & = \frac{1}{n} \sum_{i = 1}^n E\left[ (X_i - \bar{X})^2\right] \\ & = \frac{1}{n} \sum_{i = 1}^n E\left[ X_i^2 - 2X_i \bar{X} + \bar{X}^2\right] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \frac{1}{n} \sum_{i = 1}^n X_i \bar{X} \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \bar{X} \frac{1}{n} \sum_{i = 1}^n X_i \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - 2 E\left[ \bar{X}^2 \right] + E[\bar{X}^2] \\ & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - E\left[ \bar{X}^2 \right] \end{align*}\]

Recall that since \(X_i \overset{iid}{\sim} N(\mu, \sigma^2)\), \(\bar{X} \sim N(\mu, \sigma^2/n)\), and that we can write \(Var[Y] + E[Y]^2 = E[Y^2]\) (definition of variance). Then we can write

\[\begin{align*} E\left[ \frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2 \right] & = \frac{1}{n} \sum_{i = 1}^n E[X_i^2] - E\left[ \bar{X}^2 \right] \\ & = \frac{1}{n} \sum_{i = 1}^n \left( \sigma^2 + \mu^2 \right) - \left( \frac{\sigma^2}{n} + \mu^2 \right) \\ & = \sigma^2 + \mu^2 - \frac{\sigma^2}{n} - \mu^2 \\ & = \sigma^2 - \frac{\sigma^2}{n} \\ & = \sigma^2 \left( 1 - \frac{1}{n} \right) \\ & = \sigma^2 \left( \frac{n-1}{n} \right) \end{align*}\]

Therefore, since \(E[\hat{\sigma}^2_{MLE}] \neq \sigma^2\), the MLE is biased. Note that

\[\begin{align*} E\left[ \left( \frac{n}{n-1} \right)\frac{1}{n} \sum_{i = 1}^n (X_i - \bar{X})^2\right] & = \left( \frac{n}{n-1} \right) \left( \frac{n-1}{n} \right) \sigma^2 \\ & = \sigma^2 \end{align*}\]

and so the estimator \(\frac{1}{n-1} \sum_{i = 1}^n (X_i - \bar{X})^2\) is an unbiased estimator for \(\sigma^2\). This estimator is often called the “sample variance”, and is denoted by \(S^2\).
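A short simulation sketch (my own) shows this bias numerically; numpy’s ddof argument switches between the two divisors.

```python
import numpy as np

# Simulation sketch: the MLE of sigma^2 (divide by n) is biased downward by
# the factor (n - 1)/n, while the sample variance S^2 (divide by n - 1) is
# unbiased. With sigma^2 = 9 and n = 5, E[MLE] should be about 7.2.
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 3.0, 9.0, 5, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mle = x.var(axis=1, ddof=0)    # divides by n
s2  = x.var(axis=1, ddof=1)    # divides by n - 1

print(f"E[MLE] ~ {mle.mean():.3f}   (theory: {(n - 1) / n * sigma2:.3f})")
print(f"E[S^2] ~ {s2.mean():.3f}   (theory: {sigma2:.3f})")
```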