The goal of hypothesis testing is to make a decision between two conflicting theories, or “hypotheses.” The process of hypothesis testing involves the following steps:
State the hypotheses: \(H_0\) (null hypothesis) vs \(H_1\) (alternative hypothesis)
Investigate: are data compatible with \(H_0\)? assuming \(H_0\) were true, are data extreme?
Make a decision: reject \(H_0\) or fail to reject \(H_0\)
The first step is relatively straightforward. For the purposes of this course, our null hypothesis will always be that some unknown parameter we are interested in \((\theta)\) is equal to a fixed point \((\theta_0)\). We’ll consider two possible alternative hypotheses:
\(H_1: \theta = \theta_1\) (“simple” alternative)
\(H_1: \theta \neq \theta_0\) (two-sided alternative)
The former is the simplest, non-trivial alternative hypothesis we can consider, and we can prove some nice things in this setting (and hence build intuition for hypothesis testing broadly). The latter is perhaps more relevant, particularly in linear regression.
If you recall from introductory statistics, the latter alternative provides the set-up we have when testing whether a predictor \(X\) and an outcome \(Y\) are “statistically significantly” associated: we test the null hypothesis \(H_0: \beta_1 = 0\) against the alternative \(H_1: \beta_1 \neq 0\), where \(E[Y \mid X] = \beta_0 + \beta_1 X\). In this example, the unknown parameter is \(\theta = \beta_1\), and the fixed point of our null hypothesis is \(\theta_0 = 0\).
The second step of hypothesis testing is the investigation. In determining whether the data are compatible with the null hypothesis, we must first derive a test statistic. Test statistics are typically functions of (1) our estimator and (2) the distribution of that estimator under the null hypothesis. Intuitively, if we can determine the distribution of our estimator under the null hypothesis, we can then ask whether the data we actually observed are “extreme,” given a certain threshold, \(\alpha\), for our hypothesis test. This threshold \(\alpha\) is directly related to a \(100(1 - \alpha)\%\) confidence interval: anything observed outside the confidence interval bounds is considered to lie in the “rejection region” (where you would thus reject the null hypothesis).
There are three classical forms of test statistics that have varying finite-sample properties, and can be shown to be asymptotically equivalent: the Wald test, the likelihood ratio test (LRT), and the score test (also sometimes called the Lagrange multiplier test). Each of these is explained in further detail below.
Suppose we are interested in testing the hypotheses \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\). The Wald test is the hypothesis test that uses the Wald test statistic \(\lambda_{W}\), where
\[ \lambda_W = \left( \frac{\hat{\theta}_{MLE} - \theta_0}{se(\hat{\theta}_{MLE})}\right)^2. \]
Intuitively, the Wald test measures the difference between the estimated value for \(\theta\) and the null value for \(\theta\), standardized by the variation of your estimator. If this reminds you (once again) of a z-score, it should! In linear regression with normally distributed errors, it turns out that \(\sqrt{\lambda_W}\) follows a \(t\) distribution (we’ll show this on your problem set!).
Wald test statistics are extremely straightforward to compute from the Central Limit Theorem. The CLT states that, for iid \(X_1, \dots, X_n\) with expectation \(\mu\) and variance \(\sigma^2\),
\[ \sqrt{n} (\overline{X} - \mu) \overset{d}{\to} N(0, \sigma^2). \]
Rescaling by \(\sigma\) (or, by Slutsky’s theorem, by a consistent estimate of \(\sigma\), such as the sample standard deviation) allows us to write
\[ \left( \frac{\overline{X} - \mu}{\sigma / \sqrt{n}}\right) \overset{d}{\to} N(0,1), \]
and note that the left-hand side is an estimator minus its expectation, divided by its standard error. When \(\overline{X}\) is the MLE for \(\mu\), this is the square root of the Wald test statistic! The final thing to note is that the right-hand side tells us this quantity converges in distribution to a standard normal distribution. Think about what we’ve previously shown about standard normals “squared” to intuit the asymptotic distribution of a Wald test statistic: a \(\chi^2_\nu\) random variable, where the degrees of freedom \(\nu\) in this case is one! For a single parameter restriction (i.e. one hypothesis for one unknown parameter), the asymptotic distribution of a Wald test statistic will always be \(\chi^2_1\).
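To see this in action, here is a minimal simulation sketch (the \(Bernoulli(0.5)\) setting, the sample size \(n = 200\), and the number of simulated datasets below are all just illustrative assumptions): when we generate data under \(H_0\) and reject whenever the Wald statistic exceeds the \(\chi^2_1\) critical value, the rejection rate should be close to \(\alpha = 0.05\).

# Simulation sketch: under H_0, the Wald statistic behaves (approximately)
# like a chi-squared random variable with 1 degree of freedom.
# All settings below are assumptions chosen for illustration.
set.seed(451)
p0 <- 0.5     # null value
n <- 200      # sample size
B <- 5000     # number of simulated datasets

lambda_w <- replicate(B, {
  x <- rbinom(n, size = 1, prob = p0)   # data generated under H_0
  p_hat <- mean(x)                      # MLE of p
  ((p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n))^2
})

# Empirical rejection rate at the chi-squared(1) critical value;
# this should be close to alpha = 0.05
mean(lambda_w > qchisq(0.95, df = 1))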
Note that there is also a multivariate version of the Wald test, used to jointly test multiple hypotheses on multiple parameters. In this case, we can write our null and alternative hypotheses using matrices and vectors.
As a simple example, consider a linear regression model where we have a single, categorical predictor with three categories. Our regression model looks something like this:
\[ E[Y \mid X] = \beta_0 + \beta_1 X_{Cat2} + \beta_2 X_{Cat3} \]
If we want to test if there is a significant association between \(Y\) and \(X\), we can’t look at \(\hat{\beta}_1\) and \(\hat{\beta}_2\) separately. Rather, we need to test the joint null hypothesis \(\beta_1 = \beta_2 = 0\) vs. the alternative where at least one of our coefficients is not equal to zero. In introductory statistics, we did this using the anova function in R. In matrix form, we can write our null and alternative hypotheses as:
\(H_0: R \boldsymbol{\beta} = \textbf{r}\)
\(H_1: R \boldsymbol{\beta} \neq \textbf{r}\)
where, in this case, \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)^\top\), \(\textbf{r} = (0,0)^\top\), and \(R\) is the \(2 \times 3\) matrix \(\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\) that picks out \(\beta_1\) and \(\beta_2\). The Wald test statistic in this multi-hypothesis, multi-parameter case can then be written as
\[ (R\hat{\boldsymbol{\theta}} - \textbf{r})^\top [R (\hat{V}/n) R^\top]^{-1} (R\hat{\boldsymbol{\theta}} - \textbf{r}) \]
where \(\hat{V}\) is an estimator of the covariance matrix for \(\hat{\boldsymbol{\theta}}\) (here, \(\hat{\boldsymbol{\beta}}\)). We won’t focus on multi-hypothesis, multi-parameter tests in this course, but I do want you to be able to draw connections between statistical theory and things you learned way back in your introductory statistics course, which is why this is included in the notes.
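To tie this back to that introductory course, here is a small sketch in R (the data are simulated, and every name and number below is a made-up assumption for illustration): we fit a linear model with a three-level categorical predictor and use the anova function to carry out the joint test of \(\beta_1 = \beta_2 = 0\) by comparing nested models.

# Sketch: joint test of a three-level categorical predictor via anova()
# (simulated data; all names and settings are illustrative assumptions)
set.seed(451)
n <- 150
x <- factor(sample(c("Cat1", "Cat2", "Cat3"), size = n, replace = TRUE))
y <- 2 + 0.5 * (x == "Cat2") + 1.0 * (x == "Cat3") + rnorm(n)

fit_null <- lm(y ~ 1)   # model under H_0: beta_1 = beta_2 = 0
fit_full <- lm(y ~ x)   # model with both category coefficients

# F-test comparing the nested models -- the introductory-statistics version
# of this joint hypothesis test
anova(fit_null, fit_full)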
Suppose we are interested in testing the hypotheses \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\). The likelihood ratio test is the hypothesis test that uses the likelihood ratio test statistic \(\lambda_{LRT}\), where
\[ \lambda_{LRT} = -2 \log\left(\frac{\underset{\theta = \theta_0}{\text{sup}} \hspace{1mm} L(\theta)}{\underset{\theta \in \Theta}{\text{sup}} \hspace{1mm} L(\theta)} \right). \]
Since the ratio of the likelihoods is bounded between 0 and 1 (the denominator will always be at least as large as the numerator), the LRT statistic is always nonnegative. When \(\lambda_{LRT}\) is large, it suggests that the data are not compatible with \(H_0\), and values of \(\lambda_{LRT}\) close to \(0\) suggest the data are compatible with \(H_0\). Therefore, we’ll reject \(H_0\) for large values of \(\lambda_{LRT}\) and fail to reject \(H_0\) if \(\lambda_{LRT}\) is small. The likelihood ratio test is the most “powerful” of all level \(\alpha\) tests when we have a simple alternative hypothesis, and we can prove this using the Neyman-Pearson Lemma. For the simple null hypothesis on a single parameter that we consider, it can be shown that \(\lambda_{LRT} \overset{d}{\to} \chi^2_1\), just as with the Wald test statistic.
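As a quick check on that last asymptotic claim, here is a minimal simulation sketch (assuming \(N(\mu, 1)\) data with known variance, \(n = 50\), and \(H_0: \mu = 0\), all chosen purely for illustration): we compute \(\lambda_{LRT}\) directly from the log likelihood and check that its rejection rate at the \(\chi^2_1\) critical value is close to \(\alpha = 0.05\).

# Simulation sketch: LRT statistic for H_0: mu = mu_0 with N(mu, 1) data
# (known variance); all settings are illustrative assumptions
set.seed(451)
mu0 <- 0
n <- 50
B <- 5000

loglik <- function(mu, x) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))

lambda_lrt <- replicate(B, {
  x <- rnorm(n, mean = mu0, sd = 1)            # data generated under H_0
  -2 * (loglik(mu0, x) - loglik(mean(x), x))   # the MLE of mu is the sample mean
})

# Empirical rejection rate at the chi-squared(1) critical value;
# this should be close to alpha = 0.05
mean(lambda_lrt > qchisq(0.95, df = 1))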
Suppose we are interested in testing the hypotheses \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\). The score test is the hypothesis test that uses the score test statistic \(\lambda_S\),
\[ \lambda_S = \frac{\left( \frac{\partial}{\partial \theta_0} \log L(\theta_0 \mid x) \right)^2}{I(\theta_0)} \]
as its test statistic. Note that the score test statistic depends only on the null value \(\theta_0\) (through the score and the information evaluated at \(\theta_0\)), rather than on the maximum likelihood estimator. This is sometimes referred to as a test that requires only computation of a restricted estimator (where \(\theta_0\) is “restricted” by the null hypothesis). The score test statistic is particularly useful when the MLE is on the boundary of the parameter space (think: order statistics).
Intuitively, if \(\theta_0\) is near the estimator that maximizes the log likelihood function, the derivative of the log likelihood function should be close to \(0\). The score statistic “standardizes” this derivative by a measure of the variation of the estimator, contained in the information matrix. Values of \(\lambda_S\) that are closer to zero are then more compatible with \(H_0\), since this suggests \(\theta_0\) is close to the estimator that maximizes the log likelihood function. We’ll reject \(H_0\) for large values of \(\lambda_S\). For the simple null hypothesis on a single parameter that we consider, it can be shown that \(\lambda_{S} \overset{d}{\to} \chi^2_1\), just as with the Wald test statistic and LRT statistic.
By the end of this chapter, you should be able to…
Wald Test Statistic
The Wald test statistic \(\lambda_W\) for testing the hypothesis \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\) is given by
\[ \lambda_W = \left(\frac{\hat{\theta}_{MLE} - \theta_0}{se(\hat{\theta}_{MLE})}\right)^2, \]
where \(\hat{\theta}_{MLE}\) is a maximum likelihood estimator.
Likelihood Ratio Test (LRT) Statistic
The likelihood ratio test statistic \(\lambda_{LRT}\) for testing the hypothesis \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\) is given by
\[ \lambda_{LRT} = -2 \log\left(\frac{\underset{\theta = \theta_0}{\text{sup}} \hspace{1mm} L(\theta)}{\underset{\theta \in \Theta}{\text{sup}} \hspace{1mm} L(\theta)}\right), \]
where we note that the denominator, \(\underset{\theta \in \Theta}{\text{sup}} \hspace{1mm} L(\theta)\), is the likelihood evaluated at the maximum likelihood estimator.
Score Test Statistic
The score test statistic \(\lambda_S\) for testing the hypothesis \(H_0: \theta = \theta_0\) vs. \(H_1: \theta \neq \theta_0\) is given by
\[ \lambda_S = \frac{\left( \frac{\partial}{\partial \theta_0} \log L(\theta_0 \mid x) \right)^2}{I(\theta_0)}. \]
Power
Power is the probability that we correctly reject the null hypothesis; that is, the probability that we reject the null hypothesis when the null hypothesis is actually false. As a conditional probability statement: \(\Pr(\text{Reject }H_0 \mid H_0 \text{ False})\). Note that
\[ \text{Power} = 1 - \text{Type II Error} \]
Type I Error (“False positive”)
Type I Error is the probability that the null hypothesis is rejected, when the null hypothesis is actually true. As a conditional probability statement: \(\Pr(\text{Reject }H_0 \mid H_0 \text{ True})\)
Type II Error (“False negative”)
Type II Error is the probability that we fail to reject the null hypothesis, given that the null hypothesis is actually false. As a conditional probability statement: \(\Pr(\text{Fail to reject }H_0 \mid H_0 \text{ False})\)
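To make these three definitions concrete, here is a small simulation sketch (the \(N(\mu, 25)\) model, \(n = 64\), \(H_0: \mu = 8\), and the assumed true value \(\mu = 10\) are all just illustrative choices): simulating data with the null true gives an estimate of the Type I error rate, and simulating data with the null false gives an estimate of power.

# Simulation sketch: estimating Type I error and power for a Wald test of
# H_0: mu = 8 with N(mu, 25) data and n = 64
# (all settings are assumptions chosen for illustration)
set.seed(451)
B <- 5000
n <- 64
sigma <- 5
mu0 <- 8
crit <- qchisq(0.95, df = 1)

wald_rejects <- function(true_mu) {
  replicate(B, {
    x <- rnorm(n, mean = true_mu, sd = sigma)
    lambda_w <- ((mean(x) - mu0) / (sigma / sqrt(n)))^2
    lambda_w > crit
  })
}

mean(wald_rejects(8))    # estimated Type I error (null is true); near 0.05
mean(wald_rejects(10))   # estimated power under an assumed truth of mu = 10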
Critical Region / Rejection Region
The critical/rejection region is defined as the set of values for which the null hypothesis would be rejected. This set is often denoted with a capital \(R\).
Critical Value
The critical value is the point that separates the rejection region from the “acceptance” region (i.e., the value at which the decision for your hypothesis test would change). Acceptance is in quotes because we should never “accept” the null hypothesis… but we still call the “fail-to-reject” region the acceptance region for short.
Significance Level
The significance level, denoted \(\alpha\), is the probability that, under the null hypothesis, the test statistic lies in the critical/rejection region.
P-value
The p-value associated with a test statistic is the probability of obtaining a value as or more extreme than the observed test statistic, under the null hypothesis.
Uniformly Most Powerful (UMP) Test
A “most powerful” test is a hypothesis test that has the greatest power among all possible tests of a given significance threshold \(\alpha\). A uniformly most powerful (UMP) test is a test that is most powerful for all possible values of parameters in the restricted parameter space, \(\Theta_0\).
More formally, let the set \(R\) denote the rejection region of a hypothesis test. Let
\[ \phi(x) = \begin{cases} 1 & \quad \text{if } x \in R \\ 0 & \quad \text{if } x \in R^c \end{cases} \]
Then \(\phi(x)\) is an indicator function. Recalling that expectations of indicator functions are probabilities, note that \(E[\phi(x)] = \Pr(\text{Reject } H_0)\); \(\phi(x)\) then represents our hypothesis test. A hypothesis test \(\phi(x)\) is UMP of size \(\alpha\) if, for any other hypothesis test \(\phi'(x)\) of size at most \(\alpha\), i.e., any \(\phi'\) satisfying
\[ \underset{\theta \in \Theta_0}{\text{sup}} E[\phi'(X) \mid \theta] \leq \underset{\theta \in \Theta_0}{\text{sup}} E[\phi(X) \mid \theta] \]
we have that \(\forall \theta \in \Theta_1\),
\[ E[\phi'(X) \mid \theta] \leq E[\phi(X) \mid \theta], \]
where \(\Theta_0\) is the set of all values for \(\theta\) that align with the null hypothesis (sometimes just a single point, sometimes a region), and \(\Theta_1\) is the set of all values for \(\theta\) that align with the alternative hypothesis (sometimes just a single point, sometimes a region). Note: In general, UMP tests do not exist for two-sided alternative hypotheses. The Neyman-Pearson lemma tells us about UMP tests for simple null and alternative hypotheses, and the Karlin-Rubin theorem extends this to one-sided null and alternative hypotheses.
Neyman-Pearson Lemma
Consider a hypothesis test with \(H_0: \theta = \theta_0\) and \(H_1: \theta = \theta_1\). Let \(\phi\) be a likelihood ratio test of level \(\alpha\), where \(\alpha = E[\phi(X) \mid \theta_0]\). Then \(\phi\) is a UMP level \(\alpha\) test for the hypotheses \(H_0: \theta = \theta_0\) and \(H_1: \theta = \theta_1\).
Proof.
Let \(\alpha = E[\phi(X) \mid \theta_0]\). Note that the LRT statistic is simplified in the case of these simple hypotheses, and can be written just as \(\frac{f(x \mid \theta_1)}{f(x \mid \theta_0)}\).* If the ratio of the likelihood under the alternative to the likelihood under the null is greater than some constant \(c\) (which depends on \(\alpha\)), then we reject the null in favor of the alternative, and vice versa. Then the hypothesis testing function \(\phi\) can be written as
\[ \phi(x) = \begin{cases} 0 & \quad \text{if } \lambda_{LRT} = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} < c\\ 1 & \quad \text{if } \lambda_{LRT} = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > c\\ \text{Flip a coin} & \quad \text{if } \lambda_{LRT} = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} = c \end{cases} \]

Suppose \(\phi'\) is any other test such that \(E[\phi'(X) \mid \theta_0] \leq \alpha\) (another level \(\alpha\) test). Then we must show that \(E[\phi'(X) \mid \theta_1] \leq E[\phi(X) \mid \theta_1]\).
By assumption, we have
\[\begin{align*} E[\phi(X) \mid \theta_0] &= \int \phi(x) f_X(x \mid \theta_0) dx = \alpha \\ E[\phi'(X) \mid \theta_0] &= \int \phi'(x) f_X(x \mid \theta_0) dx \leq \alpha \end{align*}\]
Therefore we can write
\[\begin{align*} E[\phi(X) & \mid \theta_1] - E[\phi'(X) \mid \theta_1] \\ & = \int \phi(x) f_X(x \mid \theta_1) dx - \int \phi'(x) f_X(x \mid \theta_1) dx \\ & = \int [\phi(x) - \phi'(x)] f_X(x \mid \theta_1) dx \\ & = \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > c \right\}} \underbrace{[\phi(x) - \phi'(x)]}_{\geq 0} f_X(x \mid \theta_1) dx + \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} < c \right\}} \underbrace{[\phi(x) - \phi'(x)]}_{\leq 0} f_X(x \mid \theta_1) dx + \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} = c \right\}} [\phi(x) - \phi'(x)] f_X(x \mid \theta_1) dx \\ & \geq \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} > c \right\}} [\phi(x) - \phi'(x)] cf_X(x \mid \theta_0) dx + \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} < c \right\}} [\phi(x) - \phi'(x)] cf_X(x \mid \theta_0) dx + \int_{\left\{ \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)} = c \right\}} [\phi(x) - \phi'(x)] cf_X(x \mid \theta_0) dx \\ & = c \int [\phi(x) - \phi'(x)] f_X(x \mid \theta_0) dx \\ & = c \int \phi(x) f_X(x \mid \theta_0) dx - c \int \phi'(x) f_X(x \mid \theta_0) dx \\ & \geq c(\alpha - \alpha) \\ & = 0 \end{align*}\]
And rearranging yields
\[\begin{align*} E[\phi(X) \mid \theta_1] - E[\phi'(X) \mid \theta_1] & \geq 0 \\ E[\phi(X) \mid \theta_1] & \geq E[\phi'(X) \mid \theta_1] \end{align*}\]
as desired.
*Note: The \(-2\log(\dots)\) piece comes into play for the LRT statistic to ensure that the test statistic converges in distribution to a \(\chi^2\) random variable. When we’re just comparing the LRT statistic to another LRT test statistic, we can (more simply) just compare the ratio of likelihoods. Think: comparing \(X\) vs. \(Y\) is equivalent to comparing \(\log(X)\) vs. \(\log(Y)\) if we are only interested in the direction of the difference between them, since \(\log\) is a monotone function.
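To see what the lemma buys us in practice, here is a small simulation sketch (assuming \(N(\mu, 1)\) data, \(n = 25\), and the simple hypotheses \(H_0: \mu = 0\) vs. \(H_1: \mu = 1\), all made up for illustration). In this model the likelihood ratio is increasing in \(\overline{X}\), so the likelihood ratio test rejects for large values of the sample mean; we compare its power to another perfectly valid level-0.05 test that simply ignores most of the data.

# Simulation sketch: power of the likelihood ratio test vs. another level-0.05
# test for the simple hypotheses H_0: mu = 0 vs. H_1: mu = 1, with N(mu, 1)
# data and n = 25 (illustrative assumptions throughout)
set.seed(451)
B <- 5000
n <- 25

# Both tests below have exact level 0.05 under H_0
lrt_reject   <- function(x) mean(x) > qnorm(0.95) / sqrt(n)  # rejects for large xbar
other_reject <- function(x) x[1] > qnorm(0.95)               # valid, but wasteful

rejections <- replicate(B, {
  x <- rnorm(n, mean = 1, sd = 1)   # data generated under H_1
  c(lrt = lrt_reject(x), other = other_reject(x))
})

rowMeans(rejections)   # estimated power of each test; the LRT should dominate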
Problem 1: Let \(Y_i \overset{iid}{\sim} N(\mu, \sigma^2)\), where \(\sigma^2 = 25\) is known. Suppose we want to test the hypotheses \(H_0: \mu = 8\) vs. \(H_1: \mu \neq 8\) and we observe \(\overline{Y} = 10\) across \(n = 64\) observations. Can we reject \(H_0\), with a significance threshold of \(\alpha = 0.05\)? (Use a Wald test statistic)
Our hypotheses are already stated in the problem set-up. The next thing we should do is derive a Wald test statistic. We know that the MLE for \(\mu\) is given by \(\hat{\mu}_{MLE} = \overline{Y}\) (we have shown this in previous problem sets/worked examples). Then the Wald test statistic can be written as
\[ \lambda_W = \left( \frac{\hat{\mu}_{MLE} - \mu_0}{\sigma/\sqrt{n}}\right)^2 = \left( \frac{10 - 8}{5/\sqrt{64}}\right)^2 = 10.24 \]
We can compare this test statistic to the critical value from a \(\chi^2_1\) distribution since, by properties of normal distributions and recalling that standard normal distributions squared are \(\chi^2_1\),
\[\begin{align*} \overline{Y} & \sim N(\mu, \sigma^2/n) \\ \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} & \sim N(0,1) \\ \left( \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \right)^2 & \sim \chi^2_1 \end{align*}\]
To calculate the critical value when \(\alpha = 0.05\), we turn to R.
# The quantile function for a given distribution gives us the value below
# which a given proportion of the distribution lies, which is exactly what
# we want in this case!
qchisq(1 - 0.05, df = 1)
[1] 3.841459
Finally, noting that our test statistic is greater than the critical value, we reject \(H_0\).
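Equivalently, we could compute the p-value for our observed test statistic and compare it to \(\alpha = 0.05\); this is just a different way of packaging the same comparison.

# p-value for the observed Wald statistic; rejecting when the p-value falls
# below alpha = 0.05 is equivalent to comparing the statistic to the
# critical value above (here, the p-value is well below 0.05)
1 - pchisq(10.24, df = 1)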
Problem 2: Suppose we wanted to use a different significance level \(\alpha\). How would the procedure in Problem 1 change if we let \(\alpha = 0.001\)? How would the procedure in Problem 1 change if we let \(\alpha = 0.1\)?
Changing the significance level changes the critical value, and may change whether or not we reject \(H_0\), depending on the difference between our critical value and the test statistic. We can calculate what the critical values would be if we let \(\alpha = 0.001\) and \(\alpha = 0.1\), again in R:
# alpha = 0.001
qchisq(1 - 0.001, df = 1)
[1] 10.82757
# alpha = 0.1
qchisq(1 - 0.1, df = 1)
[1] 2.705543
Note that when \(\alpha = 0.1\), we still reject \(H_0\). This should make intuitive sense, since increasing \(\alpha\) can only enlarge our rejection region. However, when \(\alpha = 0.001\), we would fail to reject \(H_0\), as our test statistic is not “more extreme” (greater) than the critical value.
Problem 3: Suppose we have a random sample \(X_1, \dots, X_n \sim Bernoulli(p)\), and we want to test the hypotheses \(H_0:p = 0.5\), \(H_1:p \neq 0.5\). Suppose we calculate an estimator for \(p\) as \(\hat{p} = \frac{1}{n} \sum_{i = 1}^n X_i\). Derive a Wald test statistic for this hypothesis testing scenario (simplifying as much as you can).
Recall that \(\hat{p}\) as defined in the problem set-up is the MLE for \(p\). Then the Wald test statistic can be written as
\[ \lambda_W = \left( \frac{\hat{p} - p_0}{se(\hat{p})}\right)^2. \]
We can simplify a little further by calculating \(se(\hat{p})\) and plugging in \(p_0\). Recall from the CLT (and Slutsky) that we have
\[ \left( \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}} \right) \overset{d}{\to} N(0,1) \]
Then the standard error of \(\hat{p}\) is given by \(\sqrt{\hat{p}(1 - \hat{p})/n}\), and our Wald test statistic simplifies to
\[ \lambda_W = \left( \frac{\hat{p} - 0.5}{\sqrt{\hat{p}(1 - \hat{p})/n}}\right)^2. \]
(Note that this is as “simplified” as we can get without knowing \(\hat{p}\) or \(n\))
Problem 4: Derive a LRT statistic for the hypothesis testing scenario described in Problem 3 (simplifying as much as you can).
The LRT statistic is given by
\[ \lambda_{LRT} = -2 \log\left(\frac{\underset{p = p_0}{\text{sup}} \hspace{1mm} L(p)}{\underset{p \in \Theta}{\text{sup}} \hspace{1mm} L(p)} \right)= -2 \log\left(\frac{ L(0.5)} {L(\hat{p}_{MLE})} \right) \]
The likelihood for our observations can be written as
\[ L(p) = \prod_{i = 1}^n p^{x_i} (1 - p)^{(1 - x_i)} \]
And so our LRT statistic simplifies to
\[\begin{align*} \lambda_{LRT} & = -2 \log\left(\frac{ L(0.5)} {L(\hat{p})} \right) \\ & = -2 \left[\log L(0.5) - \log L(\hat{p}) \right] \\ & = -2 \left[ \log(0.5) \sum_{i = 1}^n X_i + \log(1 - 0.5)\sum_{i = 1}^n(1 - X_i) - \log(\hat{p}) \sum_{i = 1}^n X_i - \log(1 - \hat{p})\sum_{i = 1}^n(1 - X_i)\right] \\ & = -2 \left[ \log(0.5) \left( \sum_{i = 1}^n X_i + \sum_{i = 1}^n(1 - X_i)\right) - \log(\hat{p}) \sum_{i = 1}^n X_i - \log(1 - \hat{p})\sum_{i = 1}^n(1 - X_i)\right] \\ & = -2 \left[ n\log(0.5) - \log(\hat{p}) \sum_{i = 1}^n X_i - \log(1 - \hat{p})\sum_{i = 1}^n(1 - X_i)\right] \\ & = -2 \left[ n\log(0.5) - \log(\hat{p}) n \overline{X} - \log(1 - \hat{p}) (n - n \overline{X})\right] \\ & = -2 n \left[ \log(0.5) - \log(\hat{p}) \hat{p} - \log(1 - \hat{p}) (1 - \hat{p})\right] \end{align*}\]
(Note that this is as “simplified” as we can get without knowing \(\hat{p}\) or \(n\))
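If you’d like to double-check the algebra above, a quick numerical sketch (with a made-up 0/1 sample) comparing the simplified expression to the LRT statistic computed directly from the Bernoulli log likelihood should return the same number both ways.

# Numerical sanity check of the simplification above (made-up sample)
x <- rep(c(1, 0), times = c(15, 35))   # any 0/1 sample would work here
n <- length(x)
p_hat <- mean(x)

loglik <- function(p) sum(dbinom(x, size = 1, prob = p, log = TRUE))

-2 * (loglik(0.5) - loglik(p_hat))                                       # direct
-2 * n * (log(0.5) - log(p_hat) * p_hat - log(1 - p_hat) * (1 - p_hat))  # simplified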
Problem 5: Derive a score test statistic for the hypothesis testing scenario described in Problem 3 (simplifying as much as you can).
The score test statistic is given by
\[ \lambda_S = \frac{\left( \frac{\partial}{\partial p_0} \log L(p_0 \mid x) \right)^2}{I(p_0)}. \]
We can simplify by deriving the score and information matrix, and then plugging in \(p_0 = 0.5\). We have,
\[\begin{align*} \frac{\partial}{\partial p_0} \log L(p_0 \mid x) & = \frac{\partial}{\partial p_0} \left[ \log(p_0) \sum_{i = 1}^n X_i + \log(1 - p_0) \sum_{i = 1}^n (1 - X_i) \right] \\ & = \frac{\sum_{i = 1}^n X_i}{p_0} - \frac{n - \sum_{i = 1}^n X_i}{1 - p_0} \\ \left( \frac{\partial}{\partial p_0} \log L(p_0 \mid x) \right)^2 & = \left(\frac{\sum_{i = 1}^n X_i}{p_0} - \frac{n - \sum_{i = 1}^n X_i}{1 - p_0} \right)^2 \end{align*}\]
and plugging in \(p_0 = 0.5\), we have,
\[ \left( \frac{\partial}{\partial p_0} \log L(p_0 \mid x)\right)^2 = \left(\frac{\sum_{i = 1}^n X_i}{0.5} - \frac{n - \sum_{i = 1}^n X_i}{1 - 0.5} \right)^2 = \left( \frac{-n + 2\sum_{i = 1}^n X_i}{0.5}\right)^2 = \left( -2n + 4\sum_{i = 1}^n X_i\right)^2. \]
The information matrix is given by \(-E\left[ \frac{\partial^2}{\partial p_0^2} \log L(p_0 \mid x)\right]\). Piecing this together,
\[\begin{align*} \frac{\partial^2}{\partial p_0^2} \log L(p_0 \mid x) & = \frac{\partial}{\partial p_0} \left[ \frac{\sum_{i = 1}^n X_i}{p_0} - \frac{n - \sum_{i = 1}^n X_i}{1 - p_0} \right] \\ & = \frac{-\sum_{i = 1}^n X_i}{p_0^2} - \frac{n - \sum_{i = 1}^n X_i}{(1 - p_0)^2} \end{align*}\]
And to get \(I(p_0)\), we take the negative expectation of the above quantity under the null hypothesis (that is, where \(E[X] = p_0\)) to obtain
\[\begin{align*} I(p_0) & = -E \left[ \frac{-\sum_{i = 1}^n X_i}{p_0^2} - \frac{n - \sum_{i = 1}^n X_i}{(1 - p_0)^2} \right] \\ & = \frac{1}{p_0^2} \sum_{i = 1}^n E[X_i] + \frac{1}{(1 - p_0)^2} \left( n - \sum_{i = 1}^n E[X_i]\right) \\ & = \frac{1}{p_0^2} \sum_{i = 1}^n p_0 + \frac{1}{(1 - p_0)^2} \left( n - \sum_{i = 1}^n p_0\right) \\ & = \frac{n}{p_0} + \frac{n}{(1 - p_0)^2} \left( 1 - p_0\right) \\ & = \frac{n}{p_0} + \frac{n}{(1 - p_0)} \end{align*}\]
And plugging in \(p_0 = 0.5\) we have
\[ I(0.5) = \frac{n}{0.5} + \frac{n}{(1 - 0.5)} = 2n + 2n = 4n. \]
Then, finally, the score test statistic (simplified as much as possible) is given by
\[\begin{align*} \lambda_S & = \frac{\left( \frac{\partial}{\partial p_0} \log L(p_0 \mid x) \right)^2}{I(p_0)} \\ & = \frac{\left( -2n + 4\sum_{i = 1}^n X_i\right)^2}{4n} \\ & = \frac{\left( -2n + 4n \hat{p}\right)^2}{4n} \\ & = \frac{4n^2\left( -1 + 2 \hat{p}\right)^2}{4n} \\ & = n\left( -1 + 2 \hat{p}\right)^2 \end{align*}\]
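One way to read this final expression (an observation, not something we derived above): since \(n(2\hat{p} - 1)^2 = (\hat{p} - 0.5)^2 / \left(0.5(1 - 0.5)/n\right)\), the score statistic looks just like the Wald statistic from Problem 3, except that the standard error is computed under the null value \(p_0 = 0.5\) rather than estimated with \(\hat{p}\). A two-line numerical check, with made-up values of \(\hat{p}\) and \(n\):

# Numerical check (made-up values): the simplified score statistic equals a
# Wald-style statistic that uses the null standard error sqrt(p0 (1 - p0) / n)
p_hat <- 0.4   # assumed value, for illustration only
n <- 100       # assumed sample size
p0 <- 0.5

n * (-1 + 2 * p_hat)^2                       # simplified score statistic
((p_hat - p0) / sqrt(p0 * (1 - p0) / n))^2   # same quantity, rewritten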
Problem 6: For each of Problems 3, 4, and 5, calculate the p-values from each test when \(\hat{p} = 0.4\) and \(n = 300\).
We’ll again use R, this time to compute the test statistics and their p-values, noting that in each case, the test statistic follows a \(\chi^2_1\) distribution asymptotically:
p_hat <- 0.4
n <- 300

# Wald test statistic
lambda_w <- ((p_hat - 0.5)/(sqrt(p_hat * (1 - p_hat) / n)))^2

# LRT statistic
lambda_lrt <- -2 * n * (log(0.5) - log(p_hat) * p_hat - log(1 - p_hat) * (1 - p_hat))

# Score test statistic
lambda_s <- n * (-1 + 2 * p_hat)^2

# Compare statistics
lambda_w
[1] 12.5
lambda_lrt
[1] 12.08131
lambda_s
[1] 12
# Calculate p-values
# Recall: probability that we observe something *as or more extreme*
1 - pchisq(lambda_w, df = 1)
[1] 0.000406952
1 - pchisq(lambda_lrt, df = 1)
[1] 0.0005092985
1 - pchisq(lambda_s, df = 1)
[1] 0.0005320055
Things to note:
When \(n\) is large (300, in this case), the three classical test statistics are approximately equal! This makes sense, as they all converge in distribution to the same random variable, asymptotically.
P-values are the probability that we would observe something as or more extreme than what we actually did observe, under the null hypothesis. In R, we can use the “p”-prefixed distribution functions (for example, pchisq), which return the CDF of a given distribution, to calculate this.
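As a brief aside, the same upper-tail probabilities can be requested from pchisq directly (without the 1 - ... step) using its lower.tail argument; for example, for the Wald statistic computed above:

# Equivalent to 1 - pchisq(lambda_w, df = 1): ask pchisq for the upper tail
pchisq(lambda_w, df = 1, lower.tail = FALSE)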
Problem 7: Repeat Problem 6 but with \(\hat{p} = 0.4\) and \(n = 95\). If your significance threshold were \(\alpha = 0.05\), would your conclusion to the hypothesis test be the same regardless of which test statistic you chose?
To answer this question, we can again calculate p-values, and compare them to 0.05 (note that we could have also calculated a critical value, and compared our test statistics to the critical value, as these are equivalent).
p_hat <- 0.4
n <- 95

# Wald test statistic
lambda_w <- ((p_hat - 0.5)/(sqrt(p_hat * (1 - p_hat) / n)))^2

# LRT statistic
lambda_lrt <- -2 * n * (log(0.5) - log(p_hat) * p_hat - log(1 - p_hat) * (1 - p_hat))

# Score test statistic
lambda_s <- n * (-1 + 2 * p_hat)^2

# Compare statistics
lambda_w
[1] 3.958333
lambda_lrt
[1] 3.825748
lambda_s
[1] 3.8
# Calculate p-values
# Recall: probability that we observe something *as or more extreme*
1 - pchisq(lambda_w, df = 1)
[1] 0.04663986
1 - pchisq(lambda_lrt, df = 1)
[1] 0.05047083
1 - pchisq(lambda_s, df = 1)
[1] 0.05125258
In this case, we would reject \(H_0\) using the Wald test statistic, but fail to reject using the LRT statistic and the score test statistic, since the only p-value below our significance threshold was the one calculated from the Wald test statistic. Finite-sample distributions of the three classical test statistics are generally unknown; they have only been shown to be equivalent asymptotically, and they can therefore provide different answers to hypothesis tests when sample sizes are relatively small.
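To make that last point concrete, here is a small simulation sketch (assuming the data really are \(Bernoulli(0.5)\), with \(n = 95\) as in this problem): we estimate how often each test rejects at \(\alpha = 0.05\) when the null hypothesis is true. All three Type I error rates should be near 0.05, but they need not be identical in finite samples.

# Simulation sketch: finite-sample Type I error rates of the three tests when
# the data truly are Bernoulli(0.5) and n = 95 (assumed setting)
set.seed(451)
B <- 5000
n <- 95
crit <- qchisq(0.95, df = 1)

test_stats <- replicate(B, {
  x <- rbinom(n, size = 1, prob = 0.5)
  p_hat <- mean(x)
  c(wald  = ((p_hat - 0.5) / sqrt(p_hat * (1 - p_hat) / n))^2,
    lrt   = -2 * n * (log(0.5) - log(p_hat) * p_hat - log(1 - p_hat) * (1 - p_hat)),
    score = n * (-1 + 2 * p_hat)^2)
})

rowMeans(test_stats > crit)   # estimated Type I error for each test; near 0.05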