Lab 5: Exploring the impact of correlation
STAT 494: Statistical Genetics
Getting Started
- Access the source code for this document (click the </> Code button above)
- Copy-paste into a new QMD
- Add your name as the author in the YAML header, above
- Load the packages that we’ll need for this lab (and install them first if needed):

library(tidyverse)
library(broom)

- Note the continued use of customized Quarto callouts, like the one below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Part 1: Independent Observations
Exercise 1
Suppose we fit a linear regression model exploring the relationship between a quantitative trait \(y\) and a set of predictors \(x_1, x_2, ..., x_p\). Derive the variance of the least squares estimator from Lab 1, \[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},\] where \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn}\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}\] in the context of independent observations.
You may make the following assumptions:
- \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
- \(E[\boldsymbol\epsilon] = \mathbf{0}\)
- \(\text{Var}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_{n \times n}\), where \(\mathbf{I}_{n \times n}\) is the \(n\)-by-\(n\) identity matrix
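If you’re not sure where to start, here’s a sketch of the main tool (a hint, not a full derivation): for any constant matrix \(\mathbf{A}\) and random vector \(\mathbf{z}\), \[\text{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A} \, \text{Var}(\mathbf{z}) \, \mathbf{A}^T.\] Note that \(\hat{\boldsymbol\beta} = \mathbf{A}\mathbf{y}\) with \(\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) (we treat \(\mathbf{X}\) as fixed), and the assumptions above tell you \(\text{Var}(\mathbf{y})\).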
Exercise 2
Suppose we find ourselves in a simplified scenario with just a single predictor \(x_1\). We now have the following: \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \ \mathbf{X} = \begin{bmatrix} 1 & x_{11} \\ 1 & x_{12} \\ \vdots & \vdots \\ 1 & x_{1n} \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}, \text{ and } \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\] Using your work from Exercise 1, determine the standard error of the slope estimator \(\hat\beta_1\).
Suggestions:
- plug in the design matrix, \(\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}\), to your answer from Exercise 1
- simplify (you will need to remember how to invert a 2-by-2 matrix; see the reminder below!)
- you will end up with a 2-by-2 variance-covariance matrix for \(\hat\beta_0\) and \(\hat\beta_1\) – the standard error of \(\hat\beta_1\) is the square root of the second diagonal element
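In case it’s been a while, here’s the 2-by-2 inversion formula that the second suggestion refers to: \[\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \text{ provided } ad - bc \neq 0.\]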
Exercise 3
Let’s check our work by comparing these derivations to some lm output! We’ll simulate a small “toy” dataset to use as an example.
# define parameters
n <- 100
b0 <- 0
b1 <- 1

# generate data
set.seed(494)
x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
e <- rnorm(n)
y <- b0 + b1 * x + e

# combine into data frame
dat <- cbind(y, x) %>%
  as.data.frame()

# check it out
head(dat, 10)
y x
1 0.47813179 0
2 -2.97211789 0
3 1.54235459 0
4 1.03226746 0
5 1.28625948 0
6 -0.02388712 0
7 -0.64840383 0
8 0.32668469 0
9 1.95681007 2
10 1.89940920 2
Read the code above carefully. Make sure you understand what each line of code is doing. Explain below.
Exercise 4
Now, fit a linear regression model to the toy dataset we created in Exercise 3. Our outcome variable is y and our predictor is x. Print out a table of coefficient estimates, standard errors, test statistics, and p-values for the fitted model.
# fit model
# table of coefficient estimates, SE, etc.
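If you get stuck, here’s one possible approach (a sketch, not the only solution; it uses broom’s tidy() and a model object that we’ve chosen to name mod):

# fit model
mod <- lm(y ~ x, data = dat)

# table of coefficient estimates, SEs, test statistics, and p-values
tidy(mod)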
How close are the coefficient estimates to their true values? (Remember: we simulated these data so we know the truth!)
Exercise 5
There’s one more piece of information we’ll need before we can compare our standard error derivation from Exercise 2 to the lm output.
Notice that your answer to Exercise 2 involves a parameter, \(\sigma^2\), that would typically be unknown. Thus, in order to estimate the standard error of \(\hat\beta_1\), we need to plug in an estimate for \(\sigma^2\) as well. Since \(\sigma^2\) represents the variance of our error terms, a reasonable way to estimate it is to look at the sample variance of our residuals \(\hat{\boldsymbol\epsilon} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}\). We’ll use the following estimator¹: \[\hat\sigma^2 = s^2 = \frac{1}{n - (p+1)} \hat{\boldsymbol\epsilon}^T \hat{\boldsymbol\epsilon} = \frac{1}{n - (p+1)}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}),\] where the denominator \(n - (p+1)\) accounts for the \(p + 1\) coefficients \(\beta_0, \beta_1, \dots, \beta_p\) that we estimate.
- If we just have a single predictor \(x_1\), what does \(s^2\) look like? Simplify here.
- Then, calculate \(s^2\) in our toy data. How does it compare to the true value of \(\sigma^2\)? (Remember: we simulated these data so we know the truth!)
# your code here
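If helpful, here’s one way the calculation might look in R (a sketch; it assumes your fitted model is named mod, as in the Exercise 4 sketch, and uses the fact that with a single predictor the denominator simplifies to \(n - 2\)):

# residuals from the fitted model
ehat <- resid(mod)

# estimate sigma^2: residual sum of squares divided by n - 2
s2 <- sum(ehat^2) / (length(ehat) - 2)
s2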
Exercise 6
Now we can put the pieces together!
Based on your answers to Exercises 2 and 5, what is your estimate for the standard error of \(\hat\beta_1\) in our toy example? Calculate below.
# your code here
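For example, plugging \(s^2\) into your formula from Exercise 2 might look like this (a sketch; x comes from the toy data and s2 from the Exercise 5 sketch):

# SE of the slope: sqrt of s^2 divided by the sum of squared deviations of x
se_b1 <- sqrt(s2 / sum((x - mean(x))^2))
se_b1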
How does your answer compare to what you got from lm? (Hint: They should be the same. If not, revisit your work above!)
Exercise 7
How can we assess if this standard error estimate is correct? Recall what standard errors represent: how much our estimate might vary if we had a different sample of data (i.e., the variability of our estimate across repeated samples). Simulations are, yet again, a useful tool for evaluating standard error estimates. The idea is this:
- generate a dataset
- fit a linear regression model to that dataset
- record the estimate of interest
- repeat steps 1–3 many times
- look at the variability (standard deviation) of those estimates across the many datasets — this is the standard error of that estimator!
Let’s try this out, focusing on the slope coefficient \(\hat\beta_1\). Fill in the blanks in the code below.
# create a function to do one simulation replicate
do_one <- function(n = 100, b0 = 0, b1 = 1){
  # step 1: generate data
  x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
  e <- rnorm(n)
  y <- ___
  dat <- cbind(y, x) %>%
    as.data.frame()

  # step 2: fit model
  mod <- lm(___)

  # step 3: record slope estimate
  betahat <- ___
  return(betahat)
}
# repeat 1000 times
set.seed(494)
betahats <- replicate(n = ___, do_one())
Visualize your results, calculate the variability of your estimates, and compare this to the standard error you derived above. What do you see?
# your code here
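If you’d like a starting point, here’s one possible approach (a sketch; it assumes you’ve filled in the blanks above so that betahats exists):

# visualize the distribution of slope estimates across simulated datasets
ggplot(data.frame(betahats), aes(x = betahats)) +
  geom_histogram(bins = 30)

# standard deviation of the estimates = simulation-based standard error
sd(betahats)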
Footnotes
1. \(s^2\) isn’t exactly the sample variance of our residuals. It’s been modified slightly (multiplied by a constant) to make it an unbiased estimator of \(\sigma^2\). For those of you who have taken MATH/STAT 355: Statistical Theory, you’ve seen something similar in the derivation of an unbiased estimator for the variance of a single random variable, \(\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).