Getting Started
Access the source code for this document (click the </> Code button above)
Copy-paste into a new QMD
Add your name as the author in the YAML header, above
Load the packages that we’ll need for this lab. (And install them first if needed.)
library(tidyverse)
library(broom)
Note the continued usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Your Answer
Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.
Part 1: Independent Observations
Exercise 1
Suppose we fit a linear regression model exploring the relationship between a quantitative trait \(y\) and a set of predictors \(x_1, x_2, ..., x_p\). Derive the variance of the least squares estimator we derived in Lab 1, \[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},\] where \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn}\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}\] in the context of independent observations. You may make the following assumptions:
1. \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
2. \(E[\boldsymbol\epsilon] = \mathbf{0}\)
3. \(\text{Var}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_{n \times n}\), where \(\mathbf{I}_{n \times n}\) is the \(n\)-by-\(n\) identity matrix
Useful Facts
We’ll use the following useful results about vectors/matrices in our derivations:
1. \(\text{Var}(\mathbf{Ay}) = \mathbf{A}\text{Var}(\mathbf{y})\mathbf{A}^T\)
2. \((\mathbf{AB})^T = \mathbf{B}^T \mathbf{A}^T\)
3. \((\mathbf{A}^T)^T = \mathbf{A}\)
4. \((\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1}\)
5. \(\text{Var}(\mathbf{A}) = \textbf{0}\) if \(\textbf{A}\) is constant
6. \(\text{Var}(\mathbf{x}+ \mathbf{y}) = \text{Var}(\mathbf{x})+\text{Var}(\mathbf{y})\) if \(\mathbf{x}, \mathbf{y}\) are independent
7. \(\text{Var}(\mathbf{a}+ \mathbf{y}) = \text{Var}(\mathbf{y})\) if \(\mathbf{a}\) is constant
(Convince/remind yourself that all of these results are in fact true!)
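If you'd like a quick empirical check of one of these (a sketch, not part of the lab), you can verify Useful Fact 1 by simulation: draw many random vectors \(\mathbf{y}\) with a known variance, transform them by a constant matrix \(\mathbf{A}\), and compare the empirical variance of \(\mathbf{Ay}\) to \(\mathbf{A}\text{Var}(\mathbf{y})\mathbf{A}^T\). The matrix and variances below are arbitrary choices.
# a sketch (not required for the lab): simulation check of Useful Fact 1
set.seed(1)
A <- matrix(c(1, 2, 0, -1, 3, 1), nrow = 2)           # an arbitrary constant 2-by-3 matrix
V <- diag(c(1, 4, 9))                                 # Var(y) for 3 independent components
ysim <- replicate(1e5, rnorm(3, sd = sqrt(diag(V))))  # 3 x 100000 draws of y
cov(t(A %*% ysim))                                    # empirical Var(Ay)
A %*% V %*% t(A)                                      # theoretical A Var(y) A^T -- should be close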
Answer to Exercise 1
The variance of the least squares estimator is:
\[
\begin{align*}
\text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \right), \text{ plugging in the form of } \hat{\boldsymbol\beta} \text{ we derived in Lab 1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \left((\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \right)^T, \text{ using Useful Fact 1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \left(\mathbf{X}^T \right)^T \left( (\mathbf{X}^T\mathbf{X})^{-1} \right)^T, \text{ using Useful Fact 2} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \left( \mathbf{X} \right) \left( (\mathbf{X}^T\mathbf{X})^{T} \right)^{-1}, \text{ using Useful Facts 3 and 4} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \mathbf{X} \left( (\mathbf{X})^T\left(\mathbf{X}^T \right)^T \right)^{-1}, \text{ using Useful Fact 2} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Useful Fact 3} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Assumption 1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\boldsymbol\epsilon) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Useful Fact 7} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \left(\sigma^2 \mathbf{I} \right)\mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Assumption 3} \\
& = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{I} \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ since } \sigma^2 \text{ is a scalar} \\
& = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ by definition of } \textbf{I} \\
& = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}
\end{align*}
\]
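As a quick sanity check on this result (a sketch, not part of the lab; the simulated data below are arbitrary), the variance-covariance matrix that lm reports via vcov() should equal \(s^2 (\mathbf{X}^T\mathbf{X})^{-1}\), where \(s^2\) is the estimate of \(\sigma^2\) introduced later in Exercise 5:
# a sketch: compare s^2 (X^T X)^{-1} to the variance-covariance matrix reported by lm
set.seed(1)
chk <- data.frame(x1 = rnorm(50), x2 = rbinom(50, 2, 0.2))  # arbitrary toy data
chk$y <- 1 * chk$x1 + 1 * chk$x2 + rnorm(50)
fit_chk <- lm(y ~ x1 + x2, data = chk)
X_chk <- model.matrix(fit_chk)                   # design matrix (includes intercept column)
sigma(fit_chk)^2 * solve(t(X_chk) %*% X_chk)     # s^2 (X^T X)^{-1}
vcov(fit_chk)                                    # lm's estimate -- should match (up to numerical precision)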
Exercise 2
Suppose we find ourselves in a simplified scenario with just a single predictor \(x_1\). We now have the following: \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \ \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix}, \text{ and } \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\] Using your work from Exercise 1, determine the standard error of the slope estimator \(\hat\beta_1\).
Suggestions:
plug in the design matrix, \(\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix}\), to your answer from Exercise 1
simplify (you will need to remember how to invert a 2-by-2 matrix!)
you will end up with a 2-by-2 variance-covariance matrix for \(\hat\beta_0\) and \(\hat\beta_1\) – the standard error of \(\hat\beta_1\) is the square root of the second diagonal element
Answer to Exercise 2
The variance-covariance matrix for \(\hat{\boldsymbol\beta} = \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix}\) simplifies as follows:
\[
\begin{align*}
\text{Var}(\hat{\boldsymbol\beta}) & = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}, \text{ by Exercise 1} \\
& = \sigma^2 \left(\begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix}^T \begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix} \right)^{-1}, \text{ plugging in our definition of } \mathbf{X} \\
& = \sigma^2 \left(\begin{bmatrix} \mathbf{1}^T \\ \mathbf{x}^T \end{bmatrix} \begin{bmatrix} \mathbf{1} & \mathbf{x} \end{bmatrix} \right)^{-1} \\
& = \sigma^2 \begin{bmatrix} \mathbf{1}^T\mathbf{1} & \mathbf{1}^T\mathbf{x} \\ \mathbf{x}^T \mathbf{1} & \mathbf{x}^T \mathbf{x} \end{bmatrix}^{-1} \\
& = \sigma^2 \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}^{-1} \\
& = \frac{\sigma^2}{n\sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2 } \begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix}
\end{align*}
\]
Thus, taking the square root of the second diagonal element, the standard error of \(\hat\beta_1\) is \[\sqrt{\frac{n\sigma^2}{n\sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2 }}\]
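A side note (not required by the lab): since \(n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 = n\sum_{i=1}^n (x_i - \bar{x})^2\), this standard error can equivalently be written as \(\sigma / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\), the familiar simple linear regression form. A quick numerical check of that identity, using hypothetical values:
# a sketch: the two forms of the denominator agree for any x
x_demo <- c(0, 1, 2, 0, 0, 1)                  # hypothetical genotype values
n_demo <- length(x_demo)
n_demo * sum(x_demo^2) - sum(x_demo)^2         # form from the derivation above
n_demo * sum((x_demo - mean(x_demo))^2)        # centered form -- same value (20 here)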
Exercise 3
Let’s check our work by comparing these derivations to some lm output! We’ll simulate a small “toy” dataset to use as an example.
# define parameters
n <- 100
b0 <- 0
b1 <- 1

# generate data
set.seed(494)
x <- sample(0:2, size = n, replace = T, prob = c(.64, .32, .04))
e <- rnorm(n)
y <- b0 + b1 * x + e

# combine into data frame
dat <- cbind(y, x) %>% as.data.frame()

# check it out
head(dat, 10)
Read the code above carefully. Make sure you understand what each line of code is doing. Explain below.
Answer to Exercise 3
We’re simulating a dataset with \(n = 100\) observations and two variables: x and y.
The predictor x is generated to mimic a SNP with a minor allele frequency of 20% (hence the probability of carrying 2 minor alleles is \(0.2 \times 0.2 = 0.04\), the probability of carrying 1 minor allele is \(0.2 \times (1 - 0.2) + (1 - 0.2) \times 0.2 = 0.32\), etc.).
The outcome variable y is generated according to the linear model \(y_i = 0 + 1 \times x_i + e_i\), where the “noise” \(e_i\) is randomly generated from a standard normal distribution (mean = 0, sd = 1).
Translating this to the notation we’ve been using above, these are the true parameter values in this setting:
\(\boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\)
\(\sigma^2 = \text{Var}(e_i) = 1\)
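As an optional check (a sketch, not part of the lab), the probabilities passed to sample() are the Hardy-Weinberg genotype proportions for a minor allele frequency of 0.2, and the empirical genotype frequencies in the simulated data should land close to them:
# a sketch: Hardy-Weinberg genotype proportions for MAF = 0.2
maf <- 0.2
c((1 - maf)^2, 2 * maf * (1 - maf), maf^2)   # 0.64, 0.32, 0.04 -- matches prob = c(.64, .32, .04)
table(x) / n                                 # empirical genotype frequencies in our toy data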
Exercise 4
Now, fit a linear regression model to the toy dataset we created in Exercise 3. Our outcome variable is y and predictor is x. Print out a table of coefficient estimates, standard errors, test statistics, and p-values for the fitted model.
# fit model
mod <- lm(y ~ x, data = dat)

# table of coefficient estimates, SE, etc.
tidy(mod)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0643 0.106 0.605 0.546
2 x 0.930 0.158 5.87 0.0000000585
How close are the coefficient estimates to their true values? (Remember: we simulated these data so we know the truth!)
Answer to Exercise 4
Very close! Our intercept estimate \(\hat\beta_0\) = 0.0643231 is quite close to the truth (\(\beta_0 = 0\)). Our slope estimate \(\hat\beta_1\) = 0.9303532 is also pretty close to the truth (\(\beta_1 = 1\)).
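To take this one step further (a sketch, not required by the lab), we can also check whether the 95% confidence intervals cover the true parameter values:
# a sketch: do the 95% confidence intervals contain the true values (0 and 1)?
confint(mod)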
Exercise 5
There’s one more piece of information we’ll need before we can compare our standard error derivation from Exercise 2 to the lm output.
Notice that your answer to Exercise 2 involves a parameter, \(\sigma^2\), that would typically1 be unknown. Thus, in order to estimate the standard error of \(\hat\beta_1\), we need to plug in an estimate for \(\sigma^2\) as well. Since \(\sigma^2\) represents the variance of our error terms, a reasonable way to estimate \(\sigma^2\) is to look at the sample variance of our residuals2: \[\hat\sigma^2 = s^2 = \frac{1}{n-p} \boldsymbol\epsilon^T \boldsymbol\epsilon = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\]
a. If we just have a single predictor x, what does \(s^2\) look like? Simplify here.
Answer to Exercise 5 - Part A
\[
\begin{align*}
s^2 & = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
& = \frac{1}{n-p}\left(\mathbf{y}^T \mathbf{y} - \mathbf{y}^T\mathbf{X}\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \mathbf{y} + \hat{\boldsymbol{\beta}}^T \mathbf{X}^T \mathbf{X}\hat{\boldsymbol{\beta}} \right)
\end{align*}
\]
We could stop here, but plugging in \(\hat{\boldsymbol\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\) and simplifying further gives the compact form \[s^2 = \frac{1}{n-p} \mathbf{y}^T \left(\mathbf{I} - \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \right)\mathbf{y}.\]
b. Then, calculate \(s^2\) in our toy data. How does it compare to the true value of \(\sigma^2\)? (Remember: we simulated these data so we know the truth!)
# define our constants (n, p)
# n was already defined above (nrow(dat); length(y))
p <- 2

# set up y, X, beta-hat as vectors
yvec <- as.matrix(y, ncol = 1)
Xmat <- cbind(1, as.matrix(x, ncol = 1))
#bhat <- as.matrix(coef(mod), ncol = 1) # using lm output from earlier
xTx <- t(Xmat) %*% Xmat
bhat <- solve(xTx) %*% t(Xmat) %*% yvec # by hand

# calculate s^2
s2 <- (1 / (n - p)) * (                       # \frac{1}{n-p} * (...
  t(yvec) %*% yvec -                          # y^T y
  t(yvec) %*% Xmat %*% bhat -                 # y^T X \hat{B}
  t(bhat) %*% t(Xmat) %*% yvec +              # \hat{B}^T X^T y
  t(bhat) %*% t(Xmat) %*% Xmat %*% bhat )     # \hat{B}^T X^T X \hat{B}
print(s2)

# alternate approach: \frac{1}{n-p} y^T (I - X(X^T X)^{-1} X^T) y
s2_v2 <- (1 / (n - p)) * (t(yvec) %*% (diag(n) - Xmat %*% solve(xTx) %*% t(Xmat)) %*% yvec)
print(s2_v2)
Answer to Exercise 5 - Part B
Recall from above that the true \(\sigma^2\) in this example is 1. Our estimate \(s^2 \approx 0.8\) is reasonably close!
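For comparison (a sketch, not required by the lab): lm has already computed this quantity as the “residual standard error,” so the following shortcuts should agree with s2 above.
# a sketch: equivalent shortcuts for s^2 using the fitted lm object
sum(resid(mod)^2) / (n - p)   # residual sum of squares divided by (n - p)
sigma(mod)^2                  # squared residual standard error reported by lm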
Exercise 6
Now we can put the pieces together! Based on your answers to Exercises 2 and 5, what is your estimate for the standard error of \(\hat\beta_1\) in our toy example? Calculate below.
# our standard error estimate
sqrt(n * s2 / (n * sum(x^2) - (sum(x))^2))

# compare to lm
tidy(mod)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0643 0.106 0.605 0.546
2 x 0.930 0.158 5.87 0.0000000585
How does your answer compare to what you got from lm? (Hint: They should be the same. If not, revisit your work above!)
Answer to Exercise 6
YAY, they’re the same!! 🎉
Exercise 7
How can we assess if this standard error estimate is correct? Recall what standard errors represent: how much our estimate might vary if we had a different sample of data (i.e., the variability of our estimate across repeated samples). Simulations are, yet again, a useful tool for evaluating standard error estimates. The idea is this:
1. generate a dataset
2. fit a linear regression model to that dataset
3. record the estimate of interest
4. repeat steps 1–3 many times
5. look at the variability (standard deviation) of those estimates across the many datasets — this is the standard error of that estimator!
Let’s try this out, focusing on the slope coefficient \(\hat\beta_1\). Fill in the blanks in the code below.
# create a function to do one simulation replicate
do_one <- function(n = 100, b0 = 0, b1 = 1){
  # step 1: generate data
  x <- sample(0:2, size = n, replace = T, prob = c(.64, .32, .04))
  e <- rnorm(n)
  y <- b0 + b1 * x + e
  dat <- cbind(y, x) %>% as.data.frame()
  # step 2: fit model
  mod <- lm(y ~ x, data = dat)
  # step 3: record slope estimate
  betahat <- coef(mod)[2]
  return(betahat)
}

# repeat 1000 times
set.seed(494)
betahats <- replicate(n = 1000, do_one())
Visualize your results, calculate the variability of your estimates, and compare this to the standard error you derived above. What do you see?
# histogram of coefficient estimates
hist(betahats, xlab = expression(hat(beta)[1]), main = 'Histogram of coefficient estimates')

# add dashed lines to indicate standard deviation
abline(v = 1 + sd(betahats), lty = 2, col = "red")
abline(v = 1 - sd(betahats), lty = 2, col = "red")

# calculate std deviation of estimates ("true" std error)
sd(betahats)
[1] 0.181747
Answer to Exercise 7
Our estimates \(\hat\beta_1\) are centered around the truth (\(\beta_1 = 1\)) and have a standard deviation of 0.181747. Notice how close this is to our estimate of the standard error of \(\hat\beta_1\) from earlier. It’s even closer if we plug in the true value of \(\sigma^2 = 1\) instead of our estimate \(s^2\):
sqrt(n * 1 / (n * sum(x^2) - (sum(x))^2))   # replace s^2 with the true sigma^2
The code above is randomly generating a new x for each simulation replicate. However, our derivations above assume that x is fixed. The code below is more appropriate.
While I was at it, I also updated the code to:
increase the sample size to 1000 (our standard error estimates will get better the larger our sample size)
keep track of not just the coefficient estimate from each replicate but also the standard error estimates — this way, we can get a sense of the typical size of our standard error estimates and compare them to the true variability of the coefficient estimates
# create a function to do one simulation replicate
do_one_v2 <- function(x, b0 = 0, b1 = 1){
  # step 1: generate data
  n <- length(x)
  e <- rnorm(n)
  y <- b0 + b1 * x + e
  dat <- cbind(y, x) %>% as.data.frame()
  # step 2: fit model
  mod <- lm(y ~ x, data = dat)
  # step 3: record slope estimate and its standard error estimate
  betahat <- coef(mod)[2]
  sehat <- tidy(mod)$std.error[2]
  return(c(betahat, sehat))
}
# repeat 500 times, using the same x each time
set.seed(494)
x_fixed <- sample(0:2, size = 1000, replace = T, prob = c(.64, .32, .04)) # increasing sample size to 1000
betahats_v2 <- replicate(500, do_one_v2(x = x_fixed))
# visualize results
hist(betahats_v2[1,], xlab = expression(hat(beta)[1]), main = 'Histogram of coefficient estimates')

# calculate the "true" standard error
sd(betahats_v2[1,])
[1] 0.05487173
This represents the amount of variability we actually see in the coefficient estimates (\(\hat\beta_1\)) across our simulation replicates.
Let’s compare this “true” standard error to the estimates of the standard error that we get from lm:
# compare to the std err *estimates*
hist(betahats_v2[2,], xlab = expression(paste(hat(se), "(", hat(beta)[1], ")")), main = 'Histogram of standard error estimates')
summary(betahats_v2[2,])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.05241 0.05474 0.05543 0.05545 0.05622 0.05893
“Good” standard error estimates should be close to the true amount of variability in \(\hat\beta_1\). In this case, they are!
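Another way to judge the standard error estimates (a sketch, not part of the lab) is to check the coverage of nominal 95% confidence intervals across the simulation replicates; with good standard errors, roughly 95% of the intervals should contain the true slope.
# a sketch: empirical coverage of nominal 95% confidence intervals for beta1
lower <- betahats_v2[1, ] - 1.96 * betahats_v2[2, ]
upper <- betahats_v2[1, ] + 1.96 * betahats_v2[2, ]
mean(lower <= 1 & upper >= 1)   # proportion of intervals covering the true beta1 = 1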
Part 2: Correlated Observations
Exercise 8
Now, suppose our observations are not independent. In other words, they’re correlated. Our assumptions from Exercise 1 are updated as follows:
1. \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
2. \(E[\boldsymbol\epsilon] = \mathbf{0}\)
3. \(\text{Var}(\boldsymbol\epsilon) = \boldsymbol\Sigma\), for some positive semidefinite \(n\)-by-\(n\) matrix \(\boldsymbol\Sigma\) where the off-diagonal elements of \(\boldsymbol\Sigma\) are no longer required to be zero
Repeat your derivations from Exercise 1 under this updated set of assumptions.
Answer to Exercise 8
The variance of the least squares estimator is:
\[
\begin{align*}
\text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \right)\\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \left[(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \right]^T \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{y}) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Assumption 1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \text{Var}(\boldsymbol\epsilon) \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \\
& = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \boldsymbol\Sigma \mathbf{X} \left( \mathbf{X}^T \mathbf{X} \right)^{-1}, \text{ using Assumption 3}
\end{align*}
\]
This is as far as we can simplify!
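To make this concrete, here is a small numerical sketch (not part of the lab; the choice of \(\boldsymbol\Sigma\), correlation, and sample size are all illustrative) comparing the standard errors implied by this “sandwich” formula to the naive independence-based formula from Exercise 1, for observations that are correlated in pairs (e.g., siblings) and share the same genotype within each pair:
# a sketch: correct (sandwich) vs naive standard errors under pairwise correlation
set.seed(494)
n_pairs <- 50
rho <- 0.9                                    # assumed within-pair correlation
x_pair <- rep(sample(0:2, size = n_pairs, replace = T, prob = c(.64, .32, .04)), each = 2)
X <- cbind(1, x_pair)                         # design matrix (genotype shared within pairs)

# Sigma: unit variances, correlation rho within consecutive pairs, zero elsewhere
Sigma <- diag(2 * n_pairs)
for (i in seq(1, 2 * n_pairs, by = 2)) Sigma[i, i + 1] <- Sigma[i + 1, i] <- rho

XtX_inv <- solve(t(X) %*% X)
sandwich <- XtX_inv %*% t(X) %*% Sigma %*% X %*% XtX_inv   # Exercise 8 result
naive <- 1 * XtX_inv                                       # Exercise 1 result (with sigma^2 = 1)

sqrt(diag(sandwich))   # true standard errors under correlation
sqrt(diag(naive))      # what we'd report if we (wrongly) assumed independence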
Exercise 9
Compare your answer to Exercise 8 to the one you got in Exercise 1. Are they the same? If not, what implications will this have for our inference?
In your answer, please address the implications for confidence intervals, hypothesis testing, and type I and II errors.
Answer to Exercise 9
The answer to Exercise 8 is not equivalent to the result from Exercise 1. This will have multiple implications for inference.
If our observations are correlated and we proceed “as usual” — that is, using linear regression and the “traditional” standard error estimates (based on \(\widehat{\text{Var}}(\hat{\boldsymbol\beta}) = s^2 (\mathbf{X}^T \mathbf{X})^{-1}\)) that assume observations are independent — our standard error estimates will be wrong! Having incorrect standard error estimates has multiple downstream implications. In particular…
If the form of \(\boldsymbol\Sigma\) is such that our standard error estimates end up being too small (i.e., \(\widehat{\text{Std.Err}}(\hat\beta_j) < \text{Std.Err}(\hat\beta_j)\) for \(j \in \{0, ..., p\}\)):
our confidence intervals will be too narrow
our p-values will be too small
our Type I Error rate will be higher than intended (see the simulation sketch below for an illustration)
If our standard error estimates are too large (i.e., \(\widehat{\text{Std.Err}}(\hat\beta_j) > \text{Std.Err}(\hat\beta_j)\)):
our confidence intervals will be too wide
our p-values will be too large
our Type II Error rate will be higher than intended (and power will be lower)
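To see the first scenario play out, here is a simulation sketch (not part of the lab; the family structure, effect sizes, and number of replicates are all illustrative assumptions). We generate clustered data in which siblings share both their genotype and a family-level error component, fit the usual lm with a true slope of zero, and check how often we reject at the 0.05 level; with correlated observations and naive standard errors, the rejection (Type I error) rate lands well above 5%.
# a sketch: Type I error inflation when naive lm standard errors ignore family correlation
set.seed(494)
n_fam <- 50    # number of families (assumed)
n_sib <- 4     # siblings per family (assumed)

sim_pval <- function(){
  x_fam <- sample(0:2, size = n_fam, replace = T, prob = c(.64, .32, .04))
  x <- rep(x_fam, each = n_sib)          # genotype shared within each family
  u <- rep(rnorm(n_fam), each = n_sib)   # shared family effect -> correlated errors
  e <- rnorm(n_fam * n_sib, sd = 0.5)
  y <- 0 * x + u + e                     # true beta1 = 0 (the null holds)
  tidy(lm(y ~ x))$p.value[2]
}

pvals <- replicate(1000, sim_pval())
mean(pvals < 0.05)   # would be about 0.05 if the SEs were correct; here it is noticeably larger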
Going Deeper
How do we avoid the issues identified in Exercise 9? Two of the most common statistical modeling techniques to address correlation are mixed effects models and generalized estimating equations.
Want to learn more? Take STAT 452: Correlated Data with Prof. Brianna Heggeseth in Fall 2025! (Or check out her course notes here.)
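For a flavor of what the mixed effects approach looks like (a sketch, not part of this lab): with family-clustered data like the simulation above, a random intercept per family can be added using the lme4 package. The data frame dat_fam and its family_id column here are hypothetical.
# a sketch (hypothetical data): a random-intercept mixed effects model with lme4
library(lme4)
mixed_mod <- lmer(y ~ x + (1 | family_id), data = dat_fam)   # dat_fam and family_id are hypothetical
summary(mixed_mod)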
Exercise 10
What’s the point of all the work you did above? Summarize your takeaways from this lab here. In writing your response, imagine you are writing out the learning goals for this assignment (like I’ve done for you at the top of previous labs).
Answer to Exercise 10
Answers will vary, but here are my thoughts…
Lab 5 Learning Goals:
Derive the variance of the ordinary least squares (OLS) coefficient estimator \(\hat\beta_{OLS}\) under the assumption of independence
Use this result to estimate the standard error of the slope coefficient estimate in a simple (one predictor) linear regression model
Confirm theoretical derivations by comparing to R output (and thus understand what R is doing “under the hood” when it gives us standard error estimates)
Perform a simulation study to assess the performance of a standard error estimate
Revisit the variance of the ordinary least squares (OLS) coefficient estimator \(\hat\beta_{OLS}\) in the presence of correlation
Understand the impact that incorrect standard error estimates can have on downstream inference (confidence intervals, p-values, etc.)
In short:
Understand what standard errors represent and how they are estimated
Understand the impact that correlation has on statistical inference
Footnotes
Although “typically” unknown in practice, because we simulated these data we know the truth (\(\sigma^2 = 1\)). Let’s pretend we don’t know this.↩︎
\(s^2\) isn’t exactly the sample variance of our residuals. It’s been modified slightly (multiplied by a constant) to make it an unbiased estimator of \(\sigma^2\). For those of you that have taken MATH/STAT 355: Statistical Theory, you’ve seen something similar in the derivation of an unbiased estimator for the variance of a single random variable \(\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).↩︎