Lab 5: Exploring the impact of correlation

STAT 494: Statistical Genetics

Published

April 1, 2025



Getting Started

  1. Access the source code for this document (click the </> Code button above)

  2. Copy-paste into a new QMD

  3. Add your name as the author in the YAML header, above

  4. Load the packages that we’ll need for this lab. (And install them first if needed.)

library(tidyverse)
library(broom)
  5. Note the continued usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Your Answer

Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.



Part 1: Independent Observations

Exercise 1

Suppose we fit a linear regression model exploring the relationship between a quantitative trait \(y\) and a set of predictors \(x_1, x_2, ..., x_p\). Derive the variance of the least squares estimator we derived in Lab 1, \[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},\] where \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn}\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}\] in the context of independent observations.

You may make the following assumptions:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
  • \(E[\boldsymbol\epsilon] = \mathbf{0}\)
  • \(\text{Var}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_{n \times n}\), where \(\mathbf{I}_{n \times n}\) is the \(n\)-by-\(n\) identity matrix
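
As a hint (not an additional assumption, just a standard fact you may find useful): for any constant matrix \(\mathbf{A}\) and random vector \(\mathbf{z}\), \[\text{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\,\text{Var}(\mathbf{z})\,\mathbf{A}^T.\]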
Your Answer to Exercise 1

The variance of the least squares estimator is:

\[ \begin{align*} \text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( ??? \right) \\ & = ??? \\ & = ??? \end{align*} \]



Exercise 2

Suppose we find ourselves in a simplified scenario with just a single predictor \(x_1\). We now have the following: \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \ \mathbf{X} = \begin{bmatrix} 1 & x_{11} \\ 1 & x_{12} \\ \vdots & \vdots \\ 1 & x_{1n} \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}, \text{ and } \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\] Using your work from Exercise 1, determine the standard error of the slope estimator \(\hat\beta_1\).

Suggestions:

  • plug in the design matrix, \(\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}\), to your answer from Exercise 1
  • simplify (you will need to remember how to invert a 2-by-2 matrix! A reminder of the formula appears after this list.)
  • you will end up with a 2-by-2 variance-covariance matrix for \(\hat\beta_0\) and \(\hat\beta_1\) – the standard error of \(\hat\beta_1\) is the square root of the second diagonal element
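
For reference, the inverse of a 2-by-2 matrix is \[\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix},\] provided \(ad - bc \neq 0\).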
Your Answer to Exercise 2

TBD



Exercise 3

Let’s check our work by comparing these derivations to some lm output! We’ll simulate a small “toy” dataset to use as an example.

# define parameters
n <- 100
b0 <- 0
b1 <- 1

# generate data
set.seed(494)
x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
e <- rnorm(n)
y <- b0 + b1 * x + e

# combine into data frame
dat <- cbind(y, x) %>%
  as.data.frame()

# check it out
head(dat, 10)
             y x
1   0.47813179 0
2  -2.97211789 0
3   1.54235459 0
4   1.03226746 0
5   1.28625948 0
6  -0.02388712 0
7  -0.64840383 0
8   0.32668469 0
9   1.95681007 2
10  1.89940920 2

Read the code above carefully. Make sure you understand what each line of code is doing. Explain below.

Your Answer to Exercise 3

TBD



Exercise 4

Now, fit a linear regression model to the toy dataset we created in Exercise 3. Our outcome variable is y and predictor is x. Print out a table of coefficient estimates, standard errors, test statistics, and p-values for the fitted model.

# fit model
# table of coefficient estimates, SE, etc.
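
If helpful, here is a minimal sketch of one way to do this with lm() and broom::tidy(); the object name mod is just illustrative.

# one possible approach (sketch)
mod <- lm(y ~ x, data = dat)   # fit the simple linear regression of y on x
tidy(mod)                      # coefficient estimates, SEs, test statistics, p-values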

How close are the coefficient estimates to their true values? (Remember: we simulated these data so we know the truth!)

Your Answer to Exercise 4

TBD



Exercise 5

There’s one more piece of information we’ll need before we can compare our standard error derivation from Exercise 2 to the lm output.

Notice that your answer to Exercise 2 involves a parameter, \(\sigma^2\), that would typically be unknown. Thus, in order to estimate the standard error of \(\hat\beta_1\), we need to plug in an estimate of \(\sigma^2\) as well. Since \(\sigma^2\) represents the variance of our error terms, a reasonable way to estimate it is to look at the sample variance of our residuals, \(\hat{\boldsymbol\epsilon} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\). We’ll use the following estimator¹: \[\hat\sigma^2 = s^2 = \frac{1}{n-p} \hat{\boldsymbol\epsilon}^T \hat{\boldsymbol\epsilon} = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\]
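
As a concrete illustration of this formula in R (a sketch only; mod stands in for a fitted lm object like the one from Exercise 4):

# sketch: compute s^2 via the matrix formula above
X <- model.matrix(mod)                             # n-by-p design matrix
res <- y - X %*% coef(mod)                         # residuals, y - X betahat
s2 <- as.numeric(t(res) %*% res) / (n - ncol(X))   # divide by n - p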

  a. If we just have a single predictor \(x_1\), what does \(s^2\) look like? Simplify here.
Your Answer to Exercise 5 - Part A

\[ \begin{align*} s^2 & = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\ & = \frac{1}{n-p} \left( ??? \right) \\ \end{align*} \]

  b. Then, calculate \(s^2\) in our toy data. How does it compare to the true value of \(\sigma^2\)? (Remember: we simulated these data so we know the truth!)
# your code here
Your Answer to Exercise 5 - Part B

TBD



Exercise 6

Now we can put the pieces together!

Based on your answers to Exercises 2 and 5, what is your estimate for the standard error of \(\hat\beta_1\) in our toy example? Calculate below.

# your code here

How does your answer compare to what you got from lm? (Hint: They should be the same. If not, revisit your work above!)

Your Answer to Exercise 6

TBD



Exercise 7

How can we assess if this standard error estimate is correct? Recall what standard errors represent: how much our estimate might vary if we had a different sample of data (i.e., the variability of our estimate across repeated samples). Simulations are, yet again, a useful tool for evaluating standard error estimates. The idea is this:

  1. generate a dataset
  2. fit a linear regression model to that dataset
  3. record the estimate of interest
  4. repeat steps 1–3 many times
  5. look at the variability (standard deviation) of those estimates across the many datasets — this is the standard error of that estimator!

Let’s try this out, focusing on the slope coefficient \(\hat\beta_1\). Fill in the blanks in the code below.

# create a function to do one simulation replicate
do_one <- function(n = 100, b0 = 0, b1 = 1){
  # step 1: generate data
  x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
  e <- rnorm(n)
  y <- ___
  dat <- cbind(y, x) %>%
   as.data.frame()
  
  # step 2: fit model
  mod <- lm(___)
  
  # step 3: record slope estimate
  betahat <- ___
  return(betahat)
}
# repeat 1000 times
set.seed(494)
betahats <- replicate(n = ___, do_one())

Visualize your results, calculate the variability of your estimates, and compare this to the standard error you derived above. What do you see?

# your code here
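
One possible starting point (a sketch, assuming betahats holds your simulated slope estimates from above):

# sketch: distribution and spread of the simulated slope estimates
hist(betahats, xlab = "slope estimate")   # visualize variability across repeated samples
sd(betahats)                              # empirical standard error of the estimator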
Your Answer to Exercise 7

TBD



Part 2: Correlated Observations

Exercise 8

Now, suppose our observations are not independent. In other words, they’re correlated. Our assumptions from Exercise 1 are updated as follows:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
  • \(E[\boldsymbol\epsilon] = \mathbf{0}\)
  • \(\text{Var}(\boldsymbol\epsilon) = \boldsymbol\Sigma\), for some positive semidefinite \(n\)-by-\(n\) matrix \(\boldsymbol\Sigma\) where the off-diagonal elements of \(\boldsymbol\Sigma\) are no longer required to be zero

Repeat your derivation from Exercise 1 under this updated set of assumptions.

Your Answer to Exercise 8

The variance of the least squares estimator is:

\[ \begin{align*} \text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( ??? \right) \\ & = ??? \\ & = ??? \end{align*} \]



Exercise 9

Compare your answer to Exercise 8 to the one you got in Exercise 1. Are they the same? If not, what implications will this have for our inference?

In your answer, please address the implications for confidence intervals, hypothesis testing, and type I and II errors.

Your Answer to Exercise 9

TBD



Going Deeper

How do we avoid the issues identified in Exercise 9? Two of the most common statistical modeling techniques to address correlation are mixed effects models and generalized estimating equations.
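
For a sense of what these look like in R, here is a sketch only: the grouping variable id below is hypothetical (our toy data have no clusters), and neither package is loaded in this lab.

# mixed effects model with a random intercept for each cluster (lme4)
library(lme4)
mod_lmm <- lmer(y ~ x + (1 | id), data = dat)

# generalized estimating equations with an exchangeable working correlation (geepack)
library(geepack)
mod_gee <- geeglm(y ~ x, id = id, data = dat, corstr = "exchangeable")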

Want to learn more? Take STAT 452: Correlated Data with Prof. Brianna Heggeseth in Fall 2025! (Or check out her course notes here.)



Exercise 10

What’s the point of all the work you did above? Summarize your takeaways from this lab here. In writing your response, imagine you are writing out the learning goals for this assignment (like I’ve done for you at the top of previous labs).

Your Answer to Exercise 10

Lab 5 Learning Goals:

  • Goal 1
  • Goal 2
  • Etc.



The end :)

Footnotes

  1. \(s^2\) isn’t exactly the sample variance of our residuals. It’s been modified slightly (multiplied by a constant) to make it an unbiased estimator of \(\sigma^2\). For those of you who have taken MATH/STAT 355: Statistical Theory, you’ve seen something similar in the derivation of the unbiased estimator of the variance of a single random variable, \(\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).↩︎