Lab 5: Exploring the impact of correlation

STAT 494: Statistical Genetics

Published

April 1, 2025



Getting Started

  1. Access the source code for this document (click the </> Code button above)

  2. Copy-paste into a new QMD

  3. Add your name as the author in the YAML header, above

  4. Load the packages that we’ll need for this lab. (And install them first if needed.)

library(tidyverse)
library(broom)
  5. Note the continued usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Your Answer

Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.



Part 1: Independent Observations

Exercise 1

Suppose we fit a linear regression model exploring the relationship between a quantitative trait \(y\) and a set of predictors \(x_1, x_2, ..., x_p\). Derive the variance of the least squares estimator we derived in Lab 1, \[\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},\] where \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn}\end{bmatrix}, \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}\] in the context of independent observations.

You may make the following assumptions:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
  • \(E[\boldsymbol\epsilon] = \mathbf{0}\)
  • \(\text{Var}(\boldsymbol\epsilon) = \sigma^2 \mathbf{I}_{n \times n}\), where \(\mathbf{I}_{n \times n}\) is the \(n\)-by-\(n\) identity matrix
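
As a hint (not an additional assumption, just a standard fact you may find useful): for any constant matrix \(\mathbf{A}\) and random vector \(\mathbf{z}\), \[\text{Var}(\mathbf{A}\mathbf{z}) = \mathbf{A}\,\text{Var}(\mathbf{z})\,\mathbf{A}^T.\]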
Your Answer to Exercise 1

The variance of the least squares estimator is:

\[ \begin{align*} \text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( ??? \right) \\ & = ??? \\ & = ??? \end{align*} \]



Exercise 2

Suppose we find ourselves in a simplified scenario with just a single predictor \(x_1\). We now have the following: \[\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \ \mathbf{X} = \begin{bmatrix} 1 & x_{11} \\ 1 & x_{12} \\ \vdots & \vdots \\ 1 & x_{1n} \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}, \text{ and } \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\] Using your work from Exercise 1, determine the standard error of the slope estimator \(\hat\beta_1\).

Suggestions:

  • plug in the design matrix, \(\mathbf{X} = \begin{bmatrix} \mathbf{1} & \mathbf{x}_1 \end{bmatrix}\), to your answer from Exercise 1
  • simplify (you will need to remember how to invert a 2-by-2 matrix! A reminder of the formula appears after this list.)
  • you will end up with a 2-by-2 variance-covariance matrix for \(\hat\beta_0\) and \(\hat\beta_1\) – the standard error of \(\hat\beta_1\) is the square root of the second diagonal element
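
For reference, the inverse of a 2-by-2 matrix is \[\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix},\] provided \(ad - bc \neq 0\).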
Your Answer to Exercise 2

TBD



Exercise 3

Let’s check our work by comparing these derivations to some lm output! We’ll simulate a small “toy” dataset to use as an example.

# define parameters
n <- 100
b0 <- 0
b1 <- 1

# generate data
set.seed(494)
x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
e <- rnorm(n)
y <- b0 + b1 * x + e

# combine into data frame
dat <- cbind(y, x) %>%
  as.data.frame()

# check it out
head(dat, 10)
             y x
1   0.47813179 0
2  -2.97211789 0
3   1.54235459 0
4   1.03226746 0
5   1.28625948 0
6  -0.02388712 0
7  -0.64840383 0
8   0.32668469 0
9   1.95681007 2
10  1.89940920 2

Read the code above carefully. Make sure you understand what each line of code is doing. Explain below.

Your Answer to Exercise 3

TBD



Exercise 4

Now, fit a linear regression model to the toy dataset we created in Exercise 3. Our outcome variable is y and predictor is x. Print out a table of coefficient estimates, standard errors, test statistics, and p-values for the fitted model.

# fit model
# table of coefficient estimates, SE, etc.
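
If helpful, here is a minimal sketch of one way to do this with lm() and broom::tidy(); the object name mod is just illustrative.

# one possible approach (sketch)
mod <- lm(y ~ x, data = dat)   # fit the simple linear regression of y on x
tidy(mod)                      # coefficient estimates, SEs, test statistics, p-values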

How close are the coefficient estimates to their true values? (Remember: we simulated these data so we know the truth!)

Your Answer to Exercise 4

TBD



Exercise 5

There’s one more piece of information we’ll need before we can compare our standard error derivation from Exercise 2 to the lm output.

Notice that your answer to Exercise 2 involves a parameter, \(\sigma^2\), that would typically be unknown. Thus, in order to estimate the standard error of \(\hat\beta_1\), we need to plug in an estimate of \(\sigma^2\) as well. Since \(\sigma^2\) represents the variance of our error terms, a reasonable way to estimate it is to look at the sample variance of our residuals, \(\hat{\boldsymbol\epsilon} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\). We’ll use the following estimator¹: \[\hat\sigma^2 = s^2 = \frac{1}{n-p} \hat{\boldsymbol\epsilon}^T \hat{\boldsymbol\epsilon} = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})\]
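
As a concrete illustration of this formula in R (a sketch only; mod stands in for a fitted lm object like the one from Exercise 4):

# sketch: compute s^2 via the matrix formula above
X <- model.matrix(mod)                             # n-by-p design matrix
res <- y - X %*% coef(mod)                         # residuals, y - X betahat
s2 <- as.numeric(t(res) %*% res) / (n - ncol(X))   # divide by n - p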

  a. If we just have a single predictor \(x_1\), what does \(s^2\) look like? Simplify here.
Your Answer to Exercise 5 - Part A

\[ \begin{align*} s^2 & = \frac{1}{n-p}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\ & = \frac{1}{n-p} \left( ??? \right) \\ \end{align*} \]

  b. Then, calculate \(s^2\) in our toy data. How does it compare to the true value of \(\sigma^2\)? (Remember: we simulated these data so we know the truth!)
# your code here
Your Answer to Exercise 5 - Part B

TBD



Exercise 6

Now we can put the pieces together!

Based on your answers to Exercises 2 and 5, what is your estimate for the standard error of \(\hat\beta_1\) in our toy example? Calculate below.

# your code here

How does your answer compare to what you got from lm? (Hint: They should be the same. If not, revisit your work above!)

Your Answer to Exercise 6

TBD



Exercise 7

How can we assess if this standard error estimate is correct? Recall what standard errors represent: how much our estimate might vary if we had a different sample of data (i.e., the variability of our estimate across repeated samples). Simulations are, yet again, a useful tool for evaluating standard error estimates. The idea is this:

  1. generate a dataset
  2. fit a linear regression model to that dataset
  3. record the estimate of interest
  4. repeat steps 1–3 many times
  5. look at the variability (standard deviation) of those estimates across the many datasets — this is the standard error of that estimator!

Let’s try this out, focusing on the slope coefficient \(\hat\beta_1\). Fill in the blanks in the code below.

# create a function to do one simulation replicate
do_one <- function(n = 100, b0 = 0, b1 = 1){
  # step 1: generate data
  x <- sample(0:2, size = n, replace = TRUE, prob = c(.64, .32, .04))
  e <- rnorm(n)
  y <- ___
  dat <- cbind(y, x) %>%
   as.data.frame()
  
  # step 2: fit model
  mod <- lm(___)
  
  # step 3: record slope estimate
  betahat <- ___
  return(betahat)
}
# repeat 1000 times
set.seed(494)
betahats <- replicate(n = ___, do_one())

Visualize your results, calculate the variability of your estimates, and compare this to the standard error you derived above. What do you see?

# your code here
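
One possible starting point (a sketch, assuming betahats holds your simulated slope estimates from above):

# sketch: distribution and spread of the simulated slope estimates
hist(betahats, xlab = "slope estimate")   # visualize variability across repeated samples
sd(betahats)                              # empirical standard error of the estimator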
Your Answer to Exercise 7

TBD



Part 2: Correlated Observations

Exercise 8

Now, suppose our observations are not independent. In other words, they’re correlated. Our assumptions from Exercise 1 are updated as follows:

  • \(\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon\)
  • \(E[\boldsymbol\epsilon] = \mathbf{0}\)
  • \(\text{Var}(\boldsymbol\epsilon) = \boldsymbol\Sigma\), for some positive semidefinite \(n\)-by-\(n\) matrix \(\boldsymbol\Sigma\) where the off-diagonal elements of \(\boldsymbol\Sigma\) are no longer required to be zero

Repeat your derivation from Exercise 1 under this updated set of assumptions.

Your Answer to Exercise 8

The variance of the least squares estimator is:

\[ \begin{align*} \text{Var}(\hat{\boldsymbol\beta}) & = \text{Var}\left( ??? \right) \\ & = ??? \\ & = ??? \end{align*} \]



Exercise 9

Compare your answer to Exercise 8 to the one you got in Exercise 1. Are they the same? If not, what implications will this have for our inference?

In your answer, please address the implications for confidence intervals, hypothesis testing, and type I and II errors.

Your Answer to Exercise 9

TBD



Going Deeper

How do we avoid the issues identified in Exercise 9? Two of the most common statistical modeling techniques to address correlation are mixed effects models and generalized estimating equations.
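
For a sense of what these look like in R, here is a sketch only: the grouping variable id below is hypothetical (our toy data have no clusters), and neither package is loaded in this lab.

# mixed effects model with a random intercept for each cluster (lme4)
library(lme4)
mod_lmm <- lmer(y ~ x + (1 | id), data = dat)

# generalized estimating equations with an exchangeable working correlation (geepack)
library(geepack)
mod_gee <- geeglm(y ~ x, id = id, data = dat, corstr = "exchangeable")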

Want to learn more? Take STAT 452: Correlated Data with Prof. Brianna Heggeseth in Fall 2025! (Or check out her course notes here.)



Exercise 10

What’s the point of all the work you did above? Summarize your takeaways from this lab here. In writing your response, imagine you are writing out the learning goals for this assignment (like I’ve done for you at the top of previous labs).

Your Answer to Exercise 10

Lab 5 Learning Goals:

  • Goal 1
  • Goal 2
  • Etc.



The end :)

Footnotes

  1. \(s^2\) isn’t exactly the sample variance of our residuals. It’s been modified slightly (multiplied by a constant) to make it an unbiased estimator of \(\sigma^2\). For those of you who have taken MATH/STAT 355: Statistical Theory, you’ve seen something similar in the derivation of the unbiased estimator of the variance of a single random variable, \(\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\).↩︎