Lab 1: Exploring the p > n problem

STAT 494: Statistical Genetics

Published

January 30, 2025



Toy Data Example

To understand the p > n problem, we’ll start by exploring a toy data example.

Part 1: Simulating One Variant

To simulate a biallelic genetic variant, we can use the rbinom function in R, which randomly generates data from a binomial distribution. If we set the number of trials \(n\) equal to 2 (the size argument in rbinom) and the success probability \(p\) equal to our minor allele frequency (MAF), rbinom will randomly generate counts of 0, 1, or 2 that we can think of as the number of minor alleles carried by each individual at that position.

Pause and Discuss

Think back to MATH/STAT 354: Probability. What do you remember about the binomial distribution? Why is this an appropriate distribution to be using here?

Run the code below to see this in action:

set.seed(494) # for reproducible random number generation
snp <- rbinom(n = 100, size = 2, p = 0.1) # 100 people, MAF = 0.1
print(snp)
  [1] 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0
 [75] 0 2 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0

Try changing \(p\) or \(n\) to see how the data change with different minor allele frequencies and sample sizes.

# your code here
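
If you’re not sure where to start, here is one possible comparison to try (the MAF values below are arbitrary; pick your own, and try changing the sample size too):

set.seed(494)
rare_snp   <- rbinom(n = 100, size = 2, prob = 0.01) # rarer variant (MAF = 0.01)
common_snp <- rbinom(n = 100, size = 2, prob = 0.40) # more common variant (MAF = 0.40)
table(rare_snp)   # how many people carry 0, 1, or 2 copies of each minor allele?
table(common_snp)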

Write a few notes about what you found here.

YOUR ANSWER HERE

Part 2: Simulating Many Variants

In a typical GWAS, we’re working with a lot more than just a single variant. If we want to quickly generate many variants, it can be useful to first create our own function that will generate one variant (do_one) and then use replicate to run that function many times, thus generating many variants.

Note

In reality, nearby genetic variants are correlated with one another, but we’ll ignore that correlation structure for now and just generate independent variants.

I wrote the do_one function for you. Take a look at the code:

# function to simulate one genetic variant
do_one <- function(n_ppl, MAF){
  snp <- rbinom(n = n_ppl, size = 2, p = MAF)
  return(snp)
}

Try running the do_one function a few times, with different inputs of n_ppl and MAF. Confirm that you understand what it is doing.

# try do_one
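
For example, you might try calls like these (the inputs are arbitrary):

do_one(n_ppl = 10, MAF = 0.5)  # 10 people, common variant
do_one(n_ppl = 15, MAF = 0.05) # 15 people, rarer variant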

Now, use the replicate function to run do_one 1000 times.

# replicate 1000 times to create 1000 variants
set.seed(494)
snps <- replicate(1000, do_one(n_ppl = 100, MAF = 0.1))

What is the dimension of the snps object that we just created? Is this an example of “big” data? Explain.

# your code here
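
One way to check the dimensions (a sketch using base R’s dim function):

dim(snps) # rows = people, columns = simulated variants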

YOUR ANSWER HERE

Part 3: Simulating a Null Trait

Our last step in creating our toy dataset is to add a trait. Let’s create a null trait (that isn’t associated with any genetic variants) using the rnorm function to randomly sample from a normal distribution:

set.seed(494) # set seed for reproducibility
y <- rnorm(100, mean = 65, sd = 3) # generate trait
print(y)
  [1] 59.62481 63.88034 63.02679 62.50892 70.83525 67.50514 63.61816 69.64214
  [9] 64.97347 65.47989 63.13887 63.07716 63.74931 64.70136 65.07071 58.33367
 [17] 61.91952 64.85250 65.66969 62.40541 62.04025 61.84794 64.28388 65.04385
 [25] 61.42188 62.55285 59.29327 67.46579 61.65637 67.67236 66.09086 70.02120
 [33] 68.01750 60.73839 58.42093 67.99430 65.89968 64.39152 65.65709 66.54789
 [41] 65.00309 65.34142 63.41219 68.68026 61.68235 61.18094 68.73800 69.94243
 [49] 66.55700 60.87955 66.43440 56.08365 69.62706 68.09680 68.85878 64.92834
 [57] 63.05479 65.98005 64.87043 64.69823 65.76869 67.50849 65.20761 62.59497
 [65] 68.12195 67.62197 66.63407 66.00194 62.77750 63.89318 63.51581 71.79953
 [73] 65.50906 68.08233 68.99696 64.76795 69.99623 66.88639 65.01119 65.69350
 [81] 64.53200 64.69204 65.19411 64.11513 63.19411 62.63277 62.99986 64.56554
 [89] 67.31328 62.24985 65.15231 61.25394 67.23057 65.94068 64.43938 69.38935
 [97] 66.89983 65.07994 63.90870 60.93687

Finally, we can combine together our simulated SNPs and trait into a single data frame:

dat <- as.data.frame(cbind(y, snps))

Take a minute to explore (e.g., View) our final dataset.
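
For example, a few quick ways to look around (a sketch):

dim(dat)      # number of rows (people) and columns (trait + SNPs)
dat[1:5, 1:6] # first five people: the trait y plus the first five SNPs
# View(dat)   # or open the spreadsheet-style viewer in RStudio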

Part 4: Multiple Linear Regression

Try fitting a multiple linear regression model to investigate the association between our trait, \(y\), and our 1000 SNPs. Hint: instead of typing out the name of each SNP variable, e.g., lm(y ~ V1 + V2 + V3 + ...), use the shortcut lm(y ~ ., data = dat) to tell R to use all of the predictor variables.

# your code here
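
If you’re stuck, here is one way you might set this up (the object name mod is just a suggestion):

mod <- lm(y ~ ., data = dat) # regress the trait on all 1000 SNPs at once
head(coef(mod))              # peek at the first few estimated coefficients
sum(is.na(coef(mod)))        # how many coefficients could not be estimated?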

What happens when you try to fit this multiple linear regression model?

YOUR ANSWER HERE

Theory of Linear Models

To understand what’s happening here, we need to get into the mathematical theory behind linear models.

To estimate the coefficients in a multiple linear regression model, we typically use an approach known as least squares estimation.

In fitting a linear regression model, our goal is to find the line that provides the “best” fit to our data. More specifically, we hope to find the values of the parameters \(\beta_0, \beta_1, \beta_2, \dots\) that minimize the sum of squared residuals:

\[\sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - [\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots])^2\]

In other words, we’re trying to find the coefficient values that minimize the sum of squares of the vertical distances from the observed data points \((\mathbf{x}_i, y_i)\) to the fitted values \((\mathbf{x}_i, \hat{y}_i)\).

In order to find the least squares estimates for the intercept and slope(s) of our linear regression model, we need to minimize a function—the sum of squared residuals. To do this, we can use techniques from multivariate calculus!

Tip

Take a look at the following resources to review this concept further:

Part 5: Simple Linear Regression

To find the least squares estimates \(\hat\beta_0, \hat\beta_1\) for a simple linear regression model (with one predictor variable) \[E[y \mid x] = \beta_0 + \beta_1 x,\] start by finding the partial derivatives \(\frac{\partial}{\partial\beta_0}\) and \(\frac{\partial}{\partial\beta_1}\) of the sum of squared residuals:

\[ \begin{aligned} \frac{\partial}{\partial\beta_0} \sum_{i=1}^n (y_i - [\beta_0 + \beta_1 x_i])^2 &= \sum_{i=1}^n 2(y_i - [\beta_0 + \beta_1 x_i])(-1) \\ & = \sum_{i=1}^n -2y_i + \sum_{i=1}^n 2\beta_0 + \sum_{i=1}^n 2\beta_1 x_i \\ &= -2 n \bar{y} + 2n \beta_0 + 2 \beta_1 n \bar{x}, \text{ where } \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \text{ and } \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \\ &= 2n(\beta_0 + \beta_1 \bar{x} - \bar{y})\\ \end{aligned} \]

Now you try: find the partial derivative with respect to \(\beta_1\).

\[ \begin{aligned} \frac{\partial}{\partial\beta_1} \sum_{i=1}^n (y_i - [\beta_0 + \beta_1 x_i])^2 &= \dots \\ \end{aligned} \]

Then, set both of these partial derivatives equal to zero and solve for \(\beta_0, \beta_1\):

\[ \begin{aligned} ... \end{aligned} \]

Check your answer: if you did everything correctly, your estimates should match the equations in the Stat 155 Notes.

Part 6: Matrix Version of Least Squares

Deriving the least squares estimates for \(\beta_0\) and \(\beta_1\) was perhaps a bit tedious, but doable, for a simple linear regression model with one explanatory variable. Things get more difficult when we consider multiple linear regression models with many explanatory variables:

\[E[y \mid x_1, x_2, \dots, x_p] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.\]

When we have many explanatory variables, it’s often useful to formulate our linear regression model using vectors and matrices.

Consider the vector of outcomes \(\mathbf{y}\)

\[\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\]

the vector of coefficients \(\boldsymbol\beta\)

\[\boldsymbol\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix},\]

and the matrix of covariates (sometimes referred to as the “design matrix”) \(\mathbf{X}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}.\]

Then, we can write our linear regression model as

\[E[\mathbf{y} \mid \mathbf{X}] = \mathbf{X} \boldsymbol\beta\]

or

\[\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon,\]

where \(E[\boldsymbol\epsilon] = \mathbf{0}\).

Linear Algebra Review. What are the dimensions of \(\mathbf{X}\) and \(\boldsymbol\beta\)? What will be the dimension of their product \(\mathbf{X} \boldsymbol\beta\)?

YOUR ANSWER HERE

Linear Algebra Review (continued). Calculate at least a few of the entries of \(\mathbf{X} \boldsymbol\beta\) to confirm that this is an appropriate way to represent our linear regression model equation.

\[ \begin{aligned} ... \end{aligned} \]

Using matrix notation, we can now formulate the least squares problem as follows:

\[\text{argmin}_{\boldsymbol\beta} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta),\]

where the notation \(\text{argmin}_{\boldsymbol\beta} f(\boldsymbol\beta)\) means that we want to find the value of \(\boldsymbol\beta\) that minimizes the function \(f(\boldsymbol\beta)\).

Linear Algebra Review (continued). Convince yourself that the expression \((\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\) is equivalent to the sum of squared residuals \(\sum_{i=1}^n r_i^2\) as we’re used to seeing it.

\[ \begin{aligned} ... \end{aligned} \]

Part 7: Matrix Calculus

Our next step is to find the value of \(\boldsymbol\beta\) that minimizes the sum of squared residuals \((\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)\).

There are some useful results about vectors, matrices, and their derivatives that will be helpful for this derivation. In particular:

  • \((a + b)^\top c = a^\top c + b^\top c\)
  • \((AB)^\top = B^\top A^\top\)
  • \(a^\top b = b^\top a\)
  • \(\frac{\partial}{\partial \mathbf{a}} \mathbf{a}^\top \mathbf{b} = \mathbf{b} = \frac{\partial}{\partial \mathbf{a}} \mathbf{b}^\top \mathbf{a}\)
  • \(\frac{\partial}{\partial \mathbf{a}} \mathbf{a}^\top \mathbf{B} \mathbf{a} = 2\mathbf{B}\mathbf{a}\), when \(\mathbf{B}\) is symmetric (which \(\mathbf{X}^\top \mathbf{X}\) is)

Challenge

For this assignment, you can use these results without proof. But, it would be a useful exercise to confirm for yourself that all (or at least some) of these results are true!
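
If you’d like a quick (non-rigorous) numerical sanity check of the last identity in the list above, here is one possible sketch in R; the matrix B and vector a are arbitrary, and a finite-difference approximation stands in for the exact derivative:

# numerically check that d/da (a' B a) = 2 B a for a symmetric B (not a proof!)
set.seed(1)
B <- matrix(rnorm(9), nrow = 3)
B <- (B + t(B)) / 2                           # make B symmetric
a <- rnorm(3)
f <- function(a) as.numeric(t(a) %*% B %*% a) # the quadratic form a' B a
eps <- 1e-6
numeric_grad <- sapply(1:3, function(j) {
  a_step <- a
  a_step[j] <- a_step[j] + eps
  (f(a_step) - f(a)) / eps                    # finite-difference approximation
})
cbind(numeric = numeric_grad, analytic = as.numeric(2 * B %*% a))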

Using these results, find the derivative of the sum of squared residuals with respect to \(\boldsymbol\beta\):

\[ \begin{aligned} \frac{\partial}{\partial \boldsymbol\beta} (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta) & = \dots \end{aligned} \]

Then, set this equal to zero and solve for \(\boldsymbol\beta\):

\[ \begin{aligned} -2\mathbf{X}^\top\mathbf{y} + 2 \mathbf{X}^\top \mathbf{X} \boldsymbol\beta &\stackrel{set}{=} 0 \\ \dots \end{aligned} \]

Check your answer: you should end up with the solution \(\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\).
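
If you’d like to see this formula in action before moving on, here is one possible numerical check; it uses only the first five simulated SNPs so that \(p < n\) and \(\mathbf{X}^\top \mathbf{X}\) is invertible (the object names are just illustrative):

X <- cbind(1, snps[, 1:5])                       # design matrix: intercept + first 5 SNPs
beta_formula <- solve(t(X) %*% X) %*% t(X) %*% y # least squares estimate via the formula
beta_lm <- coef(lm(y ~ snps[, 1:5]))             # same model, fit with lm
cbind(formula = as.numeric(beta_formula), lm = as.numeric(beta_lm)) # should match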

Part 8: The p > n problem, revisited

Now that we’ve done all that math, we can finally come back to our p > n problem.

We saw above that the least squares estimator for the coefficients of a multiple linear regression model can be written as \[\hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.\]

Let’s calculate \(\mathbf{X}^\top \mathbf{X}\) for our toy example above, using the matrix of simulated SNPs as our design matrix (we’ll ignore the intercept column here for simplicity):

xtx <- t(snps) %*% snps

What is the determinant of \(\mathbf{X}^\top \mathbf{X}\)?

# your code here
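
One possible approach (a sketch using base R’s det function):

det(xtx)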

What does this tell us about \((\mathbf{X}^\top \mathbf{X})^{-1}\)?

YOUR ANSWER HERE