Lab 4: Principal component analysis

STAT 494: Statistical Genetics

Published

February 25, 2025

Goals

Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium)
Compare GWAS models with and without PC adjustment
Explore and derive omitted variable bias
Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data

Getting Started

Access the source code for this document (click the </> Code button above)
Copy-paste into a new QMD
Add your name as the author in the YAML header, above
Download the .bib file that contains the references cited later in this lab here. Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD.
Load the packages that we’ll need for this lab. (And install them first if needed.)

library(tidyverse)
library(GGally)
library(gridExtra)

Note the new usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!

Your Answer

Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.

Exercises

Part 1: Building Intuition

Load Data

Read the toy data (pca_toy_data.csv) into R:

pcadata <- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv')

Rows: 1000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore Data

Explore the toy dataset, with the goal of answering each of the questions below.

How many people are in this dataset? How many SNPs?

# your code here

Your Answer

type your answer here (keeping it inside this callout block.)

How many populations are represented? How many people belong to each population?

# your code here

Your Answer

your answer here

Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?

# your code here

Your Answer

your answer here

Run PCA

There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning.

Tip

If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources:

Stat 253 Videos: PCA (v1), PCA (v2)
Stat 253 Course Notes: Lectures 19 (PCA) and 20 (PCR)
Introduction to Statistical Learning: Sections 6.3.1 (PCR) and 12.2 (PCA)

First, we’ll need to set up the data. Create a data frame called geno that contains only the genotype information for each of the SNPs (i.e., remove the trait and population information).

# your code here

Then, run the code chunk below to perform PCA using the prcomp function.

pca_out <- prcomp(geno, center = TRUE, scale = TRUE)

Error: object 'geno' not found

Open the help page for prcomp and read through the description of what it returns. Hint: see the “Value” section of the help page. Which component of the output contains the loadings? Which part contains the scores?

Your Answer

your answer here

Based on what you learned from the help page, extract the loadings and scores and save them as new objects called pca_loadings and pca_scores, respectively. Hint: use $ or pluck() to extract a particular component from a list.

# your code here

Plot Results

Scores

Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.

# your code here

What do you learn from this plot?

Your Answer

your answer here

Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:

# parallel coordinates plot
pca_scores %>%
  as.data.frame() %>%
  mutate(population = as.factor(pcadata$population)) %>% 
  ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) + 
  theme_minimal() + 
  scale_color_brewer(palette = 'Dark2')

Error: object 'pca_scores' not found

What do you learn from the parallel coordinates plot?

Your Answer

your answer here

Loadings

Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC.

# your code here

Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is?

Your Answer

your answer here

Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?

# your code here

Your Answer

your answer here

Variance Explained

As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.

# extract variance of each PC
pca_var <- (pca_out$sdev)^2

Error: object 'pca_out' not found

# calculate proportion of variance explained
total_var <- sum(pca_var)

Error: object 'pca_var' not found

pve <- pca_var/total_var

Error: object 'pca_var' not found

Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs.

# your code here

What do you learn from this plot?

Your Answer

your answer here

Impact of LD

In multiple journal club articles (Novembre et al. 2008; Burton et al. 2007) we’ve seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up “effects attributable to local linkage disequilibrium rather than genome-wide structure” (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages.

First, let’s review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review these notes, particularly Examples 1–5, if you haven’t taken it yet): what is often true about PCA and correlated features?

Your Answer

your answer here

Next, explain how you could design a simulation study or, at least a “toy” data example like the one we’re using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?

Your Answer

your answer here

Optional Challenge: Implement the approach you described above. What did you learn?

Your Answer (Optional Challenge)

your answer here

Part 2: Impact on GWAS

Toy Data

Now, let’s investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do not make any adjustment for population structure. In other words, use models of the form trait ~ snp with no other covariates.

# your code here

What do you notice? Which SNP(s) have the smallest p-value?

Your Answer

your answer here

Next, conduct a genome-wide association study using models that adjust for the first PC (trait ~ snp + PC1). How do the results compare to the unadjusted analysis?

# your code here

Your Answer

your answer here

The trait that we are using for these GWAS is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?

Your Answer

your answer here

Confounding

The question of when we should include PCs — or some other measure of genetic similarity / geography / ancestry / population structure — in our GWAS models is at it’s heart a question about confounding.

STAT 155 Review: Suppose $Y$ is our outcome variable and $X$ is our predictor of interest. A third variable, $Z$, confounds the relationship between $X$ and $Y$ if… (fill in the blank).

Your Answer

your answer here

Using this definition as a guide, explain why population membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section.

Your Answer

your answer here

Omitted Variable Bias

As you learned in STAT 155, if we believe a variable is a confounder we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation of above, if I am interested in the relationship between $X$ and $Y$ but I believe a third variable, $Z$ is a confounding variable, then I should fit the regression model lm(y ~ x + z) rather than lm(y ~ x). If we don’t do this, our results will be biased.

Warning

This section is under construction! More detailed instructions coming soon.

Part 3: Application to Real Genetic Data (Learn a New R Package!)

As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA “by hand” as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example — created by some of researchers at the University of Washington — is SNPRelate. Let’s try it out!

First, install the SNPRelate R package from Bioconductor:

# NOTE: keep the code chunk option eval: false so this gets skipped when rendering
# your installation code here

Then load it:

# your code here

Next, locate the “SNPRelate Tutorial” vignette. (Refer back to Lab 2 for a reminder on how to access a package’s vignettes.) Read, and replicate, the first two parts of the Data Analysis section: “LD-based SNP pruning” and “Principal Component Analysis”. Record your code and insights here. In particular…

How many individuals are present in the provided example data? How many SNPs?

# your code here

Your Answer

your answer here

What populations are present in these data? What do each of the population codes (e.g., “CEU”) stand for?

# your code here

Your Answer

your answer here

The snpgdsLDpruning function performs a pre-processing step known as “LD pruning” which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning?

set.seed(494)
# your code here

Your Answer

your answer here

Now, run PCA on this reduced set of “LD-pruned” SNPs.

# your code here

Finally, create the following plots:

a score plot (with individuals colored by their “true” population label)
parallel coordinates plot
loadings plot (the tutorial presents an example looking at the correlation between eigenvectors and SNP genotypes. this is similar to looking at the loadings. there’s also a snpgdsPCASNPLoading function – see if you can figure out how to use it.)
scree plot

# score plot

# parallel coordinates plot

# loadings plot

# scree plot

What do you learn from these plots? Write down 1–2 sentences for each.

Your Answer

your answer here

References

Burton, Paul R., David G. Clayton, Lon R. Cardon, Nick Craddock, Panos Deloukas, Audrey Duncanson, Dominic P. Kwiatkowski, et al. 2007. “Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls.” Nature 447 (7145): 661–78.

Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R Boyko, Adam Auton, Amit Indap, et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456 (7218): 98–101.

--- title: "Lab 4: Principal component analysis" subtitle: "STAT 494: Statistical Genetics" image: "../images/pca_scoreplot.png" bibliography: "../files/references.bib" date: "February 25, 2025" author: "" format: html: toc: true toc-depth: 3 embed-resources: true code-tools: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, error = TRUE) ``` \ \ # Goals - Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium) - Compare GWAS models with and without PC adjustment - Explore and derive omitted variable bias - Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data \ \ # Getting Started 1. Access the source code for this document (click the `</> Code` button above) 2. Copy-paste into a new QMD 3. Add your name as the `author` in the YAML header, above 4. Download the `.bib` file that contains the references cited later in this lab [here](../files/references.bib). Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD. 5. Load the packages that we'll need for this lab. (And install them first if needed.) ```{r} #| message: false library(tidyverse) library(GGally) library(gridExtra) ``` 6. Note the new usage of customized Quarto [callouts](https://quarto.org/docs/authoring/callouts.html), like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work! :::{.callout-note icon=false title="Your Answer"} Here's an example of one of our customized "Your Answer" callout blocks. Whenever you are answering a question, please put all text within the provided block. ::: \ \ # Exercises ## Part 1: Building Intuition ### Load Data Read the toy data (`pca_toy_data.csv`) into R: ```{r read-data} pcadata <- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv') ``` \ \ ### Explore Data Explore the toy dataset, with the goal of answering each of the questions below. How many people are in this dataset? How many SNPs? ```{r} # your code here ``` :::{.callout-note icon=false title="Your Answer"} type your answer here (keeping it inside this callout block.) ::: \ How many populations are represented? How many people belong to each population? ```{r} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP? ```{r} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ ### Run PCA There is more than one way to run PCA in R. We'll use similar code to what you may have seen in *STAT 253: Statistical Machine Learning*. :::{.callout-tip} If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources: - Stat 253 Videos: [PCA (v1)](https://www.youtube.com/watch?v=JEwkN4Lq92M), [PCA (v2)](https://www.youtube.com/watch?v=x4_XTBoxr3Q) - [Stat 253 Course Notes](https://kegrinde.github.io/stat253_coursenotes/): Lectures 19 (PCA) and 20 (PCR) - [Introduction to Statistical Learning](https://www.statlearning.com/): Sections 6.3.1 (PCR) and 12.2 (PCA) ::: \ First, we'll need to set up the data. Create a data frame called `geno` that contains only the genotype information for each of the SNPs (i.e., remove the `trait` and `population` information). ```{r pca-setup} # your code here ``` \ Then, run the code chunk below to perform PCA using the `prcomp` function. ```{r run-pca} pca_out <- prcomp(geno, center = TRUE, scale = TRUE) ``` \ Open the help page for `prcomp` and read through the description of what it returns. *Hint: see the "Value" section of the help page.* Which component of the output contains the loadings? Which part contains the scores? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Based on what you learned from the help page, extract the loadings and scores and save them as new objects called `pca_loadings` and `pca_scores`, respectively. *Hint: use `$` or `pluck()` to extract a particular component from a list.* ```{r extract-scores-and-loadings} # your code here ``` \ \ ### Plot Results #### Scores Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to. ```{r plot-scores} # your code here ``` What do you learn from this plot? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Another way to visualize the PC scores is to make what's known as a *parallel coordinates plot*. This plot will allow us to look at all of our PCs at once. As above, we'll color each of the individuals according to which population they belong to. I'll give you most of the code for this one: ```{r parallel-coordinates} # parallel coordinates plot pca_scores %>% as.data.frame() %>% mutate(population = as.factor(pcadata$population)) %>% ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) + theme_minimal() + scale_color_brewer(palette = 'Dark2') ``` What do you learn from the parallel coordinates plot? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ #### Loadings Next, let's investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC. ```{r plot-PC1-loadings} # your code here ``` Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Now, repeat this process for the second PC, third, and fourth PCs. What do you notice? ```{r plot-other-loadings} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ #### Variance Explained As we've discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what's known as a scree plot. The code chunk below calculates the proportion of variance explained. ```{r} # extract variance of each PC pca_var <- (pca_out$sdev)^2 # calculate proportion of variance explained total_var <- sum(pca_var) pve <- pca_var/total_var ``` \ Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs. ```{r} # your code here ``` What do you learn from this plot? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ ### Impact of LD In multiple journal club articles [@novembre2008; @wellcome2007] we've seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up "effects attributable to local linkage disequilibrium rather than genome-wide structure" (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages. \ First, let's review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review [these notes](https://kegrinde.github.io/stat253_coursenotes/L19-pca.html), particularly Examples 1--5, if you haven't taken it yet): what is often true about PCA and correlated features? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Next, explain how you could design a simulation study or, at least a "toy" data example like the one we're using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. *How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?* :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ **Optional Challenge:** Implement the approach you described above. What did you learn? ```{r} ``` :::{.callout-note appearance="minimal" title="Your Answer (Optional Challenge)"} your answer here ::: \ \ ## Part 2: Impact on GWAS ### Toy Data Now, let's investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do *not* make any adjustment for population structure. In other words, use models of the form `trait ~ snp` with no other covariates. ```{r unadjusted-gwas} # your code here ``` What do you notice? Which SNP(s) have the smallest p-value? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Next, conduct a genome-wide association study using models that adjust for the first PC (`trait ~ snp + PC1`). How do the results compare to the unadjusted analysis? ```{r adjusted-gwas} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ The trait that we are using for these GWAS is simulated, so we know the "answers" here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the "truth", re-evaluate your results above. Which model is better? Why? :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ ### Confounding The question of when we should include PCs --- or some other measure of genetic similarity / geography / ancestry / population structure --- in our GWAS models is at it's heart a question about **confounding**. STAT 155 Review: Suppose $Y$ is our outcome variable and $X$ is our predictor of interest. A third variable, $Z$, **confounds** the relationship between $X$ and $Y$ if... (fill in the blank). :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Using this definition as a guide, explain why `population` membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section. :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \ ### Omitted Variable Bias As you learned in STAT 155, if we believe a variable is a confounder we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation of above, if I am interested in the relationship between $X$ and $Y$ but I believe a third variable, $Z$ is a confounding variable, then I should fit the regression model `lm(y ~ x + z)` rather than `lm(y ~ x)`. If we don't do this, our results will be **biased**. :::{.callout-warning} This section is under construction! More detailed instructions coming soon. :::  \ \ ## Part 3: Application to Real Genetic Data (Learn a New R Package!) As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA "by hand" as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example --- created by some of researchers at the University of Washington --- is [`SNPRelate`](https://bioconductor.org/packages/release/bioc/html/SNPRelate.html). Let's try it out! \ First, install the `SNPRelate` R package from Bioconductor: ```{r} #| eval: false # NOTE: keep the code chunk option eval: false so this gets skipped when rendering # your installation code here ``` Then load it: ```{r} # your code here ``` \ Next, locate the "SNPRelate Tutorial" vignette. (Refer back to Lab 2 for a reminder on how to access a package's vignettes.) Read, and replicate, the first two parts of the *Data Analysis* section: "LD-based SNP pruning" and "Principal Component Analysis". Record your code and insights here. In particular... How many individuals are present in the provided example data? How many SNPs? ```{r} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ What populations are present in these data? What do each of the population codes (e.g., "CEU") stand for? ```{r} # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ The `snpgdsLDpruning` function performs a pre-processing step known as "LD pruning" which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning? ```{r} set.seed(494) # your code here ``` :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ Now, run PCA on this reduced set of "LD-pruned" SNPs. ```{r} # your code here ``` \ Finally, create the following plots: 1. a score plot (with individuals colored by their "true" population label) 2. parallel coordinates plot 3. loadings plot (the tutorial presents an example looking at the *correlation between eigenvectors and SNP genotypes*. this is similar to looking at the loadings. there's also a `snpgdsPCASNPLoading` function -- see if you can figure out how to use it.) 4. scree plot ```{r} # score plot ``` ```{r} # parallel coordinates plot ``` ```{r} # loadings plot ``` ```{r} # scree plot ``` What do you learn from these plots? Write down 1--2 sentences for each. :::{.callout-note icon=false title="Your Answer"} your answer here ::: \ \