Lab 4: Principal component analysis

STAT 494: Statistical Genetics

Published

February 25, 2025



Goals

  • Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium)
  • Compare GWAS models with and without PC adjustment
  • Explore and derive omitted variable bias
  • Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data



Getting Started

  1. Access the source code for this document (click the </> Code button above)

  2. Copy-paste into a new QMD

  3. Add your name as the author in the YAML header, above

  4. Download the .bib file that contains the references cited later in this lab here. Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD.

  5. Load the packages that we’ll need for this lab. (And install them first if needed.)

library(tidyverse)
library(GGally)
library(gridExtra)
  1. Note the new usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Your Answer

Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.



Exercises

Part 1: Building Intuition

Load Data

Read the toy data (pca_toy_data.csv) into R:

pcadata <- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv')
Rows: 1000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



Explore Data

Explore the toy dataset, with the goal of answering each of the questions below.

How many people are in this dataset? How many SNPs?

# your code here
Your Answer

type your answer here (keeping it inside this callout block.)


How many populations are represented? How many people belong to each population?

# your code here
Your Answer

your answer here


Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?

# your code here
Your Answer

your answer here



Run PCA

There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning.

Tip

If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources:


First, we’ll need to set up the data. Create a data frame called geno that contains only the genotype information for each of the SNPs (i.e., remove the trait and population information).

# your code here


Then, run the code chunk below to perform PCA using the prcomp function.

pca_out <- prcomp(geno, center = TRUE, scale = TRUE)
Error: object 'geno' not found


Open the help page for prcomp and read through the description of what it returns. Hint: see the “Value” section of the help page. Which component of the output contains the loadings? Which part contains the scores?

Your Answer

your answer here


Based on what you learned from the help page, extract the loadings and scores and save them as new objects called pca_loadings and pca_scores, respectively. Hint: use $ or pluck() to extract a particular component from a list.

# your code here



Plot Results

Scores

Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.

# your code here

What do you learn from this plot?

Your Answer

your answer here


Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:

# parallel coordinates plot
pca_scores %>%
  as.data.frame() %>%
  mutate(population = as.factor(pcadata$population)) %>% 
  ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) + 
  theme_minimal() + 
  scale_color_brewer(palette = 'Dark2')
Error: object 'pca_scores' not found

What do you learn from the parallel coordinates plot?

Your Answer

your answer here



Loadings

Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC.

# your code here

Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is?

Your Answer

your answer here


Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?

# your code here
Your Answer

your answer here



Variance Explained

As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.

# extract variance of each PC
pca_var <- (pca_out$sdev)^2
Error: object 'pca_out' not found
# calculate proportion of variance explained
total_var <- sum(pca_var)
Error: object 'pca_var' not found
pve <- pca_var/total_var
Error: object 'pca_var' not found


Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs.

# your code here

What do you learn from this plot?

Your Answer

your answer here



Impact of LD

In multiple journal club articles (Novembre et al. 2008; Burton et al. 2007) we’ve seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up “effects attributable to local linkage disequilibrium rather than genome-wide structure” (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages.


First, let’s review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review these notes, particularly Examples 1–5, if you haven’t taken it yet): what is often true about PCA and correlated features?

Your Answer

your answer here


Next, explain how you could design a simulation study or, at least a “toy” data example like the one we’re using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?

Your Answer

your answer here


Optional Challenge: Implement the approach you described above. What did you learn?

Your Answer (Optional Challenge)

your answer here



Part 2: Impact on GWAS

Toy Data

Now, let’s investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do not make any adjustment for population structure. In other words, use models of the form trait ~ snp with no other covariates.

# your code here

What do you notice? Which SNP(s) have the smallest p-value?

Your Answer

your answer here


Next, conduct a genome-wide association study using models that adjust for the first PC (trait ~ snp + PC1). How do the results compare to the unadjusted analysis?

# your code here
Your Answer

your answer here


The trait that we are using for these GWAS is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?

Your Answer

your answer here



Confounding

The question of when we should include PCs — or some other measure of genetic similarity / geography / ancestry / population structure — in our GWAS models is at it’s heart a question about confounding.

STAT 155 Review: Suppose \(Y\) is our outcome variable and \(X\) is our predictor of interest. A third variable, \(Z\), confounds the relationship between \(X\) and \(Y\) if… (fill in the blank).

Your Answer

your answer here


Using this definition as a guide, explain why population membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section.

Your Answer

your answer here



Omitted Variable Bias

As you learned in STAT 155, if we believe a variable is a confounder we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation of above, if I am interested in the relationship between \(X\) and \(Y\) but I believe a third variable, \(Z\) is a confounding variable, then I should fit the regression model lm(y ~ x + z) rather than lm(y ~ x). If we don’t do this, our results will be biased.

Warning

This section is under construction! More detailed instructions coming soon.



Part 3: Application to Real Genetic Data (Learn a New R Package!)

As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA “by hand” as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example — created by some of researchers at the University of Washington — is SNPRelate. Let’s try it out!


First, install the SNPRelate R package from Bioconductor:

# NOTE: keep the code chunk option eval: false so this gets skipped when rendering
# your installation code here

Then load it:

# your code here


Next, locate the “SNPRelate Tutorial” vignette. (Refer back to Lab 2 for a reminder on how to access a package’s vignettes.) Read, and replicate, the first two parts of the Data Analysis section: “LD-based SNP pruning” and “Principal Component Analysis”. Record your code and insights here. In particular…

How many individuals are present in the provided example data? How many SNPs?

# your code here
Your Answer

your answer here


What populations are present in these data? What do each of the population codes (e.g., “CEU”) stand for?

# your code here
Your Answer

your answer here


The snpgdsLDpruning function performs a pre-processing step known as “LD pruning” which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning?

set.seed(494)
# your code here
Your Answer

your answer here


Now, run PCA on this reduced set of “LD-pruned” SNPs.

# your code here


Finally, create the following plots:

  1. a score plot (with individuals colored by their “true” population label)
  2. parallel coordinates plot
  3. loadings plot (the tutorial presents an example looking at the correlation between eigenvectors and SNP genotypes. this is similar to looking at the loadings. there’s also a snpgdsPCASNPLoading function – see if you can figure out how to use it.)
  4. scree plot
# score plot
# parallel coordinates plot
# loadings plot
# scree plot

What do you learn from these plots? Write down 1–2 sentences for each.

Your Answer

your answer here



References

Burton, Paul R., David G. Clayton, Lon R. Cardon, Nick Craddock, Panos Deloukas, Audrey Duncanson, Dominic P. Kwiatkowski, et al. 2007. “Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls.” Nature 447 (7145): 661–78.
Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R Boyko, Adam Auton, Amit Indap, et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456 (7218): 98–101.