library(tidyverse)
library(GGally)
library(gridExtra)
Lab 4: Principal component analysis
STAT 494: Statistical Genetics
Goals
- Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium)
- Compare GWAS models with and without PC adjustment
- Explore and derive omitted variable bias
- Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data
Getting Started
Access the source code for this document (click the
</> Code
button above)Copy-paste into a new QMD
Add your name as the
author
in the YAML header, aboveDownload the
.bib
file that contains the references cited later in this lab here. Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD.Load the packages that we’ll need for this lab. (And install them first if needed.)
- Note the new usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Exercises
Part 1: Building Intuition
Load Data
Read the toy data (pca_toy_data.csv
) into R:
<- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv') pcadata
Rows: 1000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore Data
Explore the toy dataset, with the goal of answering each of the questions below.
How many people are in this dataset? How many SNPs?
# your code here
How many populations are represented? How many people belong to each population?
# your code here
Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?
# your code here
Run PCA
There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning.
If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources:
- Stat 253 Videos: PCA (v1), PCA (v2)
- Stat 253 Course Notes: Lectures 19 (PCA) and 20 (PCR)
- Introduction to Statistical Learning: Sections 6.3.1 (PCR) and 12.2 (PCA)
First, we’ll need to set up the data. Create a data frame called geno
that contains only the genotype information for each of the SNPs (i.e., remove the trait
and population
information).
# your code here
Then, run the code chunk below to perform PCA using the prcomp
function.
<- prcomp(geno, center = TRUE, scale = TRUE) pca_out
Error: object 'geno' not found
Open the help page for prcomp
and read through the description of what it returns. Hint: see the “Value” section of the help page. Which component of the output contains the loadings? Which part contains the scores?
Based on what you learned from the help page, extract the loadings and scores and save them as new objects called pca_loadings
and pca_scores
, respectively. Hint: use $
or pluck()
to extract a particular component from a list.
# your code here
Plot Results
Scores
Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.
# your code here
What do you learn from this plot?
Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:
# parallel coordinates plot
%>%
pca_scores as.data.frame() %>%
mutate(population = as.factor(pcadata$population)) %>%
ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) +
theme_minimal() +
scale_color_brewer(palette = 'Dark2')
Error: object 'pca_scores' not found
What do you learn from the parallel coordinates plot?
Loadings
Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC.
# your code here
Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is?
Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?
# your code here
Variance Explained
As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.
# extract variance of each PC
<- (pca_out$sdev)^2 pca_var
Error: object 'pca_out' not found
# calculate proportion of variance explained
<- sum(pca_var) total_var
Error: object 'pca_var' not found
<- pca_var/total_var pve
Error: object 'pca_var' not found
Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs.
# your code here
What do you learn from this plot?
Impact of LD
In multiple journal club articles (Novembre et al. 2008; Burton et al. 2007) we’ve seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up “effects attributable to local linkage disequilibrium rather than genome-wide structure” (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages.
First, let’s review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review these notes, particularly Examples 1–5, if you haven’t taken it yet): what is often true about PCA and correlated features?
Next, explain how you could design a simulation study or, at least a “toy” data example like the one we’re using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?
Optional Challenge: Implement the approach you described above. What did you learn?
Part 2: Impact on GWAS
Toy Data
Now, let’s investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do not make any adjustment for population structure. In other words, use models of the form trait ~ snp
with no other covariates.
# your code here
What do you notice? Which SNP(s) have the smallest p-value?
Next, conduct a genome-wide association study using models that adjust for the first PC (trait ~ snp + PC1
). How do the results compare to the unadjusted analysis?
# your code here
The trait that we are using for these GWAS is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?
Confounding
The question of when we should include PCs — or some other measure of genetic similarity / geography / ancestry / population structure — in our GWAS models is at it’s heart a question about confounding.
STAT 155 Review: Suppose \(Y\) is our outcome variable and \(X\) is our predictor of interest. A third variable, \(Z\), confounds the relationship between \(X\) and \(Y\) if… (fill in the blank).
Using this definition as a guide, explain why population
membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section.
Omitted Variable Bias
As you learned in STAT 155, if we believe a variable is a confounder we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation of above, if I am interested in the relationship between \(X\) and \(Y\) but I believe a third variable, \(Z\) is a confounding variable, then I should fit the regression model lm(y ~ x + z)
rather than lm(y ~ x)
. If we don’t do this, our results will be biased.
This section is under construction! More detailed instructions coming soon.
Part 3: Application to Real Genetic Data (Learn a New R Package!)
As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA “by hand” as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example — created by some of researchers at the University of Washington — is SNPRelate
. Let’s try it out!
First, install the SNPRelate
R package from Bioconductor:
# NOTE: keep the code chunk option eval: false so this gets skipped when rendering
# your installation code here
Then load it:
# your code here
Next, locate the “SNPRelate Tutorial” vignette. (Refer back to Lab 2 for a reminder on how to access a package’s vignettes.) Read, and replicate, the first two parts of the Data Analysis section: “LD-based SNP pruning” and “Principal Component Analysis”. Record your code and insights here. In particular…
How many individuals are present in the provided example data? How many SNPs?
# your code here
What populations are present in these data? What do each of the population codes (e.g., “CEU”) stand for?
# your code here
The snpgdsLDpruning
function performs a pre-processing step known as “LD pruning” which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning?
set.seed(494)
# your code here
Now, run PCA on this reduced set of “LD-pruned” SNPs.
# your code here
Finally, create the following plots:
- a score plot (with individuals colored by their “true” population label)
- parallel coordinates plot
- loadings plot (the tutorial presents an example looking at the correlation between eigenvectors and SNP genotypes. this is similar to looking at the loadings. there’s also a
snpgdsPCASNPLoading
function – see if you can figure out how to use it.) - scree plot
# score plot
# parallel coordinates plot
# loadings plot
# scree plot
What do you learn from these plots? Write down 1–2 sentences for each.