Lab 4: Principal component analysis

STAT 494: Statistical Genetics

Author

Solutions

Published

March 7, 2025



Goals

  • Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium)
  • Compare GWAS models with and without PC adjustment
  • Explore and derive omitted variable bias
  • Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data



Getting Started

  1. Access the source code for this document (click the </> Code button above)

  2. Copy-paste into a new QMD

  3. Add your name as the author in the YAML header, above

  4. Download the .bib file that contains the references cited later in this lab here. Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD.

  5. Load the packages that we’ll need for this lab. (And install them first if needed.)

library(tidyverse)
library(GGally)
library(gridExtra)
  1. Note the new usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Your Answer

Here’s an example of one of our customized “Your Answer” callout blocks. Whenever you are answering a question, please put all text within the provided block.



Exercises

Part 1: Building Intuition

Load Data

Read the toy data (pca_toy_data.csv) into R:

pcadata <- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv')
Rows: 1000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.



Explore Data

Explore the toy dataset, with the goal of answering each of the questions below.

How many people are in this dataset? How many SNPs?

# look at first few rows
head(pcadata)
# A tibble: 6 × 17
  population  trait  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 SNP10
       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          1 -0.111     0     0     0     2     0     0     0     0     0     1
2          1  1.47      0     0     0     0     0     0     0     0     0     0
3          1  1.43      0     0     0     0     0     0     1     0     1     0
4          1 -1.72      0     0     0     1     0     0     0     1     0     0
5          1 -0.356     0     0     0     0     0     1     1     1     0     0
6          1  0.516     0     0     0     0     0     0     1     0     0     0
# ℹ 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
#   SNP15 <dbl>
# calculate dimensions
dim(pcadata)
[1] 1000   17
Answer

There are are 1000 people and 15 SNPs


How many populations are represented? How many people belong to each population?

# count how many people are in each population
pcadata %>%
  count(population)
# A tibble: 2 × 2
  population     n
       <dbl> <int>
1          1   500
2          2   500
Answer

There are two populations (Population 1 and Population 2), with 500 people from each.


Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?

# here's one way (but not the only way) to do this

# function to get empirical minor allele frequency
## count up how many minor alleles are observed
## divide by two times the number of people
get_MAF <- function(snp){
  sum(snp)/(2*length(snp))
}

# get observed allele frequency for each population
pcadata %>%
  group_by(population) %>%
  summarize_all(get_MAF)
# A tibble: 2 × 17
  population trait  SNP1  SNP2  SNP3  SNP4  SNP5  SNP6  SNP7  SNP8  SNP9 SNP10
       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          1 0.268     0     0 0.298 0.291 0.293 0.102 0.107 0.094 0.095 0.097
2          2 0.997     1     1 0.512 0.506 0.504 0.11  0.085 0.097 0.092 0.111
# ℹ 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
#   SNP15 <dbl>
Answer

From this, we notice that the first two SNPs have extreme differences in the minor allele frequencies between Populations 1 and 2: no one in Population 1 has the minor allele at either SNP, whereas everyone in Population 2 has two copies.

We also see that SNPs 3–5 have different frequencies in the two populations, with the minor allele being more common (MAF \(\approx\) 50%) in Population 2 as opposed to Population 1 (MAF \(\approx\) 30%).

The remaining SNPs seem to be approximately equally frequent across the two populations, with a minor allele frequency of about 10% in both Population 1 and Population 2.



Run PCA

There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning.

Tip

If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources:


First, we’ll need to set up the data. Create a data frame called geno that contains only the genotype information for each of the SNPs (i.e., remove the trait and population information).

geno <- pcadata %>%
  select(-population, -trait)


Then, run the code chunk below to perform PCA using the prcomp function.

pca_out <- prcomp(geno, center = TRUE, scale = TRUE)


Open the help page for prcomp and read through the description of what it returns. Hint: see the “Value” section of the help page. Which component of the output contains the loadings? Which part contains the scores?

Answer
  • Loadings: rotation
  • Scores: x


Based on what you learned from the help page, extract the loadings and scores and save them as new objects called pca_loadings and pca_scores, respectively. Hint: use $ or pluck() to extract a particular component from a list.

pca_loadings <- pca_out$rotation
pca_scores <- pca_out$x



Plot Results

Scores

Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.

pca_scores %>%
  as.data.frame() %>% # convert pca_scores into a data frame for plotting
  mutate(population = as.factor(pcadata$population)) %>%  # add the population labels
  ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
  geom_point() + 
  scale_color_brewer(palette = 'Dark2')

What do you learn from this plot?

Answer

Although we didn’t give PCA any information about the populations that these individuals belonged to, we see that the first PC seems to separate individuals out into these two populations!


Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:

# parallel coordinates plot
pca_scores %>%
  as.data.frame() %>%
  mutate(population = as.factor(pcadata$population)) %>% 
  ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) + 
  theme_minimal() + 
  scale_color_brewer(palette = 'Dark2')

What do you learn from the parallel coordinates plot?

Answer

The parallel coordinates plot is a way of visualizing the scores for all of our PCs at once. Each line on this plot is one person, and tracing their path through the parallel coordinates plot tells us their scores for each PC. Coloring the plot by the population labels can highlight patterns in terms of which PCs seem to be capturing population membership. In this case, we can see that the first PC seems to be separating out the individuals in Population 1 (lower scores) from Population 2 (higher scores), whereas the mixture of colors for later PCs indicated that those PCs don’t seem to be capturing population membership.

Note: in practice, some groups use parallel coordinates plots to decide how many PCs they want to adjust for in their GWAS models! Here’s an example from a study that I have worked with: Hispanic Community Health Study/Study of Latinos (see Figure 3).



Loadings

Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC.

# option 1
pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC1)) + 
  geom_point() + 
  labs(xlab = 'SNP Number', ylab = 'Loadings for PC1') + 
  theme_minimal()

# option 2
library(reshape2)
pca_loadings %>% 
  melt() %>%
  filter(Var2 == "PC1") %>% 
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(y = "loadings", x = "original features", fill = "original features") + 
  theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1))

Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is?

Answer

SNPs 1 and 2 have the highest loadings, followed by SNPs 3–5. These are the SNPs that have different allele frequencies in the two populations.


Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?

# option 1
l2 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC2)) + 
  geom_point() + 
  labs(xlab = 'SNP Number', ylab = 'Loadings for PC2') + 
  theme_minimal()
l3 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC3)) + 
  geom_point() + 
  labs(xlab = 'SNP Number', ylab = 'Loadings for PC3') + 
  theme_minimal()
l4 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC4)) + 
  geom_point() + 
  labs(xlab = 'SNP Number', ylab = 'Loadings for PC4') + 
  theme_minimal()
grid.arrange(l2, l3, l4, ncol = 1)

# option 2
pca_loadings[, 1:4] %>%
  melt() %>% 
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Var2) + 
  labs(y = "loadings", x = "original features", fill = "original features") +
  theme(axis.text.x=element_blank(), axis.ticks.x = element_blank())

Answer

There doesn’t seem to be as much of a distinct pattern for these later PCs in terms of particular SNPs having more weight.



Variance Explained

As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.

# extract variance of each PC
pca_var <- (pca_out$sdev)^2

# calculate proportion of variance explained
total_var <- sum(pca_var)
pve <- pca_var/total_var


Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs.

# option 1
pve %>%
  as.data.frame() %>%
  mutate(index = seq_len(length(pca_var))) %>%
  ggplot(aes(x = index, y = pve)) + 
  geom_point() + 
  geom_line() + 
  labs(x = 'SNP Number', y = 'Percent of Variance Explained') + 
  theme_minimal()

# option 2: stat 253 code
library(tidymodels)
pca_out %>%
  tidy(matrix = 'eigenvalues') %>%
  ggplot(aes(y = percent, x = PC)) + 
  geom_point(size = 2) + 
  geom_line() + 
  labs(y = "% of variance explained")

# option 3: plot cumulative % variance explained
pca_out %>% 
  tidy(matrix = "eigenvalues") %>% 
  rbind(0) %>% 
  ggplot(aes(y = cumulative, x = PC)) + 
    geom_point(size = 2) + 
    geom_line() + 
    labs(y = "CUMULATIVE % of variance explained")

What do you learn from this plot?

Answer

The first PC explains over 15% of the total variance. The percent of variance quickly drops (down to 5–7%) when we look at the remaining PCs.

Note: looking at the percent of variance explained by each PC is another tool that researchers use to decide how many PCs to include in their GWAS models. This is sometimes referred to as the “elbow” method since the idea is to look at a scree plot, find the “elbow” in the curve, and then use all the PCs prior to that point where the line flattens out.



BONUS: Another Example (Three Populations)

Suppose we have three populations instead of just two. How many PCs do we need in order to capture “ancestry” then?

Let’s simulate some data:

# simulate a genetic variant with freq. depending on which population the person belongs to
sim_data_onevar <- function(pop, maf){
  snp <- rbinom(n = length(pop), size = 2, p = maf[pop])
  return(snp)
}

# create dataset with three types of SNPs
popidx <- rep(1:3, each = 500)
maf1 <- c(0, 0.8, 0.7) # type 1: non-existent in pop 1, similarly frequent in pops 2 and 3
maf2 <- c(0.3, 0.5, 0.6) # type 2: similar freq in pops 2 and 3, slightly less common in pop 1
maf3 <- c(0.1, 0.1, 0.7) # type 3: similar freq in pops 1 and 2, much more common in pop 3 

set.seed(494)
snps1 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf1)) # create 5 (independent) snps of type 1
snps2 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf2)) # create 5 (independent) snps of type 2
snps3 <- replicate(10, sim_data_onevar(pop = popidx, maf = maf3)) # create 10 (independent) snps of type 3

# put together into a data frame
pcadata3pop <- cbind(popidx, snps1, snps2, snps3) %>%
  as.data.frame()
names(pcadata3pop) <- c('population', paste0('SNP', 1:(ncol(pcadata3pop)-1)))

# check it out
dim(pcadata3pop)
[1] 1500   21
head(pcadata3pop)
  population SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
1          1    0    0    0    0    0    0    0    1    1     0     0     1
2          1    0    0    0    0    0    0    0    0    0     0     0     0
3          1    0    0    0    0    0    0    1    0    2     0     0     0
4          1    0    0    0    0    0    2    2    1    0     0     0     1
5          1    0    0    0    0    0    1    1    0    1     1     1     0
6          1    0    0    0    0    0    0    0    1    0     1     0     0
  SNP13 SNP14 SNP15 SNP16 SNP17 SNP18 SNP19 SNP20
1     0     0     0     0     0     0     1     0
2     0     1     0     0     1     0     0     0
3     0     0     1     0     0     2     0     0
4     0     1     0     0     0     0     0     0
5     0     0     0     1     0     0     1     0
6     0     0     0     0     0     1     1     2
pcadata3pop %>%
  count(population)
  population   n
1          1 500
2          2 500
3          3 500

This dataset has 20 SNPs and 500 people in each of three populations.

Let’s run PCA on these data:

# pull out genotypes
geno3pop <- pcadata3pop %>%
  select(-population)

# run PCA
pca_out_3pop <- prcomp(geno3pop, center = T, scale = T)

# pull out scores, loadings, and variance
scores3pop <- pca_out_3pop$x
loadings3pop <- pca_out_3pop$rotation
var3pop <- (pca_out_3pop$sdev)^2

Here’s what our scores look like:

scores3pop %>%
  as.data.frame() %>% # convert pca_scores into a data frame for plotting
  mutate(population = as.factor(pcadata3pop$population)) %>%  # add the population labels
  ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
  geom_point() + 
  scale_color_brewer(palette = 'Dark2')

Notice that the combination of PCs 1 and 2 can distinguish between the three populations.

And here’s the parallel coordinates plot:

# parallel coordinates plot
scores3pop %>%
  as.data.frame() %>% # convert pca_scores into a data frame for plotting
  mutate(population = as.factor(pcadata3pop$population)) %>% # add the population labels
  ggparcoord(columns = 1:20, groupColumn = 'population', alpha = 0.2) + # plot the first 15 columns
  theme_minimal() + 
  scale_color_brewer(palette = 'Dark2')

From this, we see again that PCs 1 and 2 are useful for distinguishing the three populations.

And here’s the scree plot:

# calculate proportion of variance explained
pve3pop <- var3pop/sum(var3pop)

# scree plot
pve3pop %>%
  as.data.frame() %>%
  mutate(index = seq_len(length(pve3pop))) %>%
  ggplot(aes(x = index, y = pve3pop)) + 
  geom_point() + 
  geom_line() + 
  labs(x = 'SNP Number', y = 'Percent of Variance Explained') + 
  theme_minimal()

Now it looks like we need 2 PCs to capture population structure!



Impact of LD

In multiple journal club articles (Novembre et al. 2008; Burton et al. 2007) we’ve seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up “effects attributable to local linkage disequilibrium rather than genome-wide structure” (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages.


First, let’s review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review these notes, particularly Examples 1–5, if you haven’t taken it yet): what is often true about PCA and correlated features?

Answer

Top PCs tend to give higher weight (i.e. larger loadings) to correlated features!


Next, explain how you could design a simulation study or, at least a “toy” data example like the one we’re using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?

Answer
  • Use real genetic data or simulate data in such a way that some SNPs are strongly correlated with one another and others are not
  • Run PCA on those data
  • Check the SNP loadings – do the SNPs that are highly correlated with one another have higher loadings?
  • Then, assess what happens if you reduce the amount of correlation. Do patterns in loadings change?


Optional Challenge: Implement the approach you described above. What did you learn?

Answer (Optional Challenge)

I recently wrote a paper about this topic :) Check it out if you’re interested in learning more: link.



Part 2: Impact on GWAS

Toy Data

Now, let’s investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do not make any adjustment for population structure. In other words, use models of the form trait ~ snp with no other covariates.

# empty vector to store p-values
pvals <- c()

# loop through the 15 SNPs
for(i in 1:15){
  dat <- pcadata %>% select(trait, paste0('SNP',i)) # pull out just the trait and SNP of interest
  mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is just SNP i in this case
  pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}

# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
  ggplot(aes(y = -log10(p), x = SNP)) + 
  geom_point() + 
  theme_minimal() + 
  ggtitle('Unadjusted Analysis')

What do you notice? Which SNP(s) have the smallest p-value?

Answer

In this analysis, SNP3 has the smallest p-value, and the p-values for SNPs 1 and 2 are also quite small.


Next, conduct a genome-wide association study using models that adjust for the first PC (trait ~ snp + PC1). How do the results compare to the unadjusted analysis?

# empty vector to store p-values
pvals <- c()

# loop through the 15 SNPs
for(i in 1:15){
  dat <- pcadata %>% 
    select(trait, paste0('SNP',i)) %>% # pull out just the trait and SNP of interest
    mutate(PC1 = pca_scores[,1]) # add the scores for PC1
  mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is SNP i and PC1
  pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}

# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
  ggplot(aes(y = -log10(p), x = SNP)) + 
  geom_point() + 
  theme_minimal() + 
  ggtitle('Adjusted Analysis')

Answer

In this adjusted analysis, SNP3 is now the only one with a particularly small p-value.


The trait that we are using for these GWAS is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?

Answer

The second model (that adjusts for PC1) is better. When we don’t make any adjustment, we see that SNPs 1 and 2 (which have a particularly strong association with which population the individual belongs to) have very small p-values even though they shouldn’t (since they’re not actually causally related to the trait): in other words, we’d probably have two false positives here (depending on our choice of significance threshold). This is all happening because of the phenomenon of confounding!



Confounding

The question of when we should include PCs — or some other measure of genetic similarity / geography / ancestry / population structure — in our GWAS models is at it’s heart a question about confounding.

STAT 155 Review: Suppose \(Y\) is our outcome variable and \(X\) is our predictor of interest. A third variable, \(Z\), confounds the relationship between \(X\) and \(Y\) if… (fill in the blank).

Answer

My preferred definition of confounding is as follows:

  • \(Z\) is causally associated with the outcome \(Y\) in the population
  • \(Z\) is associated with the predictor of interest \(X\) in our sample
  • \(Z\) is not caused by/a result of the predictor of interest \(X\)

You will also see people use the following:

  • \(Z\) is causally associated with the outcome \(Y\) in the population
  • \(Z\) is causally associated with the predictor of interest \(X\) in the population
  • (in other words: \(Z\) is a “common cause” of both the outcome and the predictor of interest)


Using this definition as a guide, explain why population membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section.

Answer
  • population membership is causally associated with our trait by design – I simulated the trait to depend on the population to which the person belongs
  • population membership is also associated with the genotype at SNPs 1–5 (we saw above that the minor allele frequencies differ between the two populations)
  • population membership is not a result of genotype at these five SNPs; if anything, the causal relationship goes the other direction

For SNPs 1–5, population membership will be a confounder. For the other 10 SNPs, population does not meet the definition of a confounder since it is not associated with the “predictor of interest” (i.e., genotype at those SNPs).



Omitted Variable Bias

As you learned in STAT 155, if we believe a variable is a confounder we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation of above, if I am interested in the relationship between \(X\) and \(Y\) but I believe a third variable, \(Z\) is a confounding variable, then I should fit the regression model lm(y ~ x + z) rather than lm(y ~ x). If we don’t do this, our results will be biased.

Answer

We’ll complete this exercise during class time. No need to submit anything here.



Part 3: Application to Real Genetic Data (Learn a New R Package!)

As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA “by hand” as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example — created by some of researchers at the University of Washington — is SNPRelate. Let’s try it out!


First, install the SNPRelate R package from Bioconductor:

# NOTE: keep the code chunk option eval: false so this gets skipped when rendering
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SNPRelate")

Then load it:

library(SNPRelate)


Next, locate the “SNPRelate Tutorial” vignette. (Refer back to Lab 2 for a reminder on how to access a package’s vignettes.) Read, and replicate, the first two parts of the Data Analysis section: “LD-based SNP pruning” and “Principal Component Analysis”. Record your code and insights here. In particular…

# get list of vignettes
vignette(package = 'SNPRelate')

# view specific vignette
vignette("SNPRelate", package = "SNPRelate")


How many individuals are present in the provided example data? How many SNPs?

# Open the GDS file
genofile <- snpgdsOpen(snpgdsExampleFileName())

# look at file
genofile
File: /Users/kgrinde/Library/R/arm64/4.4/library/SNPRelate/extdata/hapmap_geno.gds (709.6K)
+    [  ] *
|--+ sample.id   { VStr8 279 ZIP(29.9%), 679B }
|--+ snp.id   { Int32 9088 ZIP(34.8%), 12.3K }
|--+ snp.rs.id   { VStr8 9088 ZIP(40.1%), 36.2K }
|--+ snp.position   { Int32 9088 ZIP(94.7%), 33.6K }
|--+ snp.chromosome   { UInt8 9088 ZIP(0.94%), 85B } *
|--+ snp.allele   { VStr8 9088 ZIP(11.3%), 4.0K }
|--+ genotype   { Bit2 279x9088, 619.0K } *
\--+ sample.annot   [ data.frame ] *
   |--+ family.id   { VStr8 279 ZIP(34.4%), 514B }
   |--+ father.id   { VStr8 279 ZIP(31.5%), 220B }
   |--+ mother.id   { VStr8 279 ZIP(30.9%), 214B }
   |--+ sex   { VStr8 279 ZIP(17.0%), 95B }
   \--+ pop.group   { VStr8 279 ZIP(6.18%), 69B }
Answer
  • 279 people (length of sample.id and other sample info)
  • 9088 SNPs (length of snp.id and other SNP info)


What populations are present in these data? What do each of the population codes (e.g., “CEU”) stand for?

# Get population information
pop_code <- read.gdsn(index.gdsn(genofile,
                                 path = "sample.annot/pop.group"))

table(pop_code)
pop_code
CEU HCB JPT YRI 
 92  47  47  93 
Answer

Four populations:

  • CEU: Utah residents with Northern and Western European ancestry
  • HCB: Han Chinese in Beijing, China
  • JPT: Japanese in Tokyo, Japan
  • YRI: Yoruba in Ibadan, Nigera

Source: HapMap


The snpgdsLDpruning function performs a pre-processing step known as “LD pruning” which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning?

# NOTE: LD pruning can be time-intensive so adding cache: true may be helpful here

# run LD pruning with default LD threshold
set.seed(494)
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2) 
SNP pruning based on LD:
Excluding 365 SNPs on non-autosomes
Excluding 139 SNPs (monomorphic: TRUE, MAF: 0.005, missing rate: 0.05)
    # of samples: 279
    # of SNPs: 8,584
    using 1 thread
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.2
    method: composite
Chrom 1: |====================|====================|
    74.16%, 531 / 716 (Mon Mar 24 19:09:53 2025)
Chrom 2: |====================|====================|
    72.24%, 536 / 742 (Mon Mar 24 19:09:53 2025)
Chrom 3: |====================|====================|
    73.73%, 449 / 609 (Mon Mar 24 19:09:53 2025)
Chrom 4: |====================|====================|
    72.60%, 408 / 562 (Mon Mar 24 19:09:53 2025)
Chrom 5: |====================|====================|
    75.62%, 428 / 566 (Mon Mar 24 19:09:53 2025)
Chrom 6: |====================|====================|
    73.81%, 417 / 565 (Mon Mar 24 19:09:53 2025)
Chrom 7: |====================|====================|
    75.42%, 356 / 472 (Mon Mar 24 19:09:53 2025)
Chrom 8: |====================|====================|
    69.88%, 341 / 488 (Mon Mar 24 19:09:53 2025)
Chrom 9: |====================|====================|
    76.44%, 318 / 416 (Mon Mar 24 19:09:53 2025)
Chrom 10: |====================|====================|
    73.29%, 354 / 483 (Mon Mar 24 19:09:53 2025)
Chrom 11: |====================|====================|
    75.84%, 339 / 447 (Mon Mar 24 19:09:53 2025)
Chrom 12: |====================|====================|
    74.71%, 319 / 427 (Mon Mar 24 19:09:53 2025)
Chrom 13: |====================|====================|
    75.87%, 261 / 344 (Mon Mar 24 19:09:53 2025)
Chrom 14: |====================|====================|
    76.24%, 215 / 282 (Mon Mar 24 19:09:53 2025)
Chrom 15: |====================|====================|
    75.57%, 198 / 262 (Mon Mar 24 19:09:53 2025)
Chrom 16: |====================|====================|
    71.58%, 199 / 278 (Mon Mar 24 19:09:53 2025)
Chrom 17: |====================|====================|
    73.43%, 152 / 207 (Mon Mar 24 19:09:53 2025)
Chrom 18: |====================|====================|
    72.93%, 194 / 266 (Mon Mar 24 19:09:53 2025)
Chrom 19: |====================|====================|
    82.50%, 99 / 120 (Mon Mar 24 19:09:53 2025)
Chrom 20: |====================|====================|
    70.74%, 162 / 229 (Mon Mar 24 19:09:53 2025)
Chrom 21: |====================|====================|
    75.40%, 95 / 126 (Mon Mar 24 19:09:53 2025)
Chrom 22: |====================|====================|
    74.14%, 86 / 116 (Mon Mar 24 19:09:53 2025)
6,457 markers are selected in total.
# Get all selected snp id
snpset.id <- unlist(unname(snpset))
head(snpset.id)
[1]  1  2  4  5  7 10
length(snpset.id)
[1] 6457
Answer

After pruning there are 6,457 SNPs left.

Remember that we started with 9088. You can see in the output above that 365 SNPs on “non-autosomes” were removed (i.e., keep only SNPs on chromosomes 1–22). Another 139 SNPs were removed that were monomomorphic (MAF = 0), very rare (MAF < 0.005), or had high rates of missing values. Finally, another \(\approx\) 2k SNPs were removed that were in high LD with neighboring SNPs.


Now, run PCA on this reduced set of “LD-pruned” SNPs.

# NOTE: I recommend using cache: true here again

# Run PCA
pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2)
Principal Component Analysis (PCA) on genotypes:
Excluding 2,631 SNPs (non-autosomes or non-selection)
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 279
    # of SNPs: 6,457
    using 2 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1808200
CPU capabilities:
Mon Mar 24 19:09:53 2025    (internal increment: 1760)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
Mon Mar 24 19:09:53 2025    Begin (eigenvalues and eigenvectors)
Mon Mar 24 19:09:53 2025    Done.


Finally, create the following plots:

  1. a score plot (with individuals colored by their “true” population label)
  2. parallel coordinates plot
  3. loadings plot (the tutorial presents an example looking at the correlation between eigenvectors and SNP genotypes. this is similar to looking at the loadings. there’s also a snpgdsPCASNPLoading function – see if you can figure out how to use it.)
  4. scree plot
# score plot

# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

# Get population information
#   or pop_code <- scan("pop.txt", what=character())
#   if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))

# assume the order of sample IDs is as the same as population codes
head(cbind(sample.id, pop_code))
     sample.id pop_code
[1,] "NA19152" "YRI"   
[2,] "NA19139" "YRI"   
[3,] "NA18912" "YRI"   
[4,] "NA19160" "YRI"   
[5,] "NA07034" "CEU"   
[6,] "NA07055" "CEU"   
# Make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, sample.id)],
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    stringsAsFactors = FALSE)
head(tab)
  sample.id pop         EV1          EV2
1   NA19152 YRI -0.08264960 -0.009838342
2   NA19139 YRI -0.08349414 -0.010831381
3   NA18912 YRI -0.08265756 -0.010841330
4   NA19160 YRI -0.08786783 -0.012579223
5   NA07034 CEU  0.03159633  0.079261324
6   NA07055 CEU  0.03515841  0.081038435
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))

# parallel coordinates plot
library(MASS)

datpop <- factor(pop_code)[match(pca$sample.id, sample.id)]
parcoord(pca$eigenvect[,1:16], col=datpop)

# loadings plot

# Get chromosome index
chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))

# get loadings
LOAD <- snpgdsPCASNPLoading(pca, genofile)
SNP Loading:
    # of samples: 279
    # of SNPs: 6,457
    using 1 thread
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1808200
Tue Apr  8 20:38:53 2025    (internal increment: 14092)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
Tue Apr  8 20:38:53 2025    Done.
# plot loadings for first 2 PCs
savepar <- par(mfrow=c(2,1), mai=c(0.45, 0.55, 0.1, 0.25))
for (i in 1:2)
{
    plot(abs(LOAD$snploading[i,]), xlab="", ylab=paste("PC", i),
        col=chr, pch="+")
}

par(savepar)
# scree plot
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
[1] 10.29  5.61  1.03  0.98  0.87  0.77
plot(pc.percent, type = 'b', xlim = c(0, 30) )

What do you learn from these plots? Write down 1–2 sentences for each.

Answer

a few takeaways:

  • seems like we’d need 2–3 PCs to capture the “population structure” in this dataset
  • the top PCs can’t distinguish between HCB and JPT
  • after LD purning, all SNPs seem to be contributing relatively equally to the top PCs



References

Burton, Paul R., David G. Clayton, Lon R. Cardon, Nick Craddock, Panos Deloukas, Audrey Duncanson, Dominic P. Kwiatkowski, et al. 2007. “Genome-Wide Association Study of 14,000 Cases of Seven Common Diseases and 3,000 Shared Controls.” Nature 447 (7145): 661–78.
Novembre, John, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R Boyko, Adam Auton, Amit Indap, et al. 2008. “Genes Mirror Geography Within Europe.” Nature 456 (7218): 98–101.