library(tidyverse)
library(GGally)
library(gridExtra)
Lab 4: Principal component analysis
STAT 494: Statistical Genetics
Goals
- Gain intuition for how/why PCA captures population structure (and other features like linkage disequilibrium)
- Compare GWAS models with and without PC adjustment
- Explore and derive omitted variable bias
- Implement PCA in R, using a base R function and an R package specifically designed for working with genetic data
Getting Started
- Access the source code for this document (click the </> Code button above).
- Copy-paste it into a new QMD.
- Add your name as the author in the YAML header, above.
- Download the .bib file that contains the references cited later in this lab here. Update the path to the bibliography file in the YAML header, above, depending on where you saved it relative to your QMD.
- Load the packages that we’ll need for this lab. (And install them first if needed.)
- Note the new usage of customized Quarto callouts, like below, for typing your answers. Please keep all answers within these blocks. This will make it easier for us to review your work!
Exercises
Part 1: Building Intuition
Load Data
Read the toy data (pca_toy_data.csv
) into R:
pcadata <- read_csv('https://mac-stat.github.io/statgen/data/pca_toy_data.csv')
Rows: 1000 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): population, trait, SNP1, SNP2, SNP3, SNP4, SNP5, SNP6, SNP7, SNP8,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Explore Data
Explore the toy dataset, with the goal of answering each of the questions below.
How many people are in this dataset? How many SNPs?
# look at first few rows
head(pcadata)
# A tibble: 6 × 17
population trait SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.111 0 0 0 2 0 0 0 0 0 1
2 1 1.47 0 0 0 0 0 0 0 0 0 0
3 1 1.43 0 0 0 0 0 0 1 0 1 0
4 1 -1.72 0 0 0 1 0 0 0 1 0 0
5 1 -0.356 0 0 0 0 0 1 1 1 0 0
6 1 0.516 0 0 0 0 0 0 1 0 0 0
# ℹ 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
# SNP15 <dbl>
# calculate dimensions
dim(pcadata)
[1] 1000 17
How many populations are represented? How many people belong to each population?
# count how many people are in each population
pcadata %>%
  count(population)
# A tibble: 2 × 2
population n
<dbl> <int>
1 1 500
2 2 500
Are there any SNPs that seem to be more common in one population than the other? By how much do the allele frequencies in the two populations differ for each SNP?
# here's one way (but not the only way) to do this
# function to get empirical minor allele frequency
## count up how many minor alleles are observed
## divide by two times the number of people
get_MAF <- function(snp){
  sum(snp)/(2*length(snp))
}
# get observed allele frequency for each population
pcadata %>%
  group_by(population) %>%
  summarize_all(get_MAF)
# A tibble: 2 × 17
population trait SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.268 0 0 0.298 0.291 0.293 0.102 0.107 0.094 0.095 0.097
2 2 0.997 1 1 0.512 0.506 0.504 0.11 0.085 0.097 0.092 0.111
# ℹ 5 more variables: SNP11 <dbl>, SNP12 <dbl>, SNP13 <dbl>, SNP14 <dbl>,
# SNP15 <dbl>
Run PCA
There is more than one way to run PCA in R. We’ll use similar code to what you may have seen in STAT 253: Statistical Machine Learning.
If PCA is new to you, or you need a refresher, I recommend taking some time to review the following resources:
- Stat 253 Videos: PCA (v1), PCA (v2)
- Stat 253 Course Notes: Lectures 19 (PCA) and 20 (PCR)
- Introduction to Statistical Learning: Sections 6.3.1 (PCR) and 12.2 (PCA)
First, we’ll need to set up the data. Create a data frame called geno
that contains only the genotype information for each of the SNPs (i.e., remove the trait
and population
information).
geno <- pcadata %>%
  select(-population, -trait)
Then, run the code chunk below to perform PCA using the prcomp
function.
pca_out <- prcomp(geno, center = TRUE, scale = TRUE)
Open the help page for prcomp
and read through the description of what it returns. Hint: see the “Value” section of the help page. Which component of the output contains the loadings? Which part contains the scores?
Based on what you learned from the help page, extract the loadings and scores and save them as new objects called pca_loadings
and pca_scores
, respectively. Hint: use $
or pluck()
to extract a particular component from a list.
pca_loadings <- pca_out$rotation
pca_scores <- pca_out$x
Plot Results
Scores
Create a scatterplot comparing the scores for the first two PCs, with points colored by which population the individual belongs to.
pca_scores %>%
  as.data.frame() %>% # convert pca_scores into a data frame for plotting
  mutate(population = as.factor(pcadata$population)) %>% # add the population labels
  ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
  geom_point() +
  scale_color_brewer(palette = 'Dark2')
What do you learn from this plot?
Another way to visualize the PC scores is to make what’s known as a parallel coordinates plot. This plot will allow us to look at all of our PCs at once. As above, we’ll color each of the individuals according to which population they belong to. I’ll give you most of the code for this one:
# parallel coordinates plot
pca_scores %>%
  as.data.frame() %>%
  mutate(population = as.factor(pcadata$population)) %>%
  ggparcoord(columns = 1:15, groupColumn = 'population', alpha = 0.2) +
  theme_minimal() +
  scale_color_brewer(palette = 'Dark2')
What do you learn from the parallel coordinates plot?
Loadings
Next, let’s investigate which SNPs have the largest contributions to the first PC by looking at the loadings. Create a data visualization that compares the loadings of each SNP for the first PC.
# option 1
pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC1)) +
  geom_point() +
  labs(x = 'SNP Number', y = 'Loadings for PC1') +
  theme_minimal()
# option 2
library(reshape2)
pca_loadings %>%
  melt() %>%
  filter(Var2 == "PC1") %>%
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  labs(y = "loadings", x = "original features", fill = "original features") +
  theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust = 1))
Which SNPs contribute the most, either positively or negatively, to the first PC? Why do you think this is?
Now, repeat this process for the second PC, third, and fourth PCs. What do you notice?
# option 1
l2 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC2)) +
  geom_point() +
  labs(x = 'SNP Number', y = 'Loadings for PC2') +
  theme_minimal()

l3 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC3)) +
  geom_point() +
  labs(x = 'SNP Number', y = 'Loadings for PC3') +
  theme_minimal()

l4 <- pca_loadings %>%
  as.data.frame() %>%
  mutate(index = seq_len(nrow(pca_loadings))) %>%
  ggplot(aes(x = index, y = PC4)) +
  geom_point() +
  labs(x = 'SNP Number', y = 'Loadings for PC4') +
  theme_minimal()
grid.arrange(l2, l3, l4, ncol = 1)
# option 2
pca_loadings[, 1:4] %>%
  melt() %>%
  ggplot(aes(x = Var1, y = value, fill = Var1)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Var2) +
  labs(y = "loadings", x = "original features", fill = "original features") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Variance Explained
As we’ve discussed in class, the principal components are ordered in terms of the amount of variance they explain in the original data. To see exactly what proportion of total variance each PC explains, we can create what’s known as a scree plot. The code chunk below calculates the proportion of variance explained.
# extract variance of each PC
pca_var <- (pca_out$sdev)^2

# calculate proportion of variance explained
total_var <- sum(pca_var)
pve <- pca_var/total_var
Your turn: create a data visualization to show how the proportion of variance explained compares across the PCs.
# option 1
pve %>%
  as.data.frame() %>%
  mutate(index = seq_len(length(pca_var))) %>%
  ggplot(aes(x = index, y = pve)) +
  geom_point() +
  geom_line() +
  labs(x = 'PC Number', y = 'Proportion of Variance Explained') +
  theme_minimal()
# option 2: stat 253 code
library(tidymodels)
pca_out %>%
  tidy(matrix = 'eigenvalues') %>%
  ggplot(aes(y = percent, x = PC)) +
  geom_point(size = 2) +
  geom_line() +
  labs(y = "% of variance explained")
# option 3: plot cumulative % variance explained
pca_out %>%
  tidy(matrix = "eigenvalues") %>%
  rbind(0) %>%
  ggplot(aes(y = cumulative, x = PC)) +
  geom_point(size = 2) +
  geom_line() +
  labs(y = "CUMULATIVE % of variance explained")
What do you learn from this plot?
BONUS: Another Example (Three Populations)
Suppose we have three populations instead of just two. How many PCs do we need in order to capture “ancestry” then?
Let’s simulate some data:
# simulate a genetic variant with freq. depending on which population the person belongs to
sim_data_onevar <- function(pop, maf){
  snp <- rbinom(n = length(pop), size = 2, p = maf[pop])
  return(snp)
}

# create dataset with three types of SNPs
popidx <- rep(1:3, each = 500)
maf1 <- c(0, 0.8, 0.7)   # type 1: non-existent in pop 1, similarly frequent in pops 2 and 3
maf2 <- c(0.3, 0.5, 0.6) # type 2: similar freq in pops 2 and 3, slightly less common in pop 1
maf3 <- c(0.1, 0.1, 0.7) # type 3: similar freq in pops 1 and 2, much more common in pop 3

set.seed(494)
snps1 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf1))   # create 5 (independent) snps of type 1
snps2 <- replicate(5, sim_data_onevar(pop = popidx, maf = maf2))   # create 5 (independent) snps of type 2
snps3 <- replicate(10, sim_data_onevar(pop = popidx, maf = maf3))  # create 10 (independent) snps of type 3

# put together into a data frame
pcadata3pop <- cbind(popidx, snps1, snps2, snps3) %>%
  as.data.frame()
names(pcadata3pop) <- c('population', paste0('SNP', 1:(ncol(pcadata3pop)-1)))
# check it out
dim(pcadata3pop)
[1] 1500 21
head(pcadata3pop)
population SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
1 1 0 0 0 0 0 0 0 1 1 0 0 1
2 1 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 1 0 2 0 0 0
4 1 0 0 0 0 0 2 2 1 0 0 0 1
5 1 0 0 0 0 0 1 1 0 1 1 1 0
6 1 0 0 0 0 0 0 0 1 0 1 0 0
SNP13 SNP14 SNP15 SNP16 SNP17 SNP18 SNP19 SNP20
1 0 0 0 0 0 0 1 0
2 0 1 0 0 1 0 0 0
3 0 0 1 0 0 2 0 0
4 0 1 0 0 0 0 0 0
5 0 0 0 1 0 0 1 0
6 0 0 0 0 0 1 1 2
pcadata3pop %>%
  count(population)
population n
1 1 500
2 2 500
3 3 500
This dataset has 20 SNPs and 500 people in each of three populations.
Let’s run PCA on these data:
# pull out genotypes
geno3pop <- pcadata3pop %>%
  select(-population)

# run PCA
pca_out_3pop <- prcomp(geno3pop, center = TRUE, scale = TRUE)

# pull out scores, loadings, and variance
scores3pop <- pca_out_3pop$x
loadings3pop <- pca_out_3pop$rotation
var3pop <- (pca_out_3pop$sdev)^2
Here’s what our scores look like:
scores3pop %>%
  as.data.frame() %>% # convert scores3pop into a data frame for plotting
  mutate(population = as.factor(pcadata3pop$population)) %>% # add the population labels
  ggplot(aes(x = PC1, y = PC2, color = population)) + # then plot
  geom_point() +
  scale_color_brewer(palette = 'Dark2')
Notice that the combination of PCs 1 and 2 can distinguish between the three populations.
And here’s the parallel coordinates plot:
# parallel coordinates plot
scores3pop %>%
  as.data.frame() %>% # convert scores3pop into a data frame for plotting
  mutate(population = as.factor(pcadata3pop$population)) %>% # add the population labels
  ggparcoord(columns = 1:20, groupColumn = 'population', alpha = 0.2) + # plot all 20 PCs
  theme_minimal() +
  scale_color_brewer(palette = 'Dark2')
From this, we see again that PCs 1 and 2 are useful for distinguishing the three populations.
And here’s the scree plot:
# calculate proportion of variance explained
pve3pop <- var3pop/sum(var3pop)

# scree plot
pve3pop %>%
  as.data.frame() %>%
  mutate(index = seq_len(length(pve3pop))) %>%
  ggplot(aes(x = index, y = pve3pop)) +
  geom_point() +
  geom_line() +
  labs(x = 'PC Number', y = 'Proportion of Variance Explained') +
  theme_minimal()
Now it looks like we need 2 PCs to capture population structure!
Impact of LD
In multiple journal club articles (Novembre et al. 2008; Burton et al. 2007) we’ve seen researchers remove SNPs that are highly correlated with one another (i.e., in high LD) prior to running PCA. Justification for this step is not always provided, but the Wellcome Trust Case Control Consortium GWAS notes that PCs may otherwise pick up “effects attributable to local linkage disequilibrium rather than genome-wide structure” (p. 663). Reducing the number of SNPs on which we have to run PCA also offers computational advantages.
First, let’s review what we know, in general, about what happens when we run PCA on a set of correlated features. Think back to STAT 253 (or review these notes, particularly Examples 1–5, if you haven’t taken it yet): what is often true about PCA and correlated features?
Next, explain how you could design a simulation study, or at least a “toy” data example like the one we’re using in this lab, to explore the impact of correlation (i.e., linkage disequilibrium) on PCA in the context of genetic data. How would you set up the data? How would you assess the impact of the correlation between SNPs in that data on PCA?
Optional Challenge: Implement the approach you described above. What did you learn?
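If you decide to try the challenge, here is a minimal sketch of one possible setup (not the only one, and not an "official" solution): append a block of near-duplicate copies of SNP3 to the toy genotypes to mimic a region of high LD, re-run PCA, and check which SNPs dominate the loadings and how the variance explained shifts. The number of copies (10), the 5% "noise" rate, and the 0.3 allele frequency used for the noise are all arbitrary choices.

# sketch: add a block of 10 SNPs in near-perfect LD with SNP3, then re-run PCA
set.seed(494)
n <- nrow(geno)
ld_block <- replicate(10, ifelse(runif(n) < 0.05, rbinom(n, 2, 0.3), geno$SNP3))
colnames(ld_block) <- paste0('SNP3_copy', 1:10)

# combine with the original genotypes and repeat the PCA from above
geno_ld <- bind_cols(geno, as.data.frame(ld_block))
pca_ld <- prcomp(geno_ld, center = TRUE, scale = TRUE)

# which SNPs have the largest PC1 loadings now? how has the scree plot changed?
head(sort(abs(pca_ld$rotation[,1]), decreasing = TRUE))
(pca_ld$sdev^2/sum(pca_ld$sdev^2))[1:5]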
Part 2: Impact on GWAS
Toy Data
Now, let’s investigate the impact of adjusting for PCs in GWAS models. First, conduct a genome-wide association study using marginal regression models that do not make any adjustment for population structure. In other words, use models of the form trait ~ snp
with no other covariates.
# empty vector to store p-values
pvals <- c()

# loop through the 15 SNPs
for(i in 1:15){
  dat <- pcadata %>% select(trait, paste0('SNP',i)) # pull out just the trait and SNP of interest
  mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is just SNP i in this case
  pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}
# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
ggplot(aes(y = -log10(p), x = SNP)) +
geom_point() +
theme_minimal() +
ggtitle('Unadjusted Analysis')
What do you notice? Which SNP(s) have the smallest p-value?
Next, conduct a genome-wide association study using models that adjust for the first PC (trait ~ snp + PC1
). How do the results compare to the unadjusted analysis?
# empty vector to store p-values
pvals <- c()

# loop through the 15 SNPs
for(i in 1:15){
  dat <- pcadata %>%
    select(trait, paste0('SNP',i)) %>% # pull out just the trait and SNP of interest
    mutate(PC1 = pca_scores[,1]) # add the scores for PC1
  mod <- lm(trait ~ ., data = dat) # regress trait on everything else (.), which is SNP i and PC1
  pvals[i] <- summary(mod)$coef[2,4] # pull out the p-value for the slope
}
# plot -log10 pvalues
data.frame(p = pvals, SNP = 1:15) %>%
ggplot(aes(y = -log10(p), x = SNP)) +
geom_point() +
theme_minimal() +
ggtitle('Adjusted Analysis')
The trait that we are using for these GWAS is simulated, so we know the “answers” here. In particular, I created this trait such that it depends on the genotype at SNP3 as well as the population that the person belongs to. Now that you know the “truth”, re-evaluate your results above. Which model is better? Why?
Confounding
The question of when we should include PCs — or some other measure of genetic similarity / geography / ancestry / population structure — in our GWAS models is, at its heart, a question about confounding.
STAT 155 Review: Suppose \(Y\) is our outcome variable and \(X\) is our predictor of interest. A third variable, \(Z\), confounds the relationship between \(X\) and \(Y\) if… (fill in the blank).
Using this definition as a guide, explain why population
membership is a confounding variable in the context of the toy data GWAS you implemented in the previous section.
Omitted Variable Bias
As you learned in STAT 155, if we believe a variable is a confounder, we should adjust for it (i.e., include it as a covariate) in our regression model. In the notation above, if I am interested in the relationship between \(X\) and \(Y\) but believe a third variable, \(Z\), is a confounding variable, then I should fit the regression model lm(y ~ x + z) rather than lm(y ~ x). If we don’t do this, our results will be biased.
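To see where that bias comes from, here is a sketch of the standard omitted variable bias derivation, assuming the true relationship is linear with a single omitted covariate. Suppose the true model is

\[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon, \]

but we instead fit the misspecified model \(Y = \gamma_0 + \gamma_1 X + \epsilon^*\). In large samples, the least squares slope from the misspecified model satisfies

\[ \hat{\gamma}_1 \rightarrow \beta_1 + \beta_2 \, \frac{\text{Cov}(X, Z)}{\text{Var}(X)}, \]

so the second term (the omitted variable bias) is nonzero exactly when \(Z\) affects the outcome (\(\beta_2 \neq 0\)) and is associated with the predictor (\(\text{Cov}(X, Z) \neq 0\)): the two conditions that define a confounder. In the toy GWAS, population membership affects the trait and is associated with the genotype at any SNP whose allele frequency differs across populations, so both conditions hold for the unadjusted models.

Here is a quick numerical check of this formula (a sketch only; the sample size, effect sizes, and allele frequencies below are arbitrary choices, not the values used to simulate the toy data):

# simulate a confounder z, a "SNP" x whose frequency depends on z, and a trait y
set.seed(494)
z <- rbinom(10000, 1, 0.5)                       # population indicator
x <- rbinom(10000, 2, ifelse(z == 1, 0.5, 0.3))  # genotype frequency depends on z
y <- 1*x + 2*z + rnorm(10000)                    # trait depends on both x and z

coef(lm(y ~ x))['x']          # unadjusted slope: biased away from 1
1 + 2*cov(x, z)/var(x)        # approximately matches beta1 + beta2*Cov(X,Z)/Var(X)
coef(lm(y ~ x + z))['x']      # adjusted slope: close to the true value of 1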
Part 3: Application to Real Genetic Data (Learn a New R Package!)
As we discussed in the case of implementing GWAS in Lab 2, in practice many statistical genetics researchers do not implement PCA “by hand” as we did above. Instead, they typically use one of the many R packages (or other software tools) that have been designed specifically for the task of implementing PCA in genetic data. One example — created by researchers at the University of Washington — is SNPRelate
. Let’s try it out!
First, install the SNPRelate
R package from Bioconductor:
# NOTE: keep the code chunk option eval: false so this gets skipped when rendering
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("SNPRelate")
Then load it:
library(SNPRelate)
Next, locate the “SNPRelate Tutorial” vignette. (Refer back to Lab 2 for a reminder on how to access a package’s vignettes.) Read, and replicate, the first two parts of the Data Analysis section: “LD-based SNP pruning” and “Principal Component Analysis”. Record your code and insights here. In particular…
# get list of vignettes
vignette(package = 'SNPRelate')
# view specific vignette
vignette("SNPRelate", package = "SNPRelate")
How many individuals are present in the provided example data? How many SNPs?
# Open the GDS file
genofile <- snpgdsOpen(snpgdsExampleFileName())
# look at file
genofile
File: /Users/kgrinde/Library/R/arm64/4.4/library/SNPRelate/extdata/hapmap_geno.gds (709.6K)
+ [ ] *
|--+ sample.id { VStr8 279 ZIP(29.9%), 679B }
|--+ snp.id { Int32 9088 ZIP(34.8%), 12.3K }
|--+ snp.rs.id { VStr8 9088 ZIP(40.1%), 36.2K }
|--+ snp.position { Int32 9088 ZIP(94.7%), 33.6K }
|--+ snp.chromosome { UInt8 9088 ZIP(0.94%), 85B } *
|--+ snp.allele { VStr8 9088 ZIP(11.3%), 4.0K }
|--+ genotype { Bit2 279x9088, 619.0K } *
\--+ sample.annot [ data.frame ] *
|--+ family.id { VStr8 279 ZIP(34.4%), 514B }
|--+ father.id { VStr8 279 ZIP(31.5%), 220B }
|--+ mother.id { VStr8 279 ZIP(30.9%), 214B }
|--+ sex { VStr8 279 ZIP(17.0%), 95B }
\--+ pop.group { VStr8 279 ZIP(6.18%), 69B }
What populations are present in these data? What do each of the population codes (e.g., “CEU”) stand for?
# Get population information
pop_code <- read.gdsn(index.gdsn(genofile,
                                 path = "sample.annot/pop.group"))
table(pop_code)
pop_code
CEU HCB JPT YRI
92 47 47 93
The snpgdsLDpruning
function performs a pre-processing step known as “LD pruning” which calculates the correlation (i.e., LD) between SNPs and removes one of the two SNPs from every pair of highly correlated SNPs. Run this function with a seed of 494. How many SNPs remain after LD pruning?
# NOTE: LD pruning can be time-intensive so adding cache: true may be helpful here
# run LD pruning with default LD threshold
set.seed(494)
snpset <- snpgdsLDpruning(genofile, ld.threshold = 0.2)
SNP pruning based on LD:
Excluding 365 SNPs on non-autosomes
Excluding 139 SNPs (monomorphic: TRUE, MAF: 0.005, missing rate: 0.05)
# of samples: 279
# of SNPs: 8,584
using 1 thread
sliding window: 500,000 basepairs, Inf SNPs
|LD| threshold: 0.2
method: composite
Chrom 1: |====================|====================|
74.16%, 531 / 716 (Mon Mar 24 19:09:53 2025)
Chrom 2: |====================|====================|
72.24%, 536 / 742 (Mon Mar 24 19:09:53 2025)
Chrom 3: |====================|====================|
73.73%, 449 / 609 (Mon Mar 24 19:09:53 2025)
Chrom 4: |====================|====================|
72.60%, 408 / 562 (Mon Mar 24 19:09:53 2025)
Chrom 5: |====================|====================|
75.62%, 428 / 566 (Mon Mar 24 19:09:53 2025)
Chrom 6: |====================|====================|
73.81%, 417 / 565 (Mon Mar 24 19:09:53 2025)
Chrom 7: |====================|====================|
75.42%, 356 / 472 (Mon Mar 24 19:09:53 2025)
Chrom 8: |====================|====================|
69.88%, 341 / 488 (Mon Mar 24 19:09:53 2025)
Chrom 9: |====================|====================|
76.44%, 318 / 416 (Mon Mar 24 19:09:53 2025)
Chrom 10: |====================|====================|
73.29%, 354 / 483 (Mon Mar 24 19:09:53 2025)
Chrom 11: |====================|====================|
75.84%, 339 / 447 (Mon Mar 24 19:09:53 2025)
Chrom 12: |====================|====================|
74.71%, 319 / 427 (Mon Mar 24 19:09:53 2025)
Chrom 13: |====================|====================|
75.87%, 261 / 344 (Mon Mar 24 19:09:53 2025)
Chrom 14: |====================|====================|
76.24%, 215 / 282 (Mon Mar 24 19:09:53 2025)
Chrom 15: |====================|====================|
75.57%, 198 / 262 (Mon Mar 24 19:09:53 2025)
Chrom 16: |====================|====================|
71.58%, 199 / 278 (Mon Mar 24 19:09:53 2025)
Chrom 17: |====================|====================|
73.43%, 152 / 207 (Mon Mar 24 19:09:53 2025)
Chrom 18: |====================|====================|
72.93%, 194 / 266 (Mon Mar 24 19:09:53 2025)
Chrom 19: |====================|====================|
82.50%, 99 / 120 (Mon Mar 24 19:09:53 2025)
Chrom 20: |====================|====================|
70.74%, 162 / 229 (Mon Mar 24 19:09:53 2025)
Chrom 21: |====================|====================|
75.40%, 95 / 126 (Mon Mar 24 19:09:53 2025)
Chrom 22: |====================|====================|
74.14%, 86 / 116 (Mon Mar 24 19:09:53 2025)
6,457 markers are selected in total.
# Get all selected snp id
snpset.id <- unlist(unname(snpset))
head(snpset.id)
[1] 1 2 4 5 7 10
length(snpset.id)
[1] 6457
Now, run PCA on this reduced set of “LD-pruned” SNPs.
# NOTE: I recommend using cache: true here again
# Run PCA
pca <- snpgdsPCA(genofile, snp.id = snpset.id, num.thread = 2)
Principal Component Analysis (PCA) on genotypes:
Excluding 2,631 SNPs (non-autosomes or non-selection)
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# of samples: 279
# of SNPs: 6,457
using 2 threads
# of principal components: 32
PCA: the sum of all selected genotypes (0,1,2) = 1808200
CPU capabilities:
Mon Mar 24 19:09:53 2025 (internal increment: 1760)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
Mon Mar 24 19:09:53 2025 Begin (eigenvalues and eigenvectors)
Mon Mar 24 19:09:53 2025 Done.
Finally, create the following plots:
- a score plot (with individuals colored by their “true” population label)
- parallel coordinates plot
- loadings plot (the tutorial presents an example looking at the correlation between eigenvectors and SNP genotypes, which is similar to looking at the loadings; there’s also a snpgdsPCASNPLoading function – see if you can figure out how to use it)
- scree plot
# score plot
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# Get population information
# or pop_code <- scan("pop.txt", what=character())
# if it is stored in a text file "pop.txt"
pop_code <- read.gdsn(index.gdsn(genofile, "sample.annot/pop.group"))

# assume the order of sample IDs is the same as the order of population codes
head(cbind(sample.id, pop_code))
sample.id pop_code
[1,] "NA19152" "YRI"
[2,] "NA19139" "YRI"
[3,] "NA18912" "YRI"
[4,] "NA19160" "YRI"
[5,] "NA07034" "CEU"
[6,] "NA07055" "CEU"
# Make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
                  pop = factor(pop_code)[match(pca$sample.id, sample.id)],
                  EV1 = pca$eigenvect[,1], # the first eigenvector
                  EV2 = pca$eigenvect[,2], # the second eigenvector
                  stringsAsFactors = FALSE)
head(tab)
sample.id pop EV1 EV2
1 NA19152 YRI -0.08264960 -0.009838342
2 NA19139 YRI -0.08349414 -0.010831381
3 NA18912 YRI -0.08265756 -0.010841330
4 NA19160 YRI -0.08786783 -0.012579223
5 NA07034 CEU 0.03159633 0.079261324
6 NA07055 CEU 0.03515841 0.081038435
plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("bottomright", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
# parallel coordinates plot
library(MASS)
datpop <- factor(pop_code)[match(pca$sample.id, sample.id)]
parcoord(pca$eigenvect[,1:16], col = datpop)
# loadings plot
# Get chromosome index
chr <- read.gdsn(index.gdsn(genofile, "snp.chromosome"))

# get loadings
LOAD <- snpgdsPCASNPLoading(pca, genofile)
SNP Loading:
# of samples: 279
# of SNPs: 6,457
using 1 thread
using the top 32 eigenvectors
SNP Loading: the sum of all selected genotypes (0,1,2) = 1808200
Tue Apr 8 20:38:53 2025 (internal increment: 14092)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
Tue Apr 8 20:38:53 2025 Done.
# plot loadings for first 2 PCs
savepar <- par(mfrow = c(2,1), mai = c(0.45, 0.55, 0.1, 0.25))
for (i in 1:2)
{
  plot(abs(LOAD$snploading[i,]), xlab = "", ylab = paste("PC", i),
       col = chr, pch = "+")
}
par(savepar)
# scree plot
# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
[1] 10.29 5.61 1.03 0.98 0.87 0.77
plot(pc.percent, type = 'b', xlim = c(0, 30))
What do you learn from these plots? Write down 1–2 sentences for each.