Lab 2: Implementing and visualizing marginal regression
STAT 494: Statistical Genetics
Goals
- Explore real genetic data
- Implement GWAS in R using marginal regression
- Visualize GWAS results in R using Manhattan plots (“by hand” and using open-source R packages)
- Learn how to use a new R package by reading its documentation
Preparation
To prepare for this assignment, please install (or update) the following packages:
tidyverse
broom
qqman
snpStats
The first three packages are all available on CRAN (the Comprehensive R Archive Network) so you can install them in the usual way:
install.packages(c('tidyverse', 'broom', 'qqman'))
The fourth package is available on Bioconductor, a site dedicated to sharing open-source software for analysis of biological data. Many R packages related to statistical genetics are posted on this site! The installation process is slightly different for Bioconductor packages. Read more about the snpStats
package and how to install it here: [link].
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("snpStats")
Please stop by office hours if you are running into any issues getting these packages installed before class!
Exercises
Part 0: Setup
Packages
Load all of the packages that you installed (or updated) above:
library(snpStats)
library(tidyverse)
library(broom)
library(qqman)
Data
To download the data that we’ll be using today, go to this GitHub page and download the 1_QC_GWAS.zip
file. Once you’ve downloaded that file successfully, you’ll need to unzip it. On most computers, double-clicking the zipped file will unzip it, but you can also do it from the command line if that doesn’t work.
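If double-clicking doesn't work and you'd rather stay in R, base R's unzip() function can also do this. Here's a minimal sketch — the paths below are placeholders, so adjust them to wherever you saved the download:
# unzip the downloaded file (paths are examples only; edit to match your computer)
unzip("1_QC_GWAS.zip", exdir = "1_QC_GWAS")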
The data you just downloaded include real genotype information and a simulated binary phenotype for 165 individuals of European ancestry from the International HapMap Project (“HapMap”). The data were processed by a group of researchers in the Netherlands and France for a tutorial they wrote on how to conduct a genome-wide association study. If you’re interested, you can read that tutorial here and find the corresponding code on GitHub. The authors decided to focus their tutorial on an “ethnically homogeneous dataset,” so although HapMap actually sampled individuals from a variety of populations across the world (see this list), the data that we’re using today only includes individuals from Utah with Northern and Western European ancestry (the “CEU” population).
As we saw in our Journal Club 2 article, for many years it was common practice for researchers to focus their analyses only on individuals of European ancestry. The creators of the GWAS tutorial from which we are pulling these data are following suit. We’ll discuss how to move away from this restrictive (and problematic, IMO) setting later in the semester, but for now we’ll use these data as provided.
The information that we need for our analysis today is split across three files: HapMap_3_r3_1.bed
, HapMap_3_r3_1.bim
, and HapMap_3_r3_1.fam
. Although there are many ways that genetic data are stored (and this is in and of itself an interesting research question given their large size), one of the most common file formats is known as the PLINK format. That’s what we see here. The .bed
file contains the genotypes, the .fam
file contains information on the individuals, and the .bim
file contains information on the SNPs.
PLINK
is a popular tool for analyzing genetic data and conducting GWAS. Learn more here. We’re going to be conducting our GWAS “by hand” so that we can understand the methods better, but in practice you’ll find that most people use PLINK
or other software programs instead. Learning to use PLINK could make for an interesting final project (especially if one of your learning goals is related to developing your computational skills)!
We can use the read.plink
function in the snpStats
package to read these data into R. First, update the strings below to reflect the correct path to the data on your computer. Here’s what it looks like for me:
# update file paths
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'
Now, you can run read.plink
:
# NOTE: this might take a minute or two
hapmap <- read.plink(bed, bim, fam)
Stop and Discuss. At the top of the previous code chunk, we specified the code chunk option cache: true
. What does this do, and why is it useful here? (Discuss at your table and use your resources!)
YOUR ANSWER HERE
Resources you used to answer this question:
Part 1: Data Exploration
The new hapmap
data object is a “list” of three items:
class(hapmap)
[1] "list"
names(hapmap)
[1] "genotypes" "fam" "map"
- genotypes contains the genotype data
- fam contains information on each individual
- map contains information on each SNP
Explore genotype data
Get info about the genotype data:
hapmap$genotypes
A SnpMatrix with 165 rows and 1457897 columns
Row names: NA06989 ... NA12865
Col names: rs2185539 ... rs1973881
How many individuals do we have in this dataset?
YOUR ANSWER HERE
How many SNPs?
YOUR ANSWER HERE
Explore sample information
Look at the info we have on each individual:
head(hapmap$fam)
pedigree member father mother sex affected
NA06989 1328 NA06989 <NA> <NA> 2 2
NA11891 1377 NA11891 <NA> <NA> 1 2
NA11843 1349 NA11843 <NA> <NA> 1 1
NA12341 1330 NA12341 <NA> <NA> 2 2
NA12739 1444 NA12739 NA12748 NA12749 1 NA
NA10850 1344 NA10850 <NA> NA12058 2 NA
What are each of these columns telling us? (Discuss at your table!)
- pedigree:
- member:
- father:
- mother:
- sex:
- affected:
Explore SNP information
We can also look at the info we have on each SNP:
head(hapmap$map)
chromosome snp.name cM position allele.1 allele.2
rs2185539 1 rs2185539 NA 556738 T C
rs11510103 1 rs11510103 NA 557616 G A
rs11240767 1 rs11240767 NA 718814 T C
rs3131972 1 rs3131972 NA 742584 A G
rs3131969 1 rs3131969 NA 744045 A G
rs1048488 1 rs1048488 NA 750775 C T
What are each of these columns telling us? (Discuss at your table!)
- chromosome:
- snp.name:
- cM:
- position:
- allele.1:
- allele.2:
Calculate MAFs
One useful piece of information about these SNPs that the map
data frame does not already contain is the minor allele frequency (MAF) — that is, the frequency of the minor allele in this dataset. Thankfully, the snpStats
package provides some helpful code for quickly calculating the MAF for each SNP in the dataset:
maf <- col.summary(hapmap$genotypes)$MAF
head(maf)
[1] 0.00000000 0.00621118 0.00000000 0.15757576 0.13030303 0.15853659
Let’s add the minor allele frequencies to our map
data frame. (This will come in handy later.)
# add new MAF variable
hapmap$map <- hapmap$map %>%
  mutate(MAF = maf)
# look at SNP info again
head(hapmap$map)
chromosome snp.name cM position allele.1 allele.2 MAF
rs2185539 1 rs2185539 NA 556738 T C 0.00000000
rs11510103 1 rs11510103 NA 557616 G A 0.00621118
rs11240767 1 rs11240767 NA 718814 T C 0.00000000
rs3131972 1 rs3131972 NA 742584 A G 0.15757576
rs3131969 1 rs3131969 NA 744045 A G 0.13030303
rs1048488 1 rs1048488 NA 750775 C T 0.15853659
What do you notice about the minor allele frequency of the first and third SNP in our dataset (rs2185539 and rs11240767)? What does this mean?
YOUR ANSWER HERE
Reformat data for analysis
The snpStats package uses a unique format to store data. Currently, genotypes are coded as 01, 02, and 03 (with 00 representing missing values).
# look at first five rows/columns
hapmap$genotypes@.Data[1:5, 1:5]
rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989 03 03 03 02 02
NA11891 03 03 03 02 03
NA11843 03 03 03 03 03
NA12341 03 03 03 02 02
NA12739 03 03 03 03 03
Let’s convert this to the 0, 1, 2 (and NA) format that we’ve talked about in class. Run the code chunk below.
# what type of object is hapmap$genotypes
class(hapmap$genotypes)
[1] "SnpMatrix"
attr(,"package")
[1] "snpStats"
# convert from "SnpMatrix" to "numeric"
X <- as(hapmap$genotypes, "numeric")
# what type of object do we have now
class(X)
[1] "matrix" "array"
# look at first five rows/columns to confirm conversion looks ok
X[1:5, 1:5]
rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989 2 2 2 1 1
NA11891 2 2 2 1 2
NA11843 2 2 2 2 2
NA12341 2 2 2 1 1
NA12739 2 2 2 2 2
You should now see a matrix of 0’s, 1’s, and 2’s.
(OPTIONAL) Data Wrangling Review. Now that we’ve converted the genotype data into a more familiar format, how might you use tidyverse
data wrangling tools from COMP/STAT 112 or STAT 155 to calculate the minor allele frequency of each SNP?
# your code here
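If you want to check your thinking, here is one possible sketch (one of many approaches). It assumes the 0/1/2 values in X count copies of one of the two alleles, so the per-SNP mean divided by 2 gives that allele's frequency; it also uses only the first 100 SNPs, since pivoting all ~1.4 million columns would be slow:
# pivot a subset of the genotype matrix to long format, compute each SNP's
# allele frequency, then fold it onto the minor allele
as.data.frame(X[, 1:100]) %>%
  pivot_longer(everything(), names_to = "snp", values_to = "genotype") %>%
  group_by(snp) %>%
  summarize(allele.freq = mean(genotype, na.rm = TRUE) / 2) %>%
  mutate(MAF = pmin(allele.freq, 1 - allele.freq))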
Part 2: Trait Simulation
Background
Simulation studies are a widely used technique in statistical research. The general idea is this…
We “simulate”, or artificially generate, data in a way where we know the “truth”. Then, we apply a statistical method to these simulated data to see how it performs. Since we created the data, we can compare the result we get from our statistical method to the “truth” and get a sense of how accurate the method is. This is a particularly useful framework when we want to evaluate a new method, understand how a method’s performance varies across different scenarios, or compare multiple methods against each other. It can also be useful in contexts, like statistical genetics, in which finding/working with “real” data can be challenging (due to data privacy concerns, logistical constraints, etc.).
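Here's a toy illustration of this idea (separate from our lab data): simulate a dataset where the "true" slope is 3, fit a linear model, and check how close the estimated slope comes to that truth.
# toy simulation study: the truth is a slope of 3
set.seed(1)
toy.x <- rnorm(100)                    # simulated predictor
toy.y <- 3 * toy.x + rnorm(100)        # truth: slope of 3, plus random noise
lm(toy.y ~ toy.x) %>% tidy()           # the estimated slope should land near 3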
Although the genotype data we are working with in this activity are real, the phenotype/trait (the “affected” column in the fam
sample information) is not. Let’s create our own!
Picking a causal SNP
Let’s simulate a trait that depends on the SNP known as rs2476601. Here’s what we know about this SNP:
hapmap$map %>%
  filter(snp.name == 'rs2476601')
chromosome snp.name cM position allele.1 allele.2 MAF
rs2476601 1 rs2476601 NA 114179091 A G 0.1196319
We can learn more about this SNP by searching the dbSNP
database: https://www.ncbi.nlm.nih.gov/snp/. Enter the rsID (the SNP’s name) into the dbSNP search bar and see what you can learn. Summarize your findings here.
YOUR ANSWER HERE
Simulating a quantitative trait
Now, let’s create a quantitative trait y
that depends on your genotype at this SNP plus some random noise. We’ll give this SNP an “effect size” of 2: for every additional copy of the minor allele that you carry, your trait value will increase by 2 units (plus or minus some random noise). I’ve set this up for you – run the code chunk below.
# simulate trait
set.seed(494) # set seed for reproducibility
n <- nrow(X) # calculate the number of individuals
y <- 2 * X[,'rs2476601'] + rnorm(n, 0, 1) # y = 2 * x + e
# look at the first few simulated trait values
head(y)
NA06989 NA11891 NA11843 NA12341 NA12739 NA10850
0.2082684 3.6267809 3.3422648 1.1696396 3.9450827 4.8350451
Since this is a simulation, we know by design that our trait y
depends on SNP rs2476601 and nothing else. We can use this information to check how well our methods are working. We’ll refer to rs2476601 as the causal SNP from here on out. (NOTE: causal, not casual)
Part 3: GWAS (Attempt #1)
In a real genetic study, we wouldn’t know the location of the true causal SNP(s) beforehand, so we’d need to run a genome-wide association study to see if we can figure out which variants are associated with the trait. As we’ve discussed in class, the method we’ll use for GWAS is marginal regression.
To start, let’s look at what happens when we run marginal regression on the first five SNPs in this dataset. Run this code chunk.
## fit models at first few SNPs
mod1 <- lm(y ~ X[,1])
mod2 <- lm(y ~ X[,2])
mod3 <- lm(y ~ X[,3])
mod4 <- lm(y ~ X[,4])
mod5 <- lm(y ~ X[,5])
## look at the model summaries
tidy(mod1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.50 0.100 35.0 2.13e-77
2 X[, 1] NA NA NA NA
tidy(mod2)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.80 1.27 2.99 0.00319
2 X[, 2] -0.137 0.636 -0.216 0.830
tidy(mod3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.50 0.100 35.0 2.13e-77
2 X[, 3] NA NA NA NA
tidy(mod4)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.15 0.346 12.0 4.12e-24
2 X[, 4] -0.385 0.197 -1.95 5.27e- 2
tidy(mod5)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.75 0.376 9.97 1.58e-18
2 X[, 5] -0.141 0.209 -0.676 5.00e- 1
Do you notice anything weird about the 1st and 3rd SNPs? What’s going on here?
YOUR ANSWER HERE
Part 4: Quality Control
We’re not going to be able to detect an association between our trait and any of the SNPs that have a MAF of zero (also known as monomorphic SNPs), so let’s remove them. We can remove these SNPs from the map
data frame with a filter
statement: fill in the blanks (___) below and then run this code chunk.
# keep only those SNPs with MAF > 0
map.clean <- hapmap$map %>%
  filter(___)
How many SNPs are left after filtering? How many did we remove?
Remaining SNPs: YOUR ANSWER HERE
Removed SNPs: YOUR ANSWER HERE
Now, let’s remove these monomorphic SNPs from our genotype matrix (X
). Since we’re working with a matrix here instead of a data frame, we can’t use the select
function. Here’s one way to remove the columns corresponding to the monomorphic SNPs from X
: (run this)
# create vector of which SNPs have a MAF of 0
monomorphic <- which(maf == 0)
head(monomorphic)
[1] 1 3 10 11 12 17
# remove columns in the monomorphic vector
X.clean <- X[,-monomorphic]
Confirm that the new “clean” genotype matrix has the correct number of rows and columns before you move on.
# your code here
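Here's one quick way to check (a sketch): the cleaned matrix should still have the same number of rows as X, and its number of columns should equal the original column count minus the number of monomorphic SNPs.
# compare dimensions of X.clean to what we expect
dim(X.clean)
c(nrow(X), ncol(X) - length(monomorphic))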
As we saw in our Journal Club 2 paper, there are various pre-processing/“data cleaning” steps that researchers perform prior to implementing a genome-wide association study. We refer to these steps as quality control (QC for short). If you’re curious what QC usually entails, take a look at the other files you downloaded from the GWAS tutorial GitHub repository (particularly 1_Main_script_QC_GWAS.txt
) and/or skim this article:
Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, Gabriel SB, Harris EL, Hu FB, Jacobs KB, Kraft P, Landi MT, Lumley T, Manolio TA, McHugh C, Painter I, Paschall J, Rice JP, Rice KM, Zheng X, Weir BS; GENEVA Investigators. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010 Sep;34(6):591-602. doi: 10.1002/gepi.20516. PMID: 20718045; PMCID: PMC3061487.
To keep things simple for today, we're going to restrict our QC to removing monomorphic SNPs. But be aware that there are additional QC steps that would normally be conducted before GWAS.
Part 5: GWAS (Attempt #2)
Chromosome 1
Even after removing the monomorphic SNPs, we still have 1283751 variants remaining. This might take a while to analyze in R, so let's focus on just the SNPs on the first chromosome to start.
Run the code chunk below to make a list of which SNPs live on chromosome 1:
chr1.snps <- which(map.clean$chromosome == 1)
head(chr1.snps)
length(chr1.snps)
Now, we’re going to loop through each of the SNPs on chromosome 1, fitting a linear regression model at each SNP. For each model, we’ll record the estimates (betas
), standard errors (ses
), test statistics (tstats
) and p-values (pvals
) for the coefficient of interest (the slope). (run this)
# NOTE: this will take awhile to run
# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()
# loop through chromosome 1 SNPs
for(i in chr1.snps){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i))
  # fit model
  mod <- lm(y ~ X.clean[,i])
  # get coefficient information
  coefinfo <- tidy(mod)
  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}
Let’s add our results to our map data frame that contains information about each SNP. (run this)
# start with the map info for the chr 1 SNPs
chr1.results <- map.clean %>%
  filter(chromosome == 1)
# then add betas, SEs, etc.
chr1.results <- chr1.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)
# look at results
head(chr1.results)
Now we’re ready to plot our results! We’d like to get a visual of which SNPs have small p-values (suggesting a possible association between the SNP and the trait). To start, create a scatterplot of the p-value of each SNP versus its position along the chromosome.
# your code here
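In case it's helpful, here is one possible sketch, assuming your chromosome 1 results live in chr1.results with the position and P.Value columns created above:
# scatterplot of each SNP's p-value vs. its position along chromosome 1
chr1.results %>%
  ggplot(aes(x = position, y = P.Value)) +
  geom_point()
For the next two exercises, you could keep this structure and just swap y = P.Value for y = log10(P.Value), and then for -log10(P.Value).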
What do you think of this visualization? Is it an effective tool for distinguishing which SNPs have small p-values?
YOUR ANSWER HERE
One way to improve this visualization is to apply a log transformation to the p-values. When we have lots of numbers squished into a small range (0 to 1, in this case), a log transformation can be a nice way to spread these out. Try plotting the log (base 10) of the p-values for each SNP instead.
# your code here
What do you notice now? Are the SNPs with small p-values the ones with small (more negative) values or large (closer to 0) values on the log base 10 scale?
YOUR ANSWER HERE
A traditional Manhattan plot uses a negative log transformation instead. Try plotting the \(-\log_{10}\) p-values now.
# your code here
What similarities do you notice between this plot and the Manhattan plots we’ve seen before? What differences do you notice? Discuss.
YOUR ANSWER HERE
All Chromosomes (Genome-Wide)
Repeat the analysis above, but now looking at the SNPs on all chromosomes. Warning: this will take a while to run!
Hint: The main thing you’ll need to change is which SNPs you’re looping over in your for loop.
# your code here
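If you get stuck, here is a sketch of one approach. It mirrors the chromosome 1 loop above, just looping over every column of X.clean; the object names (all.betas, all.results, etc.) are my own choices, not required names.
# set up empty vectors, then loop over ALL SNPs in X.clean
all.betas <- c(); all.ses <- c(); all.tstats <- c(); all.pvals <- c()
for(i in 1:ncol(X.clean)){
  if(i %% 100000 == 0) print(paste('Analyzing SNP', i))
  mod <- lm(y ~ X.clean[,i])
  coefinfo <- tidy(mod)
  all.betas[i] <- coefinfo$estimate[2]
  all.ses[i] <- coefinfo$std.error[2]
  all.tstats[i] <- coefinfo$statistic[2]
  all.pvals[i] <- coefinfo$p.value[2]
}

# attach the results to the SNP info for all (non-monomorphic) SNPs, as before
all.results <- map.clean %>%
  mutate(Estimate = all.betas,
         Std.Error = all.ses,
         Test.Statistic = all.tstats,
         P.Value = all.pvals)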
Part 6: Visualizing Results
Let’s use the qqman
package to create a Manhattan plot to visualize our genome-wide association results.
Basic Manhattan Plot
We’ll use the manhattan
function in this package to create a Manhattan plot. I’ve given you the basic structure of this command below. Fill in the blanks.
manhattan(___, # name of the dataset storing your results
chr = "___", # name of the column in your dataset with the chromosome number
bp = "___", # name of column storing the physical (base pair) position of the SNP
p = "___", # name of the column storing the association test p-values
snp = "___", # name of the column storing the SNP names (rs###)
ylim = c(0, 10) # sets the y-axis limits
)
Do you notice any peaks on this Manhattan plot? Where are they located? Does that make sense, given what we know about the “truth” in this simulated scenario? Discuss.
YOUR ANSWER HERE
Going Deeper
The plot we created above is pretty basic. I want you to customize this a bit further! Your task: explore the qqman
documentation to learn more about how the manhattan
function works and the ways in which you can customize your visualization.
When you’re learning how to use a new R package, one helpful resource can be the package vignettes. Run the code chunk below to see a list of all vignettes available for the qqman
package:
# view list of vignettes
vignette(package = 'qqman')
To read a particular vignette, fill in the blanks in the code chunk below with the name of the vignette:
# view specific vignette
vignette("___", package = "qqman")
After reading this vignette, make at least one modification to the Manhattan plot code you ran above:
# your code here
Describe the change that you made and anything you learned from your updated visualization here:
YOUR ANSWER HERE
(OPTIONAL): Take this customization challenge one step further. Try to recreate the Manhattan plot above using ggplot
!
Part 7 (OPTIONAL): QQ Plots
We saw the “man” part of the qqman
package above (creating manhattan plots). The “qq” part of the qqman
package refers to its ability to create Q-Q plots, a common diagnostic tool for genome-wide association studies.
Explore the qqman
documentation to learn how to create a QQ plot.
Then, make one using the results of our genome-wide analysis.
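For instance, the qq() function in qqman takes a vector of p-values. A minimal sketch, assuming your genome-wide results are stored in a data frame called all.results with a P.Value column (as in the sketch from Part 5):
# Q-Q plot of the genome-wide association p-values
qq(all.results$P.Value)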