Lab 2: Implementing and visualizing marginal regression
STAT 494: Statistical Genetics
Goals
- Explore real genetic data
- Implement GWAS in R using marginal regression
- Visualize GWAS results in R using Manhattan plots (“by hand” and using open-source R packages)
- Learn how to use a new R package by reading its documentation
Preparation
To prepare for this assignment, please install (or update) the following packages:
- tidyverse
- broom
- qqman
- snpStats
The first three packages are all available on CRAN (the Comprehensive R Archive Network) so you can install them in the usual way:
install.packages(c('tidyverse', 'broom', 'qqman'))
The fourth package is available on Bioconductor, a site dedicated to sharing open-source software for analysis of biological data. Many R packages related to statistical genetics are posted on this site! The installation process is slightly different for Bioconductor packages. Read more about the snpStats package and how to install it here: [link].
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("snpStats")
Please stop by office hours if you are running into any issues getting these packages installed before class!
Exercises
Part 0: Setup
Packages
Load all of the packages that you installed (or updated) above:
library(snpStats)
library(tidyverse)
library(broom)
library(qqman)
Data
To download the data that we’ll be using today, go to this GitHub page and download the 1_QC_GWAS.zip
file. Once you’ve downloaded that file successfully, you’ll need to unzip it. On most computers, double-clicking the zipped file will unzip it, but you can also do it from the command line if that doesn’t work.
The data you just downloaded include real genotype information and a simulated binary phenotype for 165 individuals of European ancestry from the International HapMap Project (“HapMap”). The data were processed by a group of researchers in the Netherlands and France for a tutorial they wrote on how to conduct a genome-wide association study. If you’re interested, you can read that tutorial here and find the corresponding code on GitHub. The authors decided to focus their tutorial on an “ethnically homogeneous dataset,” so although HapMap actually sampled individuals from a variety of populations across the world (see this list), the data that we’re using today only includes individuals from Utah with Northern and Western European ancestry (the “CEU” population).
As we saw in our Journal Club 2 article, for many years it was common practice for researchers to focus their analyses only on individuals of European ancestry. The creators of the GWAS tutorial from which we are pulling these data are following suit. We’ll discuss how to move away from this restrictive (and problematic, IMO) setting later in the semester, but for now we’ll use these data as provided.
The information that we need for our analysis today is split across three files: HapMap_3_r3_1.bed, HapMap_3_r3_1.bim, and HapMap_3_r3_1.fam. Although there are many ways that genetic data are stored (and this is in and of itself an interesting research question given their large size), one of the most common file formats is known as the PLINK format. That’s what we see here. The .bed file contains the genotypes, the .fam file contains information on the individuals, and the .bim file contains information on the SNPs.
PLINK is a popular tool for analyzing genetic data and conducting GWAS. Learn more here. We’re going to be conducting our GWAS “by hand” so that we can understand the methods better, but in practice you’ll find that most people use PLINK or other software programs instead. Learning to use PLINK could make for an interesting final project (especially if one of your learning goals is related to developing your computational skills)!
We can use the read.plink function in the snpStats package to read these data into R. First, update the strings below to reflect the correct path to the data on your computer. Here’s what it looks like for me:
# update file paths
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'

# if you want to save yourself some typing, here's another option
gwas.dir <- '../data/1_QC_GWAS/HapMap_3_r3_1'
bed <- paste0(gwas.dir, '.bed')
bim <- paste0(gwas.dir, '.bim')
fam <- paste0(gwas.dir, '.fam')
Now, you can run read.plink:
# NOTE: this might take a minute or two
hapmap <- read.plink(bed, bim, fam)
Stop and Discuss. At the top of the previous code chunk, we specified the code chunk option cache: true. What does this do, and why is it useful here? (Discuss at your table and use your resources!)
Whenever you have a time-consuming code chunk in your Quarto document (like this one), rendering can be a pain. In these cases, it can be helpful to set the code chunk option cache: true. Once you’ve rendered the file once, all future renders will skip this chunk (provided none of the code in the chunk has changed) and instead just load in the results from the last run. This can be a huge time-saver!
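For reference, here’s roughly what that looks like in a Quarto document, using the #| comment syntax for chunk options (a sketch; the chunk contents below are just an example):

#| cache: true

# time-consuming code goes here, e.g., reading in the PLINK files
hapmap <- read.plink(bed, bim, fam)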
Part 1: Data Exploration
The new hapmap data object is a “list” of three items:
class(hapmap)
[1] "list"
names(hapmap)
[1] "genotypes" "fam" "map"
- genotypes contains the genotype data
- fam contains information on each individual
- map contains information on each SNP
Explore genotype data
Get info about the genotype data:
hapmap$genotypes
A SnpMatrix with 165 rows and 1457897 columns
Row names: NA06989 ... NA12865
Col names: rs2185539 ... rs1973881
How many individuals do we have in this dataset?
165
How many SNPs?
1,457,897
Explore sample information
Look at the info we have on each individual:
head(hapmap$fam)
pedigree member father mother sex affected
NA06989 1328 NA06989 <NA> <NA> 2 2
NA11891 1377 NA11891 <NA> <NA> 1 2
NA11843 1349 NA11843 <NA> <NA> 1 1
NA12341 1330 NA12341 <NA> <NA> 2 2
NA12739 1444 NA12739 NA12748 NA12749 1 NA
NA10850 1344 NA10850 <NA> NA12058 2 NA
What are each of these columns telling us? (Discuss at your table!)
The fam file keeps track of information about each of the people in our study. In this case, we have the following:
- pedigree: the family ID number, i.e., the ID of the family, or “pedigree”, to which this individual belongs
- member: the individual ID number (same as the row name)
- father: the ID of the individual’s father, if known and present in this sample (NA otherwise)
- mother: the ID of the individual’s mother, if known and present in this sample (NA otherwise)
- sex: binary sex (1 = male, 2 = female, 0 = unknown)
- affected: a binary simulated phenotype (1 = control, 2 = case)
Explore SNP information
We can also look at the info we have on each SNP:
head(hapmap$map)
chromosome snp.name cM position allele.1 allele.2
rs2185539 1 rs2185539 NA 556738 T C
rs11510103 1 rs11510103 NA 557616 G A
rs11240767 1 rs11240767 NA 718814 T C
rs3131972 1 rs3131972 NA 742584 A G
rs3131969 1 rs3131969 NA 744045 A G
rs1048488 1 rs1048488 NA 750775 C T
What are each of these columns telling us? (Discuss at your table!)
- chromosome: the chromosome on which the SNP resides
- snp.name: the name of the SNP (commonly referred to as the “rsID”)
- cM: the abbreviation cM stands for centiMorgans; this is a unit of distance that we use when talking about how far apart SNPs are from one another along the genome; we refer to this type of distance as “genetic distance”
- position: this tells us the base pair position of the SNP (e.g., position = 1 would be the very first nucleotide in our DNA sequence); we refer to this as “physical (base pair) distance”
- allele.1: this is one of the alleles at this SNP (according to the PLINK documentation, this is usually the minor allele)
- allele.2: this is the other allele at this SNP (according to the PLINK documentation, this is usually the major allele)
Calculate MAFs
One useful piece of information about these SNPs that the map data frame does not already contain is the minor allele frequency (MAF) — that is, the frequency of the minor allele in this dataset. Thankfully, the snpStats package provides some helpful code for quickly calculating the MAF for each SNP in the dataset:
maf <- col.summary(hapmap$genotypes)$MAF
head(maf)
[1] 0.00000000 0.00621118 0.00000000 0.15757576 0.13030303 0.15853659
Let’s add the minor allele frequencies to our map data frame. (This will come in handy later.)
# add new MAF variable
hapmap$map <- hapmap$map %>%
  mutate(MAF = maf)

# look at SNP info again
head(hapmap$map)
chromosome snp.name cM position allele.1 allele.2 MAF
rs2185539 1 rs2185539 NA 556738 T C 0.00000000
rs11510103 1 rs11510103 NA 557616 G A 0.00621118
rs11240767 1 rs11240767 NA 718814 T C 0.00000000
rs3131972 1 rs3131972 NA 742584 A G 0.15757576
rs3131969 1 rs3131969 NA 744045 A G 0.13030303
rs1048488 1 rs1048488 NA 750775 C T 0.15853659
What do you notice about the minor allele frequency of the first and third SNP in our dataset (rs2185539 and rs11240767)? What does this mean?
These SNPs have a MAF of zero! This means that everyone in the dataset has the same genotype at these positions. In the case of the first SNP, rs2185539, everyone is either TT or CC; likewise for rs11240767. We refer to these SNPs as monomorphic (“mono” meaning “one”). Note that seeing a SNP with a MAF of zero in our dataset does not necessarily mean that the population MAF is likewise zero. In other words, there could be people, not in our sample, who do carry the other allele. We just don’t happen to see that in this sample.
Reformat data for analysis
The snpStats package uses a unique format to store data. Currently, genotypes are coded as 01, 02, and 03 (with 00 representing missing values).
# look at first five rows/columns
hapmap$genotypes@.Data[1:5, 1:5]
rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989 03 03 03 02 02
NA11891 03 03 03 02 03
NA11843 03 03 03 03 03
NA12341 03 03 03 02 02
NA12739 03 03 03 03 03
Let’s convert this to the 0, 1, 2 (and NA) format that we’ve talked about in class. Run the code chunk below.
# what type of object is hapmap$genotypes
class(hapmap$genotypes)
[1] "SnpMatrix"
attr(,"package")
[1] "snpStats"
# convert from "SnpMatrix" to "numeric"
X <- as(hapmap$genotypes, "numeric")
# what type of object do we have now
class(X)
[1] "matrix" "array"
# look at first five rows/columns to confirm conversion looks ok
X[1:5, 1:5]
rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989 2 2 2 1 1
NA11891 2 2 2 1 2
NA11843 2 2 2 2 2
NA12341 2 2 2 1 1
NA12739 2 2 2 2 2
You should now see a matrix of 0’s, 1’s, and 2’s.
(OPTIONAL) Data Wrangling Review. Now that we’ve converted the genotype data into a more familiar format, how might you use tidyverse data wrangling tools from COMP/STAT 112 or STAT 155 to calculate the minor allele frequency of each SNP?
# your code here
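One possible approach (a sketch, not the only way; shown on just the first 1,000 SNPs to keep it quick, since pivoting all 1.4 million columns would be slow): the mean of a 0/1/2 genotype column divided by two gives the frequency of the counted allele, and the MAF is the smaller of that frequency and its complement.

as_tibble(X[, 1:1000], rownames = 'id') %>%                            # slice of the genotype matrix as a tibble
  pivot_longer(-id, names_to = 'snp.name', values_to = 'genotype') %>% # one row per person-SNP pair
  group_by(snp.name) %>%
  summarize(allele.freq = mean(genotype, na.rm = TRUE) / 2) %>%        # frequency of the counted allele
  mutate(MAF = pmin(allele.freq, 1 - allele.freq)) %>%                 # MAF = smaller of the two allele frequencies
  head()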
Part 2: Trait Simulation
Background
Simulation studies are a widely used technique in statistical research. The general idea is this…
We “simulate”, or artificially generate, data in a way where we know the “truth”. Then, we apply a statistical method to these simulated data to see how it performs. Since we created the data, we can compare the result we get from our statistical method to the “truth” and get a sense of how accurate the method is. This is a particularly useful framework when we want to evaluate a new method, understand how a method’s performance varies across different scenarios, or compare multiple methods against each other. It can also be useful in contexts, like statistical genetics, in which finding/working with “real” data can be challenging (due to data privacy concerns, logistical constraints, etc.).
Although the genotype data we are working with in this activity are real, the phenotype/trait (the “affected” column in the fam sample information) is not. Let’s create our own!
Picking a causal SNP
Let’s simulate a trait that depends on the SNP known as rs2476601. Here’s what we know about this SNP:
hapmap$map %>%
  filter(snp.name == 'rs2476601')
chromosome snp.name cM position allele.1 allele.2 MAF
rs2476601 1 rs2476601 NA 114179091 A G 0.1196319
We can learn more about this SNP by searching the dbSNP database: https://www.ncbi.nlm.nih.gov/snp/. Enter the rsID (the SNP’s name) into the dbSNP search bar and see what you can learn. Summarize your findings here.
There is a lot of information about this SNP on dbSNP. A few things that stood out to me:
- This SNP is located on chromosome 1.
- It is affiliated with (located in/near) two genes: PTPN22 and AP4B1-AS1
- Although there is some variation in allele frequencies across populations, in all of them the G allele is much more common (frequency > 90%) than the A allele. This is consistent with what we see in our data (MAF = 0.1196319).
- It has been implicated as a potential risk factor for diabetes, rheumatoid arthritis, and other (mostly autoimmune) diseases.
- It was mentioned in the Wellcome Trust GWAS paper that we read for Journal Club :)
Simulating a quantitative trait
Now, let’s create a quantitative trait y that depends on your genotype at this SNP plus some random noise. We’ll give this SNP an “effect size” of 2: for every additional copy of the minor allele that you carry, your trait value will increase by 2 units (plus or minus some random noise). I’ve set this up for you – run the code chunk below.
# simulate trait
set.seed(494) # set seed for reproducibility
n <- nrow(X) # calculate the number of individuals
y <- 2 * X[,'rs2476601'] + rnorm(n, 0, 1) # y = 2 * x + e
# look at the first few simulated trait values
head(y)
NA06989 NA11891 NA11843 NA12341 NA12739 NA10850
0.2082684 3.6267809 3.3422648 1.1696396 3.9450827 4.8350451
Since this is a simulation, we know by design that our trait y depends on SNP rs2476601 and nothing else. We can use this information to check how well our methods are working. We’ll refer to rs2476601 as the causal SNP from here on out. (NOTE: causal, not casual)
Nothing to write here! Just run the code above.
Part 3: GWAS (Attempt #1)
In a real genetic study, we wouldn’t know the location of the true causal SNP(s) beforehand, so we’d need to run a genome-wide association study to see if we can figure out which variants are associated with the trait. As we’ve discussed in class, the method we’ll use for GWAS is marginal regression.
To start, let’s look at what happens when we run marginal regression on the first five SNPs in this dataset. Run this code chunk.
## fit models at first few SNPs
mod1 <- lm(y ~ X[,1])
mod2 <- lm(y ~ X[,2])
mod3 <- lm(y ~ X[,3])
mod4 <- lm(y ~ X[,4])
mod5 <- lm(y ~ X[,5])
## look at the model summaries
tidy(mod1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.50 0.100 35.0 2.13e-77
2 X[, 1] NA NA NA NA
tidy(mod2)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.80 1.27 2.99 0.00319
2 X[, 2] -0.137 0.636 -0.216 0.830
tidy(mod3)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.50 0.100 35.0 2.13e-77
2 X[, 3] NA NA NA NA
tidy(mod4)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.15 0.346 12.0 4.12e-24
2 X[, 4] -0.385 0.197 -1.95 5.27e- 2
tidy(mod5)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.75 0.376 9.97 1.58e-18
2 X[, 5] -0.141 0.209 -0.676 5.00e- 1
Do you notice anything weird about the 1st and 3rd SNPs? What’s going on here?
Yes, we get NAs for the estimate, standard error, test statistics, and p-value for the slope of these two models. As we saw above, both of these SNPs are monomorphic: everyone in the dataset has the same genotype. Without variation in the genotypes that people carry at these positions, we can’t estimate the association between these SNPs and the trait.
(As in Lab 1, we have a situation of linear dependence here: the column of minor allele counts is a linear combination of the intercept column. Since our design matrix isn’t full rank, our least squares estimator is not defined.)
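To see this concretely, here’s a quick check (a sketch) confirming that these two columns have no variation:

# variance of the genotypes at the 1st and 3rd SNPs (should be 0)
var(X[, 1], na.rm = TRUE)
var(X[, 3], na.rm = TRUE)

# everyone has the same genotype code at rs2185539
table(X[, 1])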
Part 4: Quality Control
We’re not going to be able to detect an association between our trait and any of the SNPs that have a MAF of zero (also known as monomorphic SNPs), so let’s remove them. We can remove these SNPs from the map data frame with a filter statement: fill in the blanks (___) below and then run this code chunk.
# keep only those SNPs with MAF > 0
map.clean <- hapmap$map %>%
  filter(maf > 0)
How many SNPs are left after filtering? How many did we remove?
# check dimensions after filtering
dim(map.clean)
[1] 1283751 7
# compare to original dimensions
dim(hapmap$map)
[1] 1457897 7
Remaining SNPs: 1,283,751
Removed SNPs: 174,146 (all monomorphic)
NOTE: Whenever we do any data cleaning, it’s important to keep track of our steps and report how many cases or variables were impacted!
Now, let’s remove these monomorphic SNPs from our genotype matrix (X). Since we’re working with a matrix here instead of a data frame, we can’t use the select function. Here’s one way to remove the columns corresponding to the monomorphic SNPs from X: (run this)
# create vector of which SNPs have a MAF of 0
monomorphic <- which(maf == 0)
head(monomorphic)
[1] 1 3 10 11 12 17
# remove columns in the monomorphic vector
X.clean <- X[,-monomorphic]
Confirm that the new “clean” genotype matrix has the correct number of rows and columns before you move on.
dim(X.clean)
[1] 165 1283751
You should have 165 rows and 1283751 columns in this new genotype matrix. We didn’t remove any people, but we did remove some SNPs.
As we saw in our Journal Club 2 paper, there are various pre-processing/“data cleaning” steps that researchers perform prior to implementing a genome-wide association study. We refer to these steps as quality control (QC for short). If you’re curious what QC usually entails, take a look at the other files you downloaded from the GWAS tutorial GitHub repository (particularly 1_Main_script_QC_GWAS.txt) and/or skim this article:
Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, Gabriel SB, Harris EL, Hu FB, Jacobs KB, Kraft P, Landi MT, Lumley T, Manolio TA, McHugh C, Painter I, Paschall J, Rice JP, Rice KM, Zheng X, Weir BS; GENEVA Investigators. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010 Sep;34(6):591-602. doi: 10.1002/gepi.20516. PMID: 20718045; PMCID: PMC3061487.
To keep things simple for today, we’re going to restrict our QC to removing monomorphic SNPs. But, be aware that there are additional QC steps that would normally be conducted before GWAS.
Part 5: GWAS (Attempt #2)
Chromosome 1
Even after removing the monomorphic SNPs, we still have 1283751 variants remaining. This might take awhile to analyze in R, so let’s focus on just the SNPs on the first chromosome to start.
Run the code chunk below to make a list of which SNPs live on chromosome 1:
chr1.snps <- which(map.clean$chromosome == 1)
head(chr1.snps)
[1] 1 2 3 4 5 6
length(chr1.snps)
[1] 103646
Now, we’re going to loop through each of the SNPs on chromosome 1, fitting a linear regression model at each SNP. For each model, we’ll record the estimates (betas), standard errors (ses), test statistics (tstats), and p-values (pvals) for the coefficient of interest (the slope). (run this)
NOTE: the code chunk below takes awhile to run, so I set the cache code chunk option to true!
# NOTE: this will take awhile to run
# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()
# loop through chromosome 1 SNPs
for(i in chr1.snps){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i))

  # fit model
  mod <- lm(y ~ X.clean[,i])

  # get coefficient information
  coefinfo <- tidy(mod)

  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}
[1] "Analyzing SNP 10000"
[1] "Analyzing SNP 20000"
[1] "Analyzing SNP 30000"
[1] "Analyzing SNP 40000"
[1] "Analyzing SNP 50000"
[1] "Analyzing SNP 60000"
[1] "Analyzing SNP 70000"
[1] "Analyzing SNP 80000"
[1] "Analyzing SNP 90000"
[1] "Analyzing SNP 100000"
Let’s add our results to our map data frame that contains information about each SNP. (run this)
# start with the map info for the chr 1 SNPs
chr1.results <- map.clean %>%
  filter(chromosome == 1)

# then add betas, SEs, etc.
chr1.results <- chr1.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)
# look at results
head(chr1.results)
chromosome snp.name cM position allele.1 allele.2 MAF
rs11510103 1 rs11510103 NA 557616 G A 0.00621118
rs3131972 1 rs3131972 NA 742584 A G 0.15757576
rs3131969 1 rs3131969 NA 744045 A G 0.13030303
rs1048488 1 rs1048488 NA 750775 C T 0.15853659
rs12562034 1 rs12562034 NA 758311 A G 0.10909091
rs12124819 1 rs12124819 NA 766409 G A 0.28658537
Estimate Std.Error Test.Statistic P.Value
rs11510103 -0.13716883 0.6359358 -0.2156960 0.82950466
rs3131972 -0.38450101 0.1969700 -1.9520787 0.05266367
rs3131969 -0.14108060 0.2086905 -0.6760278 0.49999293
rs1048488 -0.37371908 0.1970853 -1.8962297 0.05973218
rs12562034 0.06013127 0.2215724 0.2713843 0.78644313
rs12124819 -0.22912006 0.1498249 -1.5292522 0.12817656
Now we’re ready to plot our results! We’d like to get a visual of which SNPs have small p-values (suggesting a possible association between the SNP and the trait). To start, create a scatterplot of the p-value of each SNP versus its position along the chromosome.
chr1.results %>%
  ggplot(aes(x = position, y = P.Value)) +
  geom_point() +
  labs(x = 'position (bp)', y = 'p-value') + # update axis titles
  scale_x_continuous(labels = scales::comma) # reformat x-axis labels
What do you think of this visualization? Is it an effective tool for distinguishing which SNPs have small p-values?
Although this seems like a reasonable place to start in terms of plotting results, the result is a mess! It’s really hard to see which SNPs have the smallest p-values.
One way to improve this visualization is to apply a log transformation to the p-values. When we have lots of numbers squished into a small range (0 to 1, in this case), a log transformation can be a nice way to spread these out. Try plotting the log (base 10) of the p-values for each SNP instead.
chr1.results %>%
  mutate(logp = log10(P.Value)) %>% # log transformation
  ggplot(aes(x = position, y = logp)) +
  geom_point() +
  labs(x = 'position (bp)', y = expression(paste('log'[10],'(p-value)'))) +
  scale_x_continuous(labels = scales::comma)
What do you notice now? Are the SNPs with small p-values the ones with small (more negative) values or large (closer to 0) values on the log base 10 scale?
The log transformation spread out the values, so we can now start to pick up on which p-values are smaller/larger than the others. Recall that the log of 1 is 0 and the log of anything between 0 and 1 will be negative. For example: log10(0.1) = -1, log10(0.01) = -2, and log10(0.001) = -3. So, the SNPs with the smallest p-values in this plot are the ones with smaller (more negative) values.
A traditional Manhattan plot uses a negative log transformation instead. Try plotting the \(-\log_{10}\) p-values now.
chr1.results %>%
  mutate(minuslogp = -log10(P.Value)) %>% # -log10 transformation
  ggplot(aes(x = position, y = minuslogp)) +
  geom_point() +
  labs(x = 'position (bp)', y = expression(paste('-log'[10],'(p-value)'))) +
  scale_x_continuous(labels = scales::comma)
What similarities do you notice between this plot and the Manhattan plots we’ve seen before? What differences do you notice? Discuss.
This looks a lot like the Manhattan plots we saw in our second Journal Club article and the plot we’ve seen a couple of times as the background photo of a slide in the lecture slides. A few differences that I notice:
- we’re only looking at one chromosome — we’ll scale this up in the next section!
- there’s a gap in the middle of the plot (this is most likely where the centromere of chromosome 1 is located; centromeres tend to be difficult to genotype)
- there is no indication of which SNPs are significant (either via a horizontal dashed line or different colored points) — we’ll need to add this ourselves if we want it! (one way to do this is sketched below)
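For example, here’s a sketch of one way to add a dashed line at the conventional genome-wide significance threshold (5e-08) to the chromosome 1 plot above:

chr1.results %>%
  mutate(minuslogp = -log10(P.Value)) %>%
  ggplot(aes(x = position, y = minuslogp)) +
  geom_point() +
  geom_hline(yintercept = -log10(5e-08), color = 'red', linetype = 'dashed') + # genome-wide significance line
  labs(x = 'position (bp)', y = expression(paste('-log'[10],'(p-value)'))) +
  scale_x_continuous(labels = scales::comma)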
All Chromosomes (Genome-Wide)
Repeat the analysis above, but now looking at the SNPs on all chromosomes. Warning: this will take awhile to run!
Hint: The main thing you’ll need to change is which SNPs you’re looping over in your for loop.
This code chunk takes a long time to run. Like above, I recommend utilizing the cache: true code chunk option here.
# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()
# loop through all SNPs
for(i in 1:ncol(X.clean)){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i))

  # fit model
  mod <- lm(y ~ X.clean[,i])

  # get coefficient information
  coefinfo <- tidy(mod)

  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}
[1] "Analyzing SNP 10000"
[1] "Analyzing SNP 20000"
[1] "Analyzing SNP 30000"
[1] "Analyzing SNP 40000"
[1] "Analyzing SNP 50000"
[1] "Analyzing SNP 60000"
[1] "Analyzing SNP 70000"
[1] "Analyzing SNP 80000"
[1] "Analyzing SNP 90000"
[1] "Analyzing SNP 100000"
[1] "Analyzing SNP 110000"
[1] "Analyzing SNP 120000"
[1] "Analyzing SNP 130000"
[1] "Analyzing SNP 140000"
[1] "Analyzing SNP 150000"
[1] "Analyzing SNP 160000"
[1] "Analyzing SNP 170000"
[1] "Analyzing SNP 180000"
[1] "Analyzing SNP 190000"
[1] "Analyzing SNP 200000"
[1] "Analyzing SNP 210000"
[1] "Analyzing SNP 220000"
[1] "Analyzing SNP 230000"
[1] "Analyzing SNP 240000"
[1] "Analyzing SNP 250000"
[1] "Analyzing SNP 260000"
[1] "Analyzing SNP 270000"
[1] "Analyzing SNP 280000"
[1] "Analyzing SNP 290000"
[1] "Analyzing SNP 300000"
[1] "Analyzing SNP 310000"
[1] "Analyzing SNP 320000"
[1] "Analyzing SNP 330000"
[1] "Analyzing SNP 340000"
[1] "Analyzing SNP 350000"
[1] "Analyzing SNP 360000"
[1] "Analyzing SNP 370000"
[1] "Analyzing SNP 380000"
[1] "Analyzing SNP 390000"
[1] "Analyzing SNP 400000"
[1] "Analyzing SNP 410000"
[1] "Analyzing SNP 420000"
[1] "Analyzing SNP 430000"
[1] "Analyzing SNP 440000"
[1] "Analyzing SNP 450000"
[1] "Analyzing SNP 460000"
[1] "Analyzing SNP 470000"
[1] "Analyzing SNP 480000"
[1] "Analyzing SNP 490000"
[1] "Analyzing SNP 500000"
[1] "Analyzing SNP 510000"
[1] "Analyzing SNP 520000"
[1] "Analyzing SNP 530000"
[1] "Analyzing SNP 540000"
[1] "Analyzing SNP 550000"
[1] "Analyzing SNP 560000"
[1] "Analyzing SNP 570000"
[1] "Analyzing SNP 580000"
[1] "Analyzing SNP 590000"
[1] "Analyzing SNP 600000"
[1] "Analyzing SNP 610000"
[1] "Analyzing SNP 620000"
[1] "Analyzing SNP 630000"
[1] "Analyzing SNP 640000"
[1] "Analyzing SNP 650000"
[1] "Analyzing SNP 660000"
[1] "Analyzing SNP 670000"
[1] "Analyzing SNP 680000"
[1] "Analyzing SNP 690000"
[1] "Analyzing SNP 700000"
[1] "Analyzing SNP 710000"
[1] "Analyzing SNP 720000"
[1] "Analyzing SNP 730000"
[1] "Analyzing SNP 740000"
[1] "Analyzing SNP 750000"
[1] "Analyzing SNP 760000"
[1] "Analyzing SNP 770000"
[1] "Analyzing SNP 780000"
[1] "Analyzing SNP 790000"
[1] "Analyzing SNP 800000"
[1] "Analyzing SNP 810000"
[1] "Analyzing SNP 820000"
[1] "Analyzing SNP 830000"
[1] "Analyzing SNP 840000"
[1] "Analyzing SNP 850000"
[1] "Analyzing SNP 860000"
[1] "Analyzing SNP 870000"
[1] "Analyzing SNP 880000"
[1] "Analyzing SNP 890000"
[1] "Analyzing SNP 900000"
[1] "Analyzing SNP 910000"
[1] "Analyzing SNP 920000"
[1] "Analyzing SNP 930000"
[1] "Analyzing SNP 940000"
[1] "Analyzing SNP 950000"
[1] "Analyzing SNP 960000"
[1] "Analyzing SNP 970000"
[1] "Analyzing SNP 980000"
[1] "Analyzing SNP 990000"
[1] "Analyzing SNP 1000000"
[1] "Analyzing SNP 1010000"
[1] "Analyzing SNP 1020000"
[1] "Analyzing SNP 1030000"
[1] "Analyzing SNP 1040000"
[1] "Analyzing SNP 1050000"
[1] "Analyzing SNP 1060000"
[1] "Analyzing SNP 1070000"
[1] "Analyzing SNP 1080000"
[1] "Analyzing SNP 1090000"
[1] "Analyzing SNP 1100000"
[1] "Analyzing SNP 1110000"
[1] "Analyzing SNP 1120000"
[1] "Analyzing SNP 1130000"
[1] "Analyzing SNP 1140000"
[1] "Analyzing SNP 1150000"
[1] "Analyzing SNP 1160000"
[1] "Analyzing SNP 1170000"
[1] "Analyzing SNP 1180000"
[1] "Analyzing SNP 1190000"
[1] "Analyzing SNP 1200000"
[1] "Analyzing SNP 1210000"
[1] "Analyzing SNP 1220000"
[1] "Analyzing SNP 1230000"
[1] "Analyzing SNP 1240000"
[1] "Analyzing SNP 1250000"
[1] "Analyzing SNP 1260000"
[1] "Analyzing SNP 1270000"
[1] "Analyzing SNP 1280000"
I also recommend splitting the time-intensive task (eg looping through all SNPs, above) and later less time-intensive tasks (eg reformatting results, below) into separate code chunks. This way you can cache the time-intensive analysis and play around with the code in the follow-up chunks without having to re-run the time-intensive portions of your analysis every time you edit the latter.
# start with the map info for all SNPs
all.results <- map.clean

# then add betas, SEs, etc.
all.results <- all.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)
# look at results
head(all.results)
chromosome snp.name cM position allele.1 allele.2 MAF
rs11510103 1 rs11510103 NA 557616 G A 0.00621118
rs3131972 1 rs3131972 NA 742584 A G 0.15757576
rs3131969 1 rs3131969 NA 744045 A G 0.13030303
rs1048488 1 rs1048488 NA 750775 C T 0.15853659
rs12562034 1 rs12562034 NA 758311 A G 0.10909091
rs12124819 1 rs12124819 NA 766409 G A 0.28658537
Estimate Std.Error Test.Statistic P.Value
rs11510103 -0.13716883 0.6359358 -0.2156960 0.82950466
rs3131972 -0.38450101 0.1969700 -1.9520787 0.05266367
rs3131969 -0.14108060 0.2086905 -0.6760278 0.49999293
rs1048488 -0.37371908 0.1970853 -1.8962297 0.05973218
rs12562034 0.06013127 0.2215724 0.2713843 0.78644313
rs12124819 -0.22912006 0.1498249 -1.5292522 0.12817656
R is notoriously slow when it comes to for loops. For folks that have taken STAT/COMP 212 or another advanced coding course, can you think of a better (e.g., more efficient) way to code this up?
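One possibility is to skip lm() entirely and use the closed-form formulas for simple linear regression, computed for all SNPs at once with vectorized column operations. The sketch below assumes there are no missing genotypes in X.clean (columns with NAs would need complete-case handling, or imputation, to exactly match the lm() results), and it temporarily allocates a large matrix when squaring X.clean.

# closed-form simple linear regression for every SNP at once
n     <- nrow(X.clean)
x.bar <- colMeans(X.clean)                                    # per-SNP genotype means
y.bar <- mean(y)
sxx   <- colSums(X.clean^2) - n * x.bar^2                     # per-SNP sums of squares
sxy   <- as.vector(crossprod(X.clean, y)) - n * x.bar * y.bar # per-SNP cross-products with y
syy   <- sum(y^2) - n * y.bar^2

beta.fast  <- sxy / sxx                                       # slope estimates
sigma2     <- (syy - beta.fast * sxy) / (n - 2)               # residual variance for each fit
se.fast    <- sqrt(sigma2 / sxx)                              # standard errors
tstat.fast <- beta.fast / se.fast                             # test statistics
pval.fast  <- 2 * pt(-abs(tstat.fast), df = n - 2)            # two-sided p-values

Other options include apply()-style iteration or parallelizing the loop (e.g., with the parallel package), although the closed-form approach avoids the per-SNP model-fitting overhead altogether.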
Part 6: Visualizing Results
Let’s use the qqman package to create a Manhattan plot to visualize our genome-wide association results.
Basic Manhattan Plot
We’ll use the manhattan function in this package to create a Manhattan plot. I’ve given you the basic structure of this command below. Fill in the blanks.
When our data are this big, even plotting the results can take awhile. I recommend using cache: true again here.
manhattan(all.results, # name of the dataset storing your results
chr = "chromosome", # name of the column in your dataset with the chromosome number
bp = "position", # name of column storing the physical (base pair) position of the SNP
p = "P.Value", # name of the column storing the association test p-values
snp = "snp.name", # name of the column storing the SNP names (rs###)
ylim = c(0, 10) # sets the y-axis limits
)
Do you notice any peaks on this Manhattan plot? Where are they located? Does that make sense, given what we know about the “truth” in this simulated scenario? Discuss.
It looks like there is a peak on chromosome 1. This makes sense! We know that the true causal SNP is located on that chromosome.
Going Deeper
The plot we created above is pretty basic. I want you to customize this a bit further! Your task: explore the qqman documentation to learn more about how the manhattan function works and the ways in which you can customize your visualization.
When you’re learning how to use a new R package, one helpful resource can be the package vignettes. Run the code chunk below to see a list of all vignettes available for the qqman package:
# view list of vignettes
vignette(package = 'qqman')
It looks like this package only has one vignette, called “qqman”.
If you ever find yourself in the position of needing/wanting to create your own R package, writing a vignette (or multiple) will be an important part of that process. A good vignette can go a long way in helping people learn how to use your package. Read more about this process here.
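If you do go down that road, one common way to scaffold a new vignette (an aside; this assumes you have the usethis package installed, which is not needed for this lab) is:

# run from within your package's project directory; creates a vignette template
usethis::use_vignette("my-vignette")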
To read a particular vignette, fill in the blanks in the code chunk below with the name of the vignette:
# view specific vignette
vignette("qqman", package = "qqman")
After reading this vignette, make at least one modification to the Manhattan plot code you ran above:
Answers may vary, but here are a couple ideas I came up with:
# focus on autosomes (chromosomes 1-22)
all.results %>%
  filter(chromosome < 23) %>%
  manhattan(chr = "chromosome", bp = "position",
            p = "P.Value", snp = "snp.name", ylim = c(0, 10))
# highlight the genome-wide significant results
sig.snps <- all.results %>%
  filter(P.Value < 5e-08) %>%
  pull(snp.name)
manhattan(all.results, chr = "chromosome", bp = "position",
p = "P.Value", snp = "snp.name", ylim = c(0, 10),
highlight = sig.snps)
# change the color scheme
# use official Mac colors: https://www.macalester.edu/communications/visual-identity/colors/
manhattan(all.results, chr = "chromosome", bp = "position",
p = "P.Value", snp = "snp.name", ylim = c(0, 10),
col = c('#01426A', '#D44420'))
# remove "suggestive" line
# use Wellcome Trust p-value threshold
manhattan(all.results, chr = "chromosome", bp = "position",
p = "P.Value", snp = "snp.name", ylim = c(0, 10),
suggestiveline = F,
genomewideline = -log10(5e-07))
# zoom into chromosome 1
# and annotate top SNPs
manhattan(subset(all.results, chromosome == 1),
chr = "chromosome", bp = "position",
p = "P.Value", snp = "snp.name", ylim = c(0, 20),
annotatePval = 5e-10, annotateTop = FALSE)
Describe the change that you made and anything you learned from your updated visualization here:
Answers will vary. Review the qqman vignette for more examples of customizations that can be made here!
(OPTIONAL): Take this customization challenge one step further. Try to recreate the Manhattan plot above using ggplot!
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = position, y = minuslogp, color = chr)) +
  geom_point() +
  labs(x = 'position (bp)', y = expression(paste('-log'[10],'(p-value)'))) +
  scale_x_continuous(labels = scales::comma)
Notice that the results from different chromosomes are getting plotted on top of one another! This is because of the way we keep track of SNP locations: the position (bp) is relative to the chromosome, so the numbers start back over at zero with each new chromosome. We can use the group aesthetic in ggplot to fix this:
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) +
  geom_point(position = position_dodge(0.8)) +
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)')))
And if you want to play around with the color palette, try something like this:
# try a different color palette
library(NatParksPalettes)
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) +
  geom_point(position = position_dodge(0.8)) +
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)'))) +
  scale_color_manual(values = natparks.pals("Olympic", 25)) +
  theme(legend.position = "none")
# color palette to match the manh plot in the slides
## inspired by https://github.com/UW-GAC/analysis_pipeline
library(RColorBrewer)
chr <- levels(as.factor(all.results$chromosome))
cmap <- setNames(rep_len(brewer.pal(8, "Dark2"), length(chr)), chr)
# make plot
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) +
  geom_point(position = position_dodge(0.8)) +
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)'))) +
  scale_color_manual(values = cmap, breaks = names(cmap)) +
  theme(legend.position = "none")
Part 7 (OPTIONAL): QQ Plots
We saw the “man” part of the qqman package above (creating Manhattan plots). The “qq” part of the qqman package refers to its ability to create Q-Q plots, a common diagnostic tool for genome-wide association studies.
Explore the qqman documentation to learn how to create a QQ plot. Then, make one using the results of our genome-wide analysis.
qq(all.results$`P.Value`, main = 'QQ Plot of GWAS Results')
Based on what we’ve discussed about QQ plots thus far, do you see any reason for concern here?
To me, this QQ plot looks pretty reasonable. Most of the points fall along the y = x line, which makes sense given that we know most SNPs are not associated with our simulated trait. There is a “tail” of points rising above y = x, which means that we have a handful of SNPs with smaller p-values than we would expect to see if this trait did not have any genetic basis. This “tail” can arise when you have some SNPs that are associated with our trait. Again, this is expected here: we simulated this trait, so we know that there is a SNP on chromosome 1 that is actually causally associated with our trait. However, our simulation set-up with a single causal SNP raises the question of why we see more than one SNP with a “smaller than expected” p-value. This is happening because nearby SNPs are correlated with one another. Take a look at the zoomed-in Manhattan plot I created above. Note that there are quite a few SNPs on chromosome 1 with small / “significant” p-values. Even though only one SNP, rs2476601, is actually causal, others are “along for the ride” because they’re nearby and correlated with it.
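If you want to verify that last point yourself, here’s one quick check (a sketch; it uses the sig.snps vector created earlier and assumes those significant SNPs all fall on chromosome 1, which is what we expect in this simulation):

# correlation between genotypes at the causal SNP and at the other "significant" SNPs;
# values near 1 in absolute value indicate strong linkage disequilibrium with rs2476601
cor(X.clean[, 'rs2476601'], X.clean[, sig.snps], use = 'pairwise.complete.obs')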