Lab 2: Implementing and visualizing marginal regression

STAT 494: Statistical Genetics

Author

Solutions

Published

February 14, 2025

Goals

Explore real genetic data
Implement GWAS in R using marginal regression
Visualize GWAS results in R using Manhattan plots (“by hand” and using open-source R packages)
Learn how to use a new R package by reading its documentation

Preparation

To prepare for this assignment, please install (or update) the following packages:

tidyverse
broom
qqman
snpStats

The first three packages are all available on CRAN (the Comprehensive R Archive Network) so you can install them in the usual way:

install.packages(c('tidyverse', 'broom', 'qqman'))

The fourth package is available on Bioconductor, a site dedicated to sharing open-source software for analysis of biological data. Many R packages related to statistical genetics are posted on this site! The installation process is slightly different for Bioconductor packages. Read more about the snpStats package and how to install it here: [link].

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("snpStats")

Please stop by office hours if you are running into any issues getting these packages installed before class!

Exercises

Part 0: Setup

Packages

Load all of the packages that you installed (or updated) above:

library(snpStats)
library(tidyverse)
library(broom)
library(qqman)

Data

To download the data that we’ll be using today, go to this GitHub page and download the 1_QC_GWAS.zip file. Once you’ve downloaded that file successfully, you’ll need to unzip it. On most computers, double-clicking the zipped file will unzip it, but you can also do it from the command line if that doesn’t work.

The data you just downloaded include real genotype information and a simulated binary phenotype for 165 individuals of European ancestry from the International HapMap Project (“HapMap”). The data were processed by a group of researchers in the Netherlands and France for a tutorial they wrote on how to conduct a genome-wide association study. If you’re interested, you can read that tutorial here and find the corresponding code on GitHub. The authors decided to focus their tutorial on an “ethnically homogeneous dataset,” so although HapMap actually sampled individuals from a variety of populations across the world (see this list), the data that we’re using today only includes individuals from Utah with Northern and Western European ancestry (the “CEU” population).

Note

As we saw in our Journal Club 2 article, for many years it was common practice for researchers to focus their analyses only on individuals of European ancestry. The creators of the GWAS tutorial from which we are pulling these data are following suit. We’ll discuss how to move away from this restrictive (and problematic, IMO) setting later in the semester, but for now we’ll use these data as provided.

The information that we need for our analysis today is split across three files: HapMap_3_r3_1.bed, HapMap_3_r3_1.bim, and HapMap_3_r3_1.fam. Although there are many ways that genetic data are stored (and this is in and of itself an interesting research question given their large size), one of the most common file formats is known as the PLINK format. That’s what we see here. The .bed file contains the genotypes, the .fam file contains information on the individuals, and the .bim file contains information on the SNPs.

Note

PLINK is a popular tool for analyzing genetic data and conducting GWAS. Learn more here. We’re going to be conducting our GWAS “by hand” so that we can understand the methods better, but in practice you’ll find that most people use PLINK or other software programs instead. Learning to use PLINK could make for an interesting final project (especially if one of your learning goals is related to developing your computational skills)!

We can use the read.plink function in the snpStats package to read these data into R. First, update the strings below to reflect the correct path to the data on your computer. Here’s what it looks like for me:

# update file paths
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'

# if you want to save yourself some typing, here's another option
gwas.dir <- '../data/1_QC_GWAS/HapMap_3_r3_1'
bed <- paste0(gwas.dir, '.bed')
bim <- paste0(gwas.dir, '.bim')
fam <- paste0(gwas.dir, '.fam')

Now, you can run read.plink:

# NOTE: this might take a minute or two
hapmap <- read.plink(bed, bim, fam)

Stop and Discuss. At the top of the previous code chunk, we specified the code chunk option cache: true. What does this do, and why is it useful here? (Discuss at your table and use your resources!)

Whenever you have a time-consuming code chunk in your Quarto document (like this one), rendering can be a pain. In these cases, it can be helpful to set the code chunk option cache: TRUE. Once you’ve rendered the file once, all future renders will skip this chunk (provided none of the code in the chunk has changed) and instead just load in the results from the last run. This can be a huge time-saver!

Read more about caching here and here.

Part 1: Data Exploration

The new hapmap data object is a “list” of three items:

class(hapmap)

[1] "list"

names(hapmap)

[1] "genotypes" "fam"       "map"

genotypes contains the genotype data
fam contains information on each individual
map contains information on each SNP

Explore genotype data

Get info about the genotype data:

hapmap$genotypes

A SnpMatrix with  165 rows and  1457897 columns
Row names:  NA06989 ... NA12865 
Col names:  rs2185539 ... rs1973881

How many individuals do we have in this dataset?

165

How many SNPs?

1,457,897

Explore sample information

Look at the info we have on each individual:

head(hapmap$fam)

        pedigree  member  father  mother sex affected
NA06989     1328 NA06989    <NA>    <NA>   2        2
NA11891     1377 NA11891    <NA>    <NA>   1        2
NA11843     1349 NA11843    <NA>    <NA>   1        1
NA12341     1330 NA12341    <NA>    <NA>   2        2
NA12739     1444 NA12739 NA12748 NA12749   1       NA
NA10850     1344 NA10850    <NA> NA12058   2       NA

What are each of these columns telling us? (Discuss at your table!)

The fam file keeps track of information about each of the people in our study. In this case, we have the following:

pedigree: the family ID number, ie the ID of the family, or “pedigree”, to which this individual belongs

member: the individual ID number (same as the row name)

father: the ID of the individual’s father, if known and present in this sample (NA otherwise)

mother: the ID of the individual’s mother, if known and present in this sample (NA otherwise)

sex: binary sex (1 = male, 2 = female, 0 = unknown)

affected: a binary simulated phenotype (1 = control, 2 = case)

Explore SNP information

We can also look at the info we have on each SNP:

head(hapmap$map)

           chromosome   snp.name cM position allele.1 allele.2
rs2185539           1  rs2185539 NA   556738        T        C
rs11510103          1 rs11510103 NA   557616        G        A
rs11240767          1 rs11240767 NA   718814        T        C
rs3131972           1  rs3131972 NA   742584        A        G
rs3131969           1  rs3131969 NA   744045        A        G
rs1048488           1  rs1048488 NA   750775        C        T

What are each of these columns telling us? (Discuss at your table!)

chromosome: the chromosome on which the SNP resides

snp.name: the name of the SNP (commonly referred to as the “rsID”)

cM: the abbreviation cM stands for centiMorgans; this is a unit of distance that we use when talking about how far about SNPs are from one another along the genome; we refer to this type of distance as “genetic distance”

position: this tell us the base pair position of the SNP (e.g., position = 1 would be the very first nucleotide in our DNA sequence); we refer to this as “physical (base pair) distance”

allele.1: this is one of the alleles at this SNP (according to the PLINK documentation, this is usually the minor allele)

allele.2: this is the other allele at this SNP (according to the PLINK documentation, this is usually the major allele)

Calculate MAFs

One useful piece of information about these SNPs that the map data frame does not already contain is the minor allele frequency (MAF) — that is, the frequency of the minor allele in this dataset. Thankfully, the snpStats package provides some helpful code for quickly calculating the MAF for each SNP in the dataset:

maf <- col.summary(hapmap$genotypes)$MAF
head(maf)

[1] 0.00000000 0.00621118 0.00000000 0.15757576 0.13030303 0.15853659

Let’s add the minor allele frequencies to our map data frame. (This will come in handy later.)

# add new MAF variable 
hapmap$map <- hapmap$map %>%
  mutate(MAF = maf)

# look at SNP info again
head(hapmap$map)

           chromosome   snp.name cM position allele.1 allele.2        MAF
rs2185539           1  rs2185539 NA   556738        T        C 0.00000000
rs11510103          1 rs11510103 NA   557616        G        A 0.00621118
rs11240767          1 rs11240767 NA   718814        T        C 0.00000000
rs3131972           1  rs3131972 NA   742584        A        G 0.15757576
rs3131969           1  rs3131969 NA   744045        A        G 0.13030303
rs1048488           1  rs1048488 NA   750775        C        T 0.15853659

What do you notice about the minor allele frequency of the first and third SNP in our dataset (rs2185539 and rs11240767)? What does this mean?

These SNPs have a MAF of zero! This means that everyone in the dataset has the same genotype at these positions. In the case of the first SNP, rs2185539, everyone is either TT or CC; likewise for rs11240767. We refer to these SNPs as monomorphic (“mono” meaning “one”).

Note that seeing a SNP with a MAF of zero in our dataset does not necessarily mean that the population MAF is likewise zero. In other words, there could be people, not in our sample, who do carry the other allele. We just don’t happen to see that in this sample.

Reformat data for analysis

The snpstats package uses a unique format to store data. Currently, genotypes are coded as 01, 02, and 03 (with 00 representing missing values).

# look at first five rows/columns
hapmap$genotypes@.Data[1:5,1:5]

        rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989        03         03         03        02        02
NA11891        03         03         03        02        03
NA11843        03         03         03        03        03
NA12341        03         03         03        02        02
NA12739        03         03         03        03        03

Let’s convert this to the 0, 1, 2 (and NA) format that we’ve talked about in class. Run the code chunk below.

# what type of object is hapmap$genotypes
class(hapmap$genotypes)

[1] "SnpMatrix"
attr(,"package")
[1] "snpStats"

# convert from "SnpMatrix" to "numeric"
X <- as(hapmap$genotypes, "numeric")

# what type of object do we have now 
class(X)

[1] "matrix" "array"

# look at first five rows/columns to confirm conversion looks ok
X[1:5, 1:5]

        rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989         2          2          2         1         1
NA11891         2          2          2         1         2
NA11843         2          2          2         2         2
NA12341         2          2          2         1         1
NA12739         2          2          2         2         2

You should now see a matrix of 0’s, 1’s, and 2’s.

(OPTIONAL) Data Wrangling Review. Now that we’ve converted the genotype data into a more familiar format, how might you use tidyverse data wrangling tools from COMP/STAT 112 or STAT 155 to calculate the minor allele frequency of each SNP?

# your code here

Part 2: Trait Simulation

Background

Simulation studies are a widely used technique in statistical research. The general idea is this…

We “simulate”, or artificially generate, data in a way where we know the “truth”. Then, we apply a statistical method to these simulated data to see how it performs. Since we created the data, we can compare the result we get from our statistical method to the “truth” and get a sense of how accurate the method is. This is a particularly useful framework when we want to evaluate a new method, understand how a method’s performance varies across different scenarios, or compare multiple methods against each other. It can also be useful in contexts, like statistical genetics, in which finding/working with “real” data can be challenging (due to data privacy concerns, logistical constraints, etc.).

Although the genotype data we are working with in this activity are real, the phenotype/trait (the “affected” column in the fam sample information) is not. Let’s create our own!

Picking a causal SNP

Let’s simulate a trait that depends on the SNP known as rs2476601. Here’s what we know about this SNP:

hapmap$map %>%
  filter(snp.name == 'rs2476601')

          chromosome  snp.name cM  position allele.1 allele.2       MAF
rs2476601          1 rs2476601 NA 114179091        A        G 0.1196319

We can learn more about this SNP by searching the dbSNP database: https://www.ncbi.nlm.nih.gov/snp/. Enter the rsID (the SNP’s name) into the dbSNP search bar and see what you can learn. Summarize your findings here.

There is a lot of information about this SNP on dbSNP. A few things that stood out to me:

This SNP is located on chromosome 1.

It is affiliated with (located in/near) two genes: PTPN22 and AP4B1-AS1

Although there is some variation in allele frequencies across populations, in all of them the G allele is much more common (frequency > 90%) than the A allele. This is consistent with what we see in our data (MAF = 0.1196319).

It has been implicated as a potential risk factor for diabetes, rheumatoid arthritis, and other (mostly autoimmune) diseases.

It was mentioned in the Wellcome Trust GWAS paper that we read for Journal Club :)

Simulating a quantitative trait

Now, let’s create a quantitative trait y that depends on your genotype at this SNP plus some random noise. We’ll give this SNP an “effect size” of 2: for every additional copy of the minor allele that you carry, your trait value will increase by 2 units (plus or minus some random noise). I’ve set this up for you – run the code chunk below.

# simulate trait
set.seed(494) # set seed for reproducibility
n <- nrow(X) # calculate the number of individuals 
y <- 2 * X[,'rs2476601'] + rnorm(n, 0, 1) # y = 2 * x + e

# look at the first few simulated trait values
head(y)

  NA06989   NA11891   NA11843   NA12341   NA12739   NA10850 
0.2082684 3.6267809 3.3422648 1.1696396 3.9450827 4.8350451

Since this is a simulation, we know by design that our trait y depends on SNP rs2476601 and nothing else. We can use this information to check how well our methods are working. We’ll refer to rs2476601 as the causal SNP from here on out. (NOTE: causal, not casual)

Nothing to write here! Just run the code above.

Part 3: GWAS (Attempt #1)

In a real genetic study, we wouldn’t know the location of the true causal SNP(s) beforehand, so we’d need to run a genome-wide association study to see if we can figure out which variants are associated with the trait. As we’ve discussed in class, the method we’ll use for GWAS is marginal regression.

To start, let’s look at what happens when we run marginal regression on the first five SNPs in this dataset. Run this code chunk.

## fit models at first few SNPs
mod1 <- lm(y ~ X[,1])
mod2 <- lm(y ~ X[,2])
mod3 <- lm(y ~ X[,3])
mod4 <- lm(y ~ X[,4])
mod5 <- lm(y ~ X[,5])

## look at the model summaries
tidy(mod1)

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     3.50     0.100      35.0  2.13e-77
2 X[, 1]         NA       NA          NA   NA

tidy(mod2)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    3.80      1.27      2.99  0.00319
2 X[, 2]        -0.137     0.636    -0.216 0.830

tidy(mod3)

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     3.50     0.100      35.0  2.13e-77
2 X[, 3]         NA       NA          NA   NA

tidy(mod4)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    4.15      0.346     12.0  4.12e-24
2 X[, 4]        -0.385     0.197     -1.95 5.27e- 2

tidy(mod5)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.75      0.376     9.97  1.58e-18
2 X[, 5]        -0.141     0.209    -0.676 5.00e- 1

Do you notice anything weird about the 1st and 3rd SNPs? What’s going on here?

Yes, we get NAs for the estimate, standard error, test statistics, and p-value for the slope of these two models. As we saw above, both of these SNPs are monomorphic: everyone in the dataset has the same genotype. Without variation in the genotypes that people carry at these positions, we can’t estimate the association between these SNPs and the trait.

(Like Lab 1, we have a situation of linear dependence here: the column of minor allele counts is a linear combination of the intercept column. Since our design matrix isn’t full rank, our least squares estimator is not defined.)

Part 4: Quality Control

We’re not going to be able to detect an association between our trait and any of the SNPs that have a MAF of zero (also known as monomorphic SNPs), so let’s remove them. We can remove these SNPs from the map data frame with a filter statement: **fill in the blanks (___) below and then run this code chunk.**

# keep only those SNPs with MAF > 0
map.clean <- hapmap$map %>%
  filter(maf > 0)

How many SNPs are left after filtering? How many did we remove?

# check dimensions after filtering
dim(map.clean)

[1] 1283751       7

# compare to original dimensions
dim(hapmap$map)

[1] 1457897       7

Remaining SNPs: 1283751 SNPs

Removed SNPs: 174146 monomorphic SNPs

NOTE: Whenever we do any data cleaning, it’s important to keep track of our steps and report how many cases or variables were impacted!

Now, let’s remove these monomorphic SNPs from our genotype matrix (X). Since we’re working with a matrix here instead of a data frame, we can’t use the select function. Here’s one way to remove the columns corresponding to the monomorphic SNPs from X: (run this)

# create vector of which SNPs have a MAF of 0
monomorphic <- which(maf == 0) 
head(monomorphic)

[1]  1  3 10 11 12 17

# remove columns in the monomorphic vector
X.clean <- X[,-monomorphic]

Confirm that the new “clean” genotype matrix has the correct number of rows and columns before you move on.

dim(X.clean)

[1]     165 1283751

You should have 165 rows and 1283751 columns in this new genotype matrix. We didn’t remove any people, but we did remove some SNPs.

Note

As we saw in our Journal Club 2 paper, there are various pre-processing/“data cleaning” steps that researchers perform prior to implementing a genome-wide association study. We refer to these steps as quality control (QC for short). If you’re curious what QC usually entails, take a look at the other files you downloaded from the GWAS tutorial GitHub repository (particularly 1_Main_script_QC_GWAS.txt) and/or skim this article:

Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, Boehm F, Caporaso NE, Cornelis MC, Edenberg HJ, Gabriel SB, Harris EL, Hu FB, Jacobs KB, Kraft P, Landi MT, Lumley T, Manolio TA, McHugh C, Painter I, Paschall J, Rice JP, Rice KM, Zheng X, Weir BS; GENEVA Investigators. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010 Sep;34(6):591-602. doi: 10.1002/gepi.20516. PMID: 20718045; PMCID: PMC3061487.

To keep things simple for today, we’re going to restrict our QC to removing monomoprhic SNPs. But, be aware that there are additional QC steps that would normally be conducted before GWAS.

Part 5: GWAS (Attempt #2)

Chromosome 1

Even after removing the monomorphic SNPs, we still have 1283751 variants remaining. This might take awhile to analyze in R, so let’s focus on just the SNPs on the first chromosome to start.

Run the code chunk below to make a list of which SNPs live on chromosome 1:

chr1.snps <- which(map.clean$chromosome == 1)
head(chr1.snps)

[1] 1 2 3 4 5 6

length(chr1.snps)

[1] 103646

Now, we’re going to loop through each of the SNPs on chromosome 1, fitting a linear regression model at each SNP. For each model, we’ll record the estimates (betas), standard errors (ses), test statistics (tstats) and p-values (pvals) for the coefficient of interest (the slope). (run this)

NOTE: the code chunk below takes awhile to run, so I set the cache code chunk option to be true!

# NOTE: this will take awhile to run
# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()

# loop through chromosome 1 SNPs
for(i in chr1.snps){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i)) 
  # fit model
  mod <- lm(y ~ X.clean[,i])
  # get coefficient information
  coefinfo <- tidy(mod)
  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}

[1] "Analyzing SNP 10000"
[1] "Analyzing SNP 20000"
[1] "Analyzing SNP 30000"
[1] "Analyzing SNP 40000"
[1] "Analyzing SNP 50000"
[1] "Analyzing SNP 60000"
[1] "Analyzing SNP 70000"
[1] "Analyzing SNP 80000"
[1] "Analyzing SNP 90000"
[1] "Analyzing SNP 100000"

Let’s add our results to our map data frame that contains information about each SNP. (run this)

# start with the map info for the chr 1 SNPs
chr1.results <- map.clean %>%
  filter(chromosome == 1)

# then add betas, SEs, etc.
chr1.results <- chr1.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)

# look at results
head(chr1.results)

           chromosome   snp.name cM position allele.1 allele.2        MAF
rs11510103          1 rs11510103 NA   557616        G        A 0.00621118
rs3131972           1  rs3131972 NA   742584        A        G 0.15757576
rs3131969           1  rs3131969 NA   744045        A        G 0.13030303
rs1048488           1  rs1048488 NA   750775        C        T 0.15853659
rs12562034          1 rs12562034 NA   758311        A        G 0.10909091
rs12124819          1 rs12124819 NA   766409        G        A 0.28658537
              Estimate Std.Error Test.Statistic    P.Value
rs11510103 -0.13716883 0.6359358     -0.2156960 0.82950466
rs3131972  -0.38450101 0.1969700     -1.9520787 0.05266367
rs3131969  -0.14108060 0.2086905     -0.6760278 0.49999293
rs1048488  -0.37371908 0.1970853     -1.8962297 0.05973218
rs12562034  0.06013127 0.2215724      0.2713843 0.78644313
rs12124819 -0.22912006 0.1498249     -1.5292522 0.12817656

Now we’re ready to plot our results! We’d like to get a visual of which SNPs have small p-values (suggesting a possible association between the SNP and the trait). To start, create a scatterplot of the p-value of each SNP versus its position along the chromosome.

chr1.results %>%
  ggplot(aes(x = position, y = P.Value)) + 
  geom_point() + 
  labs(x = 'position (bp)', y = 'p-value') + # update axis titles
  scale_x_continuous(labels = scales::comma) # reformat x-axis labels

What do you think of this visualization? Is it an effective tool for distinguishing which SNPs have small p-values?

Although this seems like a reasonable place to start in terms of plotting results, the result is a mess! It’s really hard to see which SNPs have the smallest p-values.

One way to improve this visualization is to apply a log transformation to the p-values. When we have lots of numbers squished into a small range (0 to 1, in this case), a log transformation can be a nice way to spread these out. Try plotting the log (base 10) of the p-values for each SNP instead.

chr1.results %>%
  mutate(logp = log10(P.Value)) %>% # log transformation
  ggplot(aes(x = position, y = logp)) + 
  geom_point() + 
  labs(x = 'position (bp)', y = expression(paste('log'[10],'(p-value)'))) + 
  scale_x_continuous(labels = scales::comma)

What do you notice now? Are the SNPs with small p-values the ones with small (more negative) values or large (closer to 0) values on the log base 10 scale?

The log transformation spread out the values so we can start to pick up on which p-values are smaller/larger than the others, now. Recall that the log of 1 is 0 and the log of anything between 0 and 1 will be negative. For example: log(0.1) = -1, log(0.01) = -2, and log(0.001) = -6.9077553. So, the SNPs with the smallest p-values in this plot are the ones with smaller (more negative) values.

A traditional Manhattan plot uses a negative log transformation instead. Try plotting the \(-log_{10}\) p-values now.

chr1.results %>%
  mutate(minuslogp = -log10(P.Value)) %>% # -log10 transformation
  ggplot(aes(x = position, y = minuslogp)) + 
  geom_point() + 
  labs(x = 'position (bp)', y = expression(paste('-log'[10],'(p-value)'))) + 
  scale_x_continuous(labels = scales::comma)

What similarities do you notice between this plot and the Manhattan plots we’ve seen before? What differences do you notice? Discuss.

This looks a lot like the Manhattan plots we saw in our second Journal Club article and the plot we’ve seen a couple of times as the background photo of a slide in the lecture slides. A few differences that I notice:

we’re only looking at one chromosome — we’ll scale this up in the next section!

there’s a gap in the middle of the plot (this is most likely where the centromere of chromosome 1 is located; centromeres tend to be difficult to genotype)

there is no indication of which SNPs are significant (either via a horizontal dashed line or different colored points) — we’ll need to add this ourselves if we want it!

All Chromosomes (Genome-Wide)

Repeat the analysis above, but now looking at the SNPs on all chromosomes. Warning: this will take awhile to run!

Hint: The main thing you’ll need to change is which SNPs you’re looping over in your for loop.

This code chunk takes a long time to run. Like above, I recommend utilizing the cache: true code chunk option here.

# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()

# loop through all SNPs
for(i in 1:ncol(X.clean)){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i)) 
  # fit model
  mod <- lm(y ~ X.clean[,i])
  # get coefficient information
  coefinfo <- tidy(mod)
  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}

[1] "Analyzing SNP 10000"
[1] "Analyzing SNP 20000"
[1] "Analyzing SNP 30000"
[1] "Analyzing SNP 40000"
[1] "Analyzing SNP 50000"
[1] "Analyzing SNP 60000"
[1] "Analyzing SNP 70000"
[1] "Analyzing SNP 80000"
[1] "Analyzing SNP 90000"
[1] "Analyzing SNP 100000"
[1] "Analyzing SNP 110000"
[1] "Analyzing SNP 120000"
[1] "Analyzing SNP 130000"
[1] "Analyzing SNP 140000"
[1] "Analyzing SNP 150000"
[1] "Analyzing SNP 160000"
[1] "Analyzing SNP 170000"
[1] "Analyzing SNP 180000"
[1] "Analyzing SNP 190000"
[1] "Analyzing SNP 200000"
[1] "Analyzing SNP 210000"
[1] "Analyzing SNP 220000"
[1] "Analyzing SNP 230000"
[1] "Analyzing SNP 240000"
[1] "Analyzing SNP 250000"
[1] "Analyzing SNP 260000"
[1] "Analyzing SNP 270000"
[1] "Analyzing SNP 280000"
[1] "Analyzing SNP 290000"
[1] "Analyzing SNP 300000"
[1] "Analyzing SNP 310000"
[1] "Analyzing SNP 320000"
[1] "Analyzing SNP 330000"
[1] "Analyzing SNP 340000"
[1] "Analyzing SNP 350000"
[1] "Analyzing SNP 360000"
[1] "Analyzing SNP 370000"
[1] "Analyzing SNP 380000"
[1] "Analyzing SNP 390000"
[1] "Analyzing SNP 400000"
[1] "Analyzing SNP 410000"
[1] "Analyzing SNP 420000"
[1] "Analyzing SNP 430000"
[1] "Analyzing SNP 440000"
[1] "Analyzing SNP 450000"
[1] "Analyzing SNP 460000"
[1] "Analyzing SNP 470000"
[1] "Analyzing SNP 480000"
[1] "Analyzing SNP 490000"
[1] "Analyzing SNP 500000"
[1] "Analyzing SNP 510000"
[1] "Analyzing SNP 520000"
[1] "Analyzing SNP 530000"
[1] "Analyzing SNP 540000"
[1] "Analyzing SNP 550000"
[1] "Analyzing SNP 560000"
[1] "Analyzing SNP 570000"
[1] "Analyzing SNP 580000"
[1] "Analyzing SNP 590000"
[1] "Analyzing SNP 600000"
[1] "Analyzing SNP 610000"
[1] "Analyzing SNP 620000"
[1] "Analyzing SNP 630000"
[1] "Analyzing SNP 640000"
[1] "Analyzing SNP 650000"
[1] "Analyzing SNP 660000"
[1] "Analyzing SNP 670000"
[1] "Analyzing SNP 680000"
[1] "Analyzing SNP 690000"
[1] "Analyzing SNP 700000"
[1] "Analyzing SNP 710000"
[1] "Analyzing SNP 720000"
[1] "Analyzing SNP 730000"
[1] "Analyzing SNP 740000"
[1] "Analyzing SNP 750000"
[1] "Analyzing SNP 760000"
[1] "Analyzing SNP 770000"
[1] "Analyzing SNP 780000"
[1] "Analyzing SNP 790000"
[1] "Analyzing SNP 800000"
[1] "Analyzing SNP 810000"
[1] "Analyzing SNP 820000"
[1] "Analyzing SNP 830000"
[1] "Analyzing SNP 840000"
[1] "Analyzing SNP 850000"
[1] "Analyzing SNP 860000"
[1] "Analyzing SNP 870000"
[1] "Analyzing SNP 880000"
[1] "Analyzing SNP 890000"
[1] "Analyzing SNP 900000"
[1] "Analyzing SNP 910000"
[1] "Analyzing SNP 920000"
[1] "Analyzing SNP 930000"
[1] "Analyzing SNP 940000"
[1] "Analyzing SNP 950000"
[1] "Analyzing SNP 960000"
[1] "Analyzing SNP 970000"
[1] "Analyzing SNP 980000"
[1] "Analyzing SNP 990000"
[1] "Analyzing SNP 1000000"
[1] "Analyzing SNP 1010000"
[1] "Analyzing SNP 1020000"
[1] "Analyzing SNP 1030000"
[1] "Analyzing SNP 1040000"
[1] "Analyzing SNP 1050000"
[1] "Analyzing SNP 1060000"
[1] "Analyzing SNP 1070000"
[1] "Analyzing SNP 1080000"
[1] "Analyzing SNP 1090000"
[1] "Analyzing SNP 1100000"
[1] "Analyzing SNP 1110000"
[1] "Analyzing SNP 1120000"
[1] "Analyzing SNP 1130000"
[1] "Analyzing SNP 1140000"
[1] "Analyzing SNP 1150000"
[1] "Analyzing SNP 1160000"
[1] "Analyzing SNP 1170000"
[1] "Analyzing SNP 1180000"
[1] "Analyzing SNP 1190000"
[1] "Analyzing SNP 1200000"
[1] "Analyzing SNP 1210000"
[1] "Analyzing SNP 1220000"
[1] "Analyzing SNP 1230000"
[1] "Analyzing SNP 1240000"
[1] "Analyzing SNP 1250000"
[1] "Analyzing SNP 1260000"
[1] "Analyzing SNP 1270000"
[1] "Analyzing SNP 1280000"

I also recommend splitting the time-intensive task (eg looping through all SNPs, above) and later less time-intensive tasks (eg reformatting results, below) into separate code chunks. This way you can cache the time-intensive analysis and play around with the code in the follow-up chunks without having to re-run the time-intensive portions of your analysis every time you edit the latter.

# start with the map info for the chr 1 SNPs
all.results <- map.clean

# then add betas, SEs, etc.
all.results <- all.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)

# look at results
head(all.results)

           chromosome   snp.name cM position allele.1 allele.2        MAF
rs11510103          1 rs11510103 NA   557616        G        A 0.00621118
rs3131972           1  rs3131972 NA   742584        A        G 0.15757576
rs3131969           1  rs3131969 NA   744045        A        G 0.13030303
rs1048488           1  rs1048488 NA   750775        C        T 0.15853659
rs12562034          1 rs12562034 NA   758311        A        G 0.10909091
rs12124819          1 rs12124819 NA   766409        G        A 0.28658537
              Estimate Std.Error Test.Statistic    P.Value
rs11510103 -0.13716883 0.6359358     -0.2156960 0.82950466
rs3131972  -0.38450101 0.1969700     -1.9520787 0.05266367
rs3131969  -0.14108060 0.2086905     -0.6760278 0.49999293
rs1048488  -0.37371908 0.1970853     -1.8962297 0.05973218
rs12562034  0.06013127 0.2215724      0.2713843 0.78644313
rs12124819 -0.22912006 0.1498249     -1.5292522 0.12817656

Going Deeper

R is notoriously slow when it comes to for loops. For folks that have taken STAT/COMP 212 or another advanced coding course, can you think of a better (eg more efficient) way to code this up?

Part 6: Visualizing Results

Let’s use the qqman package to create a Manhattan plot to visualize our genome-wide association results.

Basic Manhattan Plot

We’ll use the manhattan function in this package to create a Manhattan plot. I’ve given you the basic structure of this command below. Fill in the blanks.

When our data are this big, even plotting the results can take awhile. I recommend using cache: true again here.

manhattan(all.results, # name of the dataset storing your results
          chr = "chromosome", # name of the column in your dataset with the chromosome number
          bp = "position", # name of column storing the physical (base pair) position of the SNP
          p = "P.Value", # name of the column storing the association test p-values
          snp = "snp.name", # name of the column storing the SNP names (rs###)
          ylim = c(0, 10) # sets the y-axis limits
          )

Do you notice any peaks on this Manhattan plot? Where are they located? Does that make sense, given what we know about the “truth” in this simulated scenario? Discuss.

It looks like there is a peak on chromosome 1. This makes sense! We know that the true causal SNP is located on that chromosome.

Going Deeper

The plot we created above is pretty basic. I want you to customize this a bit further! Your task: explore the qqman documentation to learn more about how the manhattan function works and the ways in which you can customize your visualization.

When you’re learning how to use a new R package, one helpful resource can be the package vignettes. Run the code chunk below to see a list of all vignettes available for the qqman package:

# view list of vignettes
vignette(package = 'qqman')

It looks like this package only has one vignette, called “qqman”.

Going Deeper

If you ever find yourself in the position of needing/wanting to create your own R package, writing a vignette (or multiple) will be an important part of that process. A good vignette can go a long way in helping people learn how to use your package. Read more about this process here.

To read a particular vignette, fill in the blanks in the code chunk below with the name of the vignette:

# view specific vignette
vignette("qqman", package = "qqman")

After reading this vignette, make at least one modification to the Manhattan plot code you ran above:

Answers may vary, but here are a couple ideas I came up with:

# focus on autosomes (chromosomes 1-22)
all.results %>%
  filter(chromosome < 23) %>%
  manhattan(chr = "chromosome", bp = "position", 
            p = "P.Value", snp = "snp.name", ylim = c(0, 10))

# highlight the genome-wide significant results
sig.snps <- all.results %>%
  filter(P.Value < 5e-08) %>% 
  pull(snp.name)

manhattan(all.results, chr = "chromosome", bp = "position", 
          p = "P.Value", snp = "snp.name", ylim = c(0, 10),
          highlight = sig.snps)

# change the color scheme
# use official Mac colors: https://www.macalester.edu/communications/visual-identity/colors/
manhattan(all.results, chr = "chromosome", bp = "position", 
          p = "P.Value", snp = "snp.name", ylim = c(0, 10),
          col = c('#01426A', '#D44420'))

# remove "suggestive" line
# use Wellcome Trust p-value threshold
manhattan(all.results, chr = "chromosome", bp = "position", 
          p = "P.Value", snp = "snp.name", ylim = c(0, 10),
          suggestiveline = F, 
          genomewideline = -log10(5e-07))

# zoom into chromosome 1
# and annotate top SNPs
manhattan(subset(all.results, chromosome == 1), 
          chr = "chromosome", bp = "position", 
          p = "P.Value", snp = "snp.name", ylim = c(0, 20),
          annotatePval = 5e-10, annotateTop = FALSE)

Describe the change that you made and anything you learned from your updated visualization here:

Answers will vary. Review the qqman vignette for more examples of customizations that can be made here!

(OPTIONAL): Take this customization challenge one step further. Try to recreate the Manhattan plot above using ggplot!

all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = position, y = minuslogp, color = chr)) + 
  geom_point() + 
  labs(x = 'position (bp)', y = expression(paste('-log'[10],'(p-value)'))) + 
  scale_x_continuous(labels = scales::comma)

Notice that the results from different chromosomes are getting plotted on top of one another! This is because of the way we keep track of SNP locations: the position (bp) is relative to the chromosome, so the numbers start back over at zero with each new chromosome. We can use the group aesthetic in ggplot to fix this:

all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) + 
  geom_point(position = position_dodge(0.8)) + 
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)')))

And if you want to play around with the color palette, try something like this:

# try a different color palette
library(NatParksPalettes)
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) + 
  geom_point(position = position_dodge(0.8)) + 
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)'))) + 
  scale_color_manual(values=natparks.pals("Olympic", 25)) + 
  theme(legend.position="none")

# color palette to match the manh plot in the slides
## inspired by https://github.com/UW-GAC/analysis_pipeline
library(RColorBrewer)
chr <- levels(as.factor(all.results$chromosome))
cmap <- setNames(rep_len(brewer.pal(8, "Dark2"), length(chr)), chr)

# make plot
all.results %>%
  mutate(minuslogp = -log10(P.Value),
         chr = as.factor(chromosome)) %>%
  ggplot(aes(x = chr, y = minuslogp, group = interaction(chr, position), color = chr)) + 
  geom_point(position = position_dodge(0.8)) + 
  labs(x = 'chromosome', y = expression(paste('-log'[10],'(p-value)'))) + 
  scale_color_manual(values=cmap, breaks=names(cmap)) +
  theme(legend.position="none")

Part 7 (OPTIONAL): QQ Plots

We saw the “man” part of the qqman package above (creating manhattan plots). The “qq” part of the qqman package refers to its ability to create Q-Q plots, a common diagnostic tool for genome-wide association studies.

Explore the qqman documentation to learn how to create a QQ plot.

Then, make one using the results of our genome-wide analysis.

qq(all.results$`P.Value`, main = 'QQ Plot of GWAS Results')

Based on what we’ve discussed about QQ plots thus far, do you see any reason for concern here?

To me, this QQ plot looks pretty reasonable. Most of the points fall along the y = x line, which makes sense given that we know most SNPs are not associated with our simulated trait. There is a “tail” of points rising above y = x, which means that we have a handful of SNPs with smaller p-values than we would expect to see if this trait did not have any genetic basis. This “tail” can arise when you have some SNPs that are associated with our trait. Again, this is expected here: we simulated this trait, so we know that there is a SNP on chromosome 1 that is actually causally associated with our trait. However, our simulation set-up with a single causal SNP raises the question of why we see more than one SNP with a “smaller than expected” p-value. This is happening because nearby SNPs are correlated with one another. Take a look at the zoomed-in Manhattan plot I created above. Note that there are quite a few SNPs on chromosome 1 with small / “significant” p-values. Even though only one SNP, rs2476601, is actually causal, others are “along for the ride” because they’re nearby and correlated with it.