Lab 2: Implementing and visualizing marginal regression

STAT 494: Statistical Genetics

Published

February 6, 2025

Goals

Explore real genetic data
Implement GWAS in R using marginal regression
Visualize GWAS results in R using Manhattan plots (“by hand” and using open-source R packages)
Learn how to use a new R package by reading its documentation

Preparation

To prepare for this assignment, please install (or update) the following packages:

tidyverse
broom
qqman
snpStats

The first three packages are all available on CRAN (the Comprehensive R Archive Network) so you can install them in the usual way:

install.packages(c('tidyverse', 'broom', 'qqman'))

The fourth package is available on Bioconductor, a site dedicated to sharing open-source software for analysis of biological data. Many R packages related to statistical genetics are posted on this site! The installation process is slightly different for Bioconductor packages. Read more about the snpStats package and how to install it here: [link].

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("snpStats")

Please stop by office hours if you are running into any issues getting these packages installed before class!

Exercises

Part 0: Setup

Packages

Load all of the packages that you installed (or updated) above:

library(snpStats)
library(tidyverse)
library(broom)
library(qqman)

Data

To download the data that we’ll be using today, go to this GitHub page and download the 1_QC_GWAS.zip file. Once you’ve downloaded that file successfully, you’ll need to unzip it. On most computers, double-clicking the zipped file will unzip it, but you can also do it from the command line if that doesn’t work.

The data you just downloaded include real genotype information and a simulated binary phenotype for 165 individuals of European ancestry from the International HapMap Project (“HapMap”). The data were processed by a group of researchers in the Netherlands and France for a tutorial they wrote on how to conduct a genome-wide association study. If you’re interested, you can read that tutorial here and find the corresponding code on GitHub. The authors decided to focus their tutorial on an “ethnically homogeneous dataset,” so although HapMap actually sampled individuals from a variety of populations across the world (see this list), the data that we’re using today only includes individuals from Utah with Northern and Western European ancestry (the “CEU” population).

Note

As we saw in our Journal Club 2 article, for many years it was common practice for researchers to focus their analyses only on individuals of European ancestry. The creators of the GWAS tutorial from which we are pulling these data are following suit. We’ll discuss how to move away from this restrictive (and problematic, IMO) setting later in the semester, but for now we’ll use these data as provided.

The information that we need for our analysis today is split across three files: HapMap_3_r3_1.bed, HapMap_3_r3_1.bim, and HapMap_3_r3_1.fam. Although there are many ways that genetic data are stored (and this is in and of itself an interesting research question given their large size), one of the most common file formats is known as the PLINK format. That’s what we see here. The .bed file contains the genotypes, the .fam file contains information on the individuals, and the .bim file contains information on the SNPs.

Note

PLINK is a popular tool for analyzing genetic data and conducting GWAS. Learn more here. We’re going to be conducting our GWAS “by hand” so that we can understand the methods better, but in practice you’ll find that most people use PLINK or other software programs instead. Learning to use PLINK could make for an interesting final project (especially if one of your learning goals is related to developing your computational skills)!

We can use the read.plink function in the snpStats package to read these data into R. First, update the strings below to reflect the correct path to the data on your computer. Here’s what it looks like for me:

# update file paths
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'

Now, you can run read.plink:

# NOTE: this might take a minute or two
hapmap <- read.plink(bed, bim, fam)

Stop and Discuss. At the top of the previous code chunk, we specified the code chunk option cache: true. What does this do, and why is it useful here? (Discuss at your table and use your resources!)

Part 1: Data Exploration

The new hapmap data object is a “list” of three items:

class(hapmap)

[1] "list"

names(hapmap)

[1] "genotypes" "fam"       "map"

genotypes contains the genotype data
fam contains information on each individual
map contains information on each SNP

Explore genotype data

Get info about the genotype data:

hapmap$genotypes

A SnpMatrix with  165 rows and  1457897 columns
Row names:  NA06989 ... NA12865 
Col names:  rs2185539 ... rs1973881

How many individuals do we have in this dataset?

How many SNPs?

Explore sample information

Look at the info we have on each individual:

head(hapmap$fam)

        pedigree  member  father  mother sex affected
NA06989     1328 NA06989    <NA>    <NA>   2        2
NA11891     1377 NA11891    <NA>    <NA>   1        2
NA11843     1349 NA11843    <NA>    <NA>   1        1
NA12341     1330 NA12341    <NA>    <NA>   2        2
NA12739     1444 NA12739 NA12748 NA12749   1       NA
NA10850     1344 NA10850    <NA> NA12058   2       NA

What are each of these columns telling us? (Discuss at your table!)

Explore SNP information

We can also look at the info we have on each SNP:

head(hapmap$map)

           chromosome   snp.name cM position allele.1 allele.2
rs2185539           1  rs2185539 NA   556738        T        C
rs11510103          1 rs11510103 NA   557616        G        A
rs11240767          1 rs11240767 NA   718814        T        C
rs3131972           1  rs3131972 NA   742584        A        G
rs3131969           1  rs3131969 NA   744045        A        G
rs1048488           1  rs1048488 NA   750775        C        T

What are each of these columns telling us? (Discuss at your table!)

Calculate MAFs

One useful piece of information about these SNPs that the map data frame does not already contain is the minor allele frequency (MAF) — that is, the frequency of the minor allele in this dataset. Thankfully, the snpStats package provides some helpful code for quickly calculating the MAF for each SNP in the dataset:

maf <- col.summary(hapmap$genotypes)$MAF
head(maf)

[1] 0.00000000 0.00621118 0.00000000 0.15757576 0.13030303 0.15853659

Let’s add the minor allele frequencies to our map data frame. (This will come in handy later.)

# add new MAF variable 
hapmap$map <- hapmap$map %>%
  mutate(MAF = maf)

# look at SNP info again
head(hapmap$map)

           chromosome   snp.name cM position allele.1 allele.2        MAF
rs2185539           1  rs2185539 NA   556738        T        C 0.00000000
rs11510103          1 rs11510103 NA   557616        G        A 0.00621118
rs11240767          1 rs11240767 NA   718814        T        C 0.00000000
rs3131972           1  rs3131972 NA   742584        A        G 0.15757576
rs3131969           1  rs3131969 NA   744045        A        G 0.13030303
rs1048488           1  rs1048488 NA   750775        C        T 0.15853659

What do you notice about the minor allele frequency of the first and third SNP in our dataset (rs2185539 and rs11240767)? What does this mean?

Reformat data for analysis

The snptats package uses a unique format to store data. Currently, genotypes are coded as 01, 02, and 03 (with 00 representing missing values).

# look at first five rows/columns
hapmap$genotypes@.Data[1:5,1:5]

        rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989        03         03         03        02        02
NA11891        03         03         03        02        03
NA11843        03         03         03        03        03
NA12341        03         03         03        02        02
NA12739        03         03         03        03        03

Let’s convert this to the 0, 1, 2 (and NA) format that we’ve talked about in class. Run the code chunk below.

# what type of object is hapmap$genotypes
class(hapmap$genotypes)

[1] "SnpMatrix"
attr(,"package")
[1] "snpStats"

# convert from "SnpMatrix" to "numeric"
X <- as(hapmap$genotypes, "numeric")

# what type of object do we have now 
class(X)

[1] "matrix" "array"

# look at first five rows/columns to confirm conversion looks ok
X[1:5, 1:5]

        rs2185539 rs11510103 rs11240767 rs3131972 rs3131969
NA06989         2          2          2         1         1
NA11891         2          2          2         1         2
NA11843         2          2          2         2         2
NA12341         2          2          2         1         1
NA12739         2          2          2         2         2

You should now see a matrix of 0’s, 1’s, and 2’s.

(OPTIONAL) Data Wrangling Review. Now that we’ve converted the genotype data into a more familiar format, how might you use tidyverse data wrangling tools from COMP/STAT 112 or STAT 155 to calculate the minor allele frequency of each SNP?

# your code here

Part 2: Trait Simulation

Background

Simulation studies are a widely used technique in statistical research. The general idea is this…

We “simulate”, or artificially generate, data in a way where we know the “truth”. Then, we apply a statistical method to these simulated data to see how it performs. Since we created the data, we can compare the result we get from our statistical method to the “truth” and get a sense of how accurate the method is. This is a particularly useful framework when we want to evaluate a new method, understand how a method’s performance varies across different scenarios, or compare multiple methods against each other. It can also be useful in contexts, like statistical genetics, in which finding/working with “real” data can be challenging (due to data privacy concerns, logistical constraints, etc.).

Although the genotype data we are working with in this activity are real, the phenotype/trait (the “affected” column in the fam sample information) is not. Let’s create our own!

Picking a causal SNP

Let’s simulate a trait that depends on the SNP known as rs2476601. Here’s what we know about this SNP:

hapmap$map %>%
  filter(snp.name == 'rs2476601')

          chromosome  snp.name cM  position allele.1 allele.2       MAF
rs2476601          1 rs2476601 NA 114179091        A        G 0.1196319

We can learn more about this SNP by searching the dbSNP database: https://www.ncbi.nlm.nih.gov/snp/. Enter the rsID (the SNP’s name) into the dbSNP search bar and see what you can learn. Summarize your findings here.

Simulating a quantitative trait

Now, let’s create a quantitative trait y that depends on your genotype at this SNP plus some random noise. We’ll give this SNP an “effect size” of 2: for every additional copy of the minor allele that you carry, your trait value will increase by 2 units (plus or minus some random noise). I’ve set this up for you – run the code chunk below.

# simulate trait
set.seed(494) # set seed for reproducibility
n <- nrow(X) # calculate the number of individuals 
y <- 2 * X[,'rs2476601'] + rnorm(n, 0, 1) # y = 2 * x + e

# look at the first few simulated trait values
head(y)

  NA06989   NA11891   NA11843   NA12341   NA12739   NA10850 
0.2082684 3.6267809 3.3422648 1.1696396 3.9450827 4.8350451

Since this is a simulation, we know by design that our trait y depends on SNP rs2476601 and nothing else. We can use this information to check how well our methods are working. We’ll refer to rs2476601 as the causal SNP from here on out. (NOTE: causal, not casual)

Part 3: GWAS (Attempt #1)

In a real genetic study, we wouldn’t know the location of the true causal SNP(s) beforehand, so we’d need to run a genome-wide association study to see if we can figure out which variants are associated with the trait. As we’ve discussed in class, the method we’ll use for GWAS is marginal regression.

To start, let’s look at what happens when we run marginal regression on the first five SNPs in this dataset. Run this code chunk.

## fit models at first few SNPs
mod1 <- lm(y ~ X[,1])
mod2 <- lm(y ~ X[,2])
mod3 <- lm(y ~ X[,3])
mod4 <- lm(y ~ X[,4])
mod5 <- lm(y ~ X[,5])

## look at the model summaries
tidy(mod1)

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     3.50     0.100      35.0  2.13e-77
2 X[, 1]         NA       NA          NA   NA

tidy(mod2)

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    3.80      1.27      2.99  0.00319
2 X[, 2]        -0.137     0.636    -0.216 0.830

tidy(mod3)

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     3.50     0.100      35.0  2.13e-77
2 X[, 3]         NA       NA          NA   NA

tidy(mod4)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    4.15      0.346     12.0  4.12e-24
2 X[, 4]        -0.385     0.197     -1.95 5.27e- 2

tidy(mod5)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    3.75      0.376     9.97  1.58e-18
2 X[, 5]        -0.141     0.209    -0.676 5.00e- 1

Do you notice anything weird about the 1st and 3rd SNPs? What’s going on here?

Part 4: Quality Control

We’re not going to be able to detect an association between our trait and any of the SNPs that have a MAF of zero (also known as monomorphic SNPs), so let’s remove them. We can remove these SNPs from the map data frame with a filter statement: **fill in the blanks (___) below and then run this code chunk.**

# keep only those SNPs with MAF > 0
map.clean <- hapmap$map %>%
  filter(___)

Error in parse(text = input): <text>:3:11: unexpected input
2: map.clean <- hapmap$map %>%
3:   filter(__
             ^

How many SNPs are left after filtering? How many did we remove?

Now, let’s remove these monomorphic SNPs from our genotype matrix (X). Since we’re working with a matrix here instead of a data frame, we can’t use the select function. Here’s one way to remove the columns corresponding to the monomorphic SNPs from X: (run this)

# create vector of which SNPs have a MAF of 0
monomorphic <- which(maf == 0) 
head(monomorphic)

[1]  1  3 10 11 12 17

# remove columns in the monomorphic vector
X.clean <- X[,-monomorphic]

Confirm that the new “clean” genotype matrix has the correct number of rows and columns before you move on.

# your code here

Note

As we saw in our Journal Club 2 paper, there are various pre-processing/“data cleaning” steps that researchers perform prior to implementing a genome-wide association study. We refer to these steps as quality control (QC for short). If you’re curious what QC usually entails, take a look at the other files you downloaded from the GWAS tutorial GitHub repository (particularly 1_Main_script_QC_GWAS.txt) and/or skim this article:

To keep things simple for today, we’re going to restrict our QC to removing monomoprhic SNPs. But, be aware that there are additional QC steps that would normally be conducted before GWAS.

Part 5: GWAS (Attempt #2)

Chromosome 1

Even after removing the monomorphic SNPs, we still have 1283751 variants remaining. This might take awhile to analyze in R, so let’s focus on just the SNPs on the first chromosome to start.

Run the code chunk below to make a list of which SNPs live on chromosome 1:

chr1.snps <- which(map.clean$chromosome == 1)

Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'which': object 'map.clean' not found

head(chr1.snps)

Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'chr1.snps' not found

length(chr1.snps)

Error: object 'chr1.snps' not found

Now, we’re going to loop through each of the SNPs on chromosome 1, fitting a linear regression model at each SNP. For each model, we’ll record the estimates (betas), standard errors (ses), test statistics (tstats) and p-values (pvals) for the coefficient of interest (the slope). (run this)

# NOTE: this will take awhile to run
# set up empty vectors for storing results
betas <- c()
ses <- c()
tstats <- c()
pvals <- c()

# loop through chromosome 1 SNPs
for(i in chr1.snps){
  # print out occasional updates telling us what SNP we're analyzing
  if(i %% 10000 == 0) print(paste('Analyzing SNP', i)) 
  # fit model
  mod <- lm(y ~ X.clean[,i])
  # get coefficient information
  coefinfo <- tidy(mod)
  # record estimate, SE, test stat, and p-value
  betas[i] <- coefinfo$estimate[2]
  ses[i] <- coefinfo$std.error[2]
  tstats[i] <- coefinfo$statistic[2]
  pvals[i] <- coefinfo$p.value[2]
}

Error: object 'chr1.snps' not found

Let’s add our results to our map data frame that contains information about each SNP. (run this)

# start with the map info for the chr 1 SNPs
chr1.results <- map.clean %>%
  filter(chromosome == 1)

Error: object 'map.clean' not found

# then add betas, SEs, etc.
chr1.results <- chr1.results %>%
  mutate(Estimate = betas,
         Std.Error = ses,
         Test.Statistic = tstats,
         P.Value = pvals)

Error: object 'chr1.results' not found

# look at results
head(chr1.results)

Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'chr1.results' not found

Now we’re ready to plot our results! We’d like to get a visual of which SNPs have small p-values (suggesting a possible association between the SNP and the trait). To start, create a scatterplot of the p-value of each SNP versus its position along the chromosome.

# your code here

What do you think of this visualization? Is it an effective tool for distinguishing which SNPs have small p-values?

One way to improve this visualization is to apply a log transformation to the p-values. When we have lots of numbers squished into a small range (0 to 1, in this case), a log transformation can be a nice way to spread these out. Try plotting the log (base 10) of the p-values for each SNP instead.

# your code here

What do you notice now? Are the SNPs with small p-values the ones with small (more negative) values or large (closer to 0) values on the log base 10 scale?

A traditional Manhattan plot uses a negative log transformation instead. Try plotting the \(-log_{10}\) p-values now.

# your code here

What similarities do you notice between this plot and the Manhattan plots we’ve seen before? What differences do you notice? Discuss.

All Chromosomes (Genome-Wide)

Repeat the analysis above, but now looking at the SNPs on all chromosomes. Warning: this will take awhile to run!

Hint: The main thing you’ll need to change is which SNPs you’re looping over in your for loop.

# your code here

Part 6: Visualizing Results

Let’s use the qqman package to create a Manhattan plot to visualize our genome-wide association results.

Basic Manhattan Plot

We’ll use the manhattan function in this package to create a Manhattan plot. I’ve given you the basic structure of this command below. Fill in the blanks.

manhattan(___, # name of the dataset storing your results
          chr = "___", # name of the column in your dataset with the chromosome number
          bp = "___", # name of column storing the physical (base pair) position of the SNP
          p = "___", # name of the column storing the association test p-values
          snp = "___", # name of the column storing the SNP names (rs###)
          ylim = c(0, 10) # sets the y-axis limits
          )

Error in parse(text = input): <text>:1:12: unexpected input
1: manhattan(__
               ^

Do you notice any peaks on this Manhattan plot? Where are they located? Does that make sense, given what we know about the “truth” in this simulated scenario? Discuss.

Going Deeper

The plot we created above is pretty basic. I want you to customize this a bit further! Your task: explore the qqman documentation to learn more about how the manhattan function works and the ways in which you can customize your visualization.

When you’re learning how to use a new R package, one helpful resource can be the package vignettes. Run the code chunk below to see a list of all vignettes available for the qqman package:

# view list of vignettes
vignette(package = 'qqman')

To read a particular vignette, fill in the blanks in the code chunk below with the name of the vignette:

# view specific vignette
vignette("___", package = "qqman")

After reading this vignette, make at least one modification to the Manhattan plot code you ran above:

# your code here

Describe the change that you made and anything you learned from your updated visualization here:

(OPTIONAL): Take this customization challenge one step further. Try to recreate the Manhattan plot above using ggplot!

Part 7 (OPTIONAL): QQ Plots

We saw the “man” part of the qqman package above (creating manhattan plots). The “qq” part of the qqman package refers to its ability to create Q-Q plots, a common diagnostic tool for genome-wide association studies.

Explore the qqman documentation to learn how to create a QQ plot.

Then, make one using the results of our genome-wide analysis.