Lab 3: The multiple testing problem

STAT 494: Statistical Genetics

Published

February 13, 2025

Goals

Implement the Bonferroni and simulation-based multiple testing correction approaches
Use simulation studies to assess the family-wise error rate of a specified significance threshold and to compare these two multiple testing correction approaches
Explore the concept of linkage disequilibrium and how it affects the multiple testing problem

Getting Started

Access the source code for this document (click the </> Code button above)
Copy-paste into a new QMD
Add your name as the author in the YAML header, above
Install the NatParksPalettes package (for fun color palettes inspired by pictures of US and Canadian National Parks – read more here):

install.packages('NatParksPalettes')

Load the packages that we’ll need today:

library(tidyverse)
library(broom)
library(snpStats)
library(NatParksPalettes)

Exercises

Part 1: Independent Tests

To begin, let’s explore the problem of conducting multiple independent hypothesis tests. In the context of genetic data, this might look like testing SNPs that are far away from each other and are thus inherited independently.

a) Data setup

To start, let’s generate a small dataset with 165 rows (i.e., 165 people) and 83 columns (i.e., 83 SNPs). (The reasons for these numbers will become apparent later.) For now, we’ll assume that all of these SNPs have the same minor allele frequency of 10% and we’ll generate each of these SNPs independently from the others. We’ll use similar code to generate these data as we did in Lab 1.

Run the code chunk below.

# function to simulate one genetic variant
sim_one_variant <- function(n_ppl, MAF){
  snp <- rbinom(n = n_ppl, size = 2, p = MAF)
  return(snp)
}

# replicate 83 times to create 83 independent SNPs
# use a sample size of 165 and MAF of 10%
# set seed so we all get the same results
set.seed(494)
snps <- replicate(83, sim_one_variant(n_ppl = 165, MAF = 0.1))

Take a minute to explore the snps object and confirm it looks like what you were expecting: calculate the dimensions, type View(snps) in the Console to open the data in a new window, calculate the correlation between each pair of SNPs, etc. Pause and discuss.

# your code here

b) What’s wrong with 0.05?

Suppose we want to conduct a “genome-wide” association study, using marginal regression to separately test the association between each of these 83 SNPs and a trait. Before we work on determining which significance threshold we should use in this setting, let’s explore what happens when we use the “traditional” p-value threshold of 0.05.

Consider the scenario where the null hypothesis is universally true: none of the 83 SNPs are associated with our trait. If we use a p-value threshold of 0.05, how often will we incorrectly state that at least one of these SNPs is associated with the trait (even though we know, in reality, that’s not true)?

To investigate this question, let’s start by simulating a quantitative trait under the null hypothesis, that is, a trait that does not depend on any of the SNPs. Run this code chunk.

set.seed(494) 
y <- rnorm(n = 165, mean = 0, sd = 1) # simulate Y from a standard normal distribution

Next, implement GWAS to investigate the association between each SNP and the simulated null trait. Run this code chunk.

# set up empty vector to store p-values
pvals <- c()

# loop through each of the 83 SNPs
for(i in 1:83){
  mod <- lm(y ~ snps[,i]) # fit model looking at SNP i
  pvals[i] <- tidy(mod)$p.value[2] # record p-value
}

Look at the p-values that we got. Are any of them below 0.05? How many of them “should” be below 0.05? Pause and discuss.

# your code here

Maybe we just got “unlucky” in our first simulation replicate. Let’s repeat this process many times and see how often we end up with at least one p-value being below 0.05. Let’s wrap our code, above, into a function called do_one_sim to help us do this. Run this code chunk.

# create a custom function that puts all steps above into one function
do_one_sim <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # check if any pvals are < 0.05
  any(pvals < 0.05)
}

# try it once to confirm we get the same thing as above
set.seed(494)
do_one_sim()

[1] TRUE

Now, we can use the replicate function to run do_one_sim 500 times. Run this code chunk.

# NOTE: the code chunk option cache: true is helpful here!
# repeat simulation 500 times
set.seed(494)
simres <- replicate(500, do_one_sim())

Across how many of these simulations did we end up with at least one p-value below the 0.05 threshold?

# your code here

We just calculated the empirical family-wise error rate for our simulation study. How does this compare to the mathematical result that we derived together in class? Collectively, what do these results suggest about the appropriateness of using a p-value threshold of 0.05 in this setting? Pause and discuss.

c) Bonferroni correction

Hopefully it’s clear from the last section that a p-value threshold of 0.05 is NOT appropriate in this setting. What should we use instead? One approach we can use to control our family-wise error rate (FWER) at a more reasonable level is the Bonferroni correction.

According to this approach, what significance threshold should we use in this setting if we want to control the FWER at 0.05? Pause and discuss.

d) Simulation-based approach

Another way to determine an appropriate significance threshold is to use a simulation-based approach like the one they used in Pulit et al., 2016. If we want to control the family-wise error rate at 5%, then the steps of this process look like this:

Simulate a null trait (i.e., a trait that is not associated with any of the SNPs)
Run GWAS to test the association between this simulated null trait and each of the SNPs in our dataset.
Record the smallest p-value from this GWAS.
Repeat steps 1–3 many (e.g., 100, 1000) times.
Look at the p-values you saved from those simulation replicates. Sort them from smallest to largest and find the lowest 5th percentile (i.e., the point at which 5% of the p-values are smaller than that number). This is your significance threshold!

Let’s implement this simulation-based multiple testing correction approach now. We’ll use our code from earlier as a starting point, but instead of checking if any SNPs are smaller than 0.05 for each simulation, we’ll record the smallest one. Run this code chunk.

# write a function to do one simulation replicate 
# (should look a lot like our last simulation)
do_one_rep <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # record the smallest p-value
  min(pvals)
}


# then repeat many times
set.seed(494) 
reps <- replicate(500, do_one_rep())


# then use the quantile function to find the lower 5th percentile
quantile(reps, probs = c(0.05))

          5% 
0.0005228908

Based on this approach, what significance threshold should we use? How does this compare to the Bonferroni correction? Pause and discuss.

e) Comparison

Let’s convince ourselves that the Bonferroni correction and simulation-based approach are both “working” here. How might we use a simulation study (perhaps similar to one we did earlier) to assess whether the significance thresholds given by these two approaches are actually controlling the FWER at our desired 5% level? Pause and discuss.

Implement this simulation study and report what you find! Hint: borrow heavily from the code in Part 1b, above.

# your code here

Part 2: Correlated Tests

In the previous example, we assumed that our 83 SNPs were independent of one another, but this is not a realistic assumption in the case of a typical GWAS. In reality, nearby SNPs are correlated with one another due to the processes involved in the inheritance of DNA. How does this impact the multiple testing problem?

a) HapMap data setup

Let’s return to the HapMap data we explored in Lab 2. To make the size of the dataset more manageable for us today, we’ll just look at the first 100 SNPs.

First, update the file paths below to reflect the way you’ve stored these data on your computer:

# paths to PLINK files
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'

Then, read in the files, using the select.snps argument to select only the first 100. Run this code chunk.

# read in first 100 SNPs
hapmap <- read.plink(bed, bim, fam, select.snps = 1:100)

# confirm we have 100 SNPs only
hapmap$genotypes

A SnpMatrix with  165 rows and  100 columns
Row names:  NA06989 ... NA12865 
Col names:  rs2185539 ... rs12757754

b) Linkage disequilibrium

There is a special term that we use in genetics to talk about the correlation between SNPs: linkage disequilibrium, or LD for short.

The snpStats package has a function called ld that will calculate LD for us. Run this code chunk.

# calculate LD
hapmap.ld <- ld(hapmap$genotypes, depth = 99, stats = "R.squared", symmetric = TRUE)

Let’s investigate and then plot this LD matrix. What do you notice? Pause and discuss.

# look at the first 5-by-5 elements:
hapmap.ld[1:5, 1:5]

5 x 5 sparse Matrix of class "dsCMatrix"
           rs2185539  rs11510103 rs11240767   rs3131972   rs3131969
rs2185539          .          NA         NA          NA          NA
rs11510103        NA .                   NA 0.005662224 0.007526042
rs11240767        NA          NA          .          NA          NA
rs3131972         NA 0.005662224         NA .           0.800991691
rs3131969         NA 0.007526042         NA 0.800991691 .

# plot LD (grey scale)
image(hapmap.ld)

# plot LD (fun color palette)
color.pal <- natparks.pals("Acadia", 10)
image(hapmap.ld, lwd = 0, cuts = 9, col.regions = color.pal, colorkey = TRUE)

c) Quality control

As you hopefully noticed in the last section, nearby SNPs are often highly correlated with one another. This means that our hypothesis tests at nearby SNPs will also be highly correlated. How does this impact our choice of significance threshold? Let’s try implementing the simulation-based multiple testing approach on this dataset to find out.

First, we’ll repeat some of our data pre-processing / “quality control” steps from Lab 2:

convert our genotype data into a matrix of 0’s, 1’s, and 2’s instead of the snpStats format it’s currently in
remove the monomorphic SNPs

Run this code chunk.

# convert to matrix
hapmap.geno <- as(hapmap$genotypes, "numeric")

# calculate MAF
maf <- col.summary(hapmap$genotypes)$MAF

# find monomorphic SNPs
monomorphic <- which(maf == 0) 

# remove monomorphic SNPs from genotype matrix
hapmap.geno <- hapmap.geno[,-monomorphic]

# check the dimensions after filtering
dim(hapmap.geno)

[1] 165  83

How many SNPs remain after filtering? Pause and discuss.

d) Simulation-based approach

Now, modify your code from Part 1 to run the simulation-based multiple testing correction approach on the HapMap data. Remember to update the number of individuals and the number of SNPs accordingly.

# write a function to do one simulation replicate

# then repeat many times

# then use the quantile function to find the lower 5th percentile

What threshold did you get? How does this compare to the Bonferroni correction? Pause and discuss.

Reflect on the steps of the simulation-based approach. Why does this approach work? Pause and discuss.

--- title: "Lab 3: The multiple testing problem" subtitle: "STAT 494: Statistical Genetics" image: "../images/hapmap_ld.png" date: "February 13, 2025" author: "" format: html: toc: true toc-depth: 3 embed-resources: true code-tools: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, error = TRUE) ``` \ \ # Goals - Implement the Bonferroni and simulation-based multiple testing correction approaches - Use simulation studies to assess the family-wise error rate of a specified significance threshold and to compare these two multiple testing correction approaches - Explore the concept of *linkage disequilibrium* and how it affects the multiple testing problem \ \ # Getting Started 1. Access the source code for this document (click the `</> Code` button above) 2. Copy-paste into a new QMD 3. Add your name as the `author` in the YAML header, above 4. Install the `NatParksPalettes` package (for fun color palettes inspired by pictures of US and Canadian National Parks -- read more [here](https://github.com/kevinsblake/natparkspalettes)): ```{r} #| eval: false install.packages('NatParksPalettes') ``` 5. Load the packages that we'll need today: ```{r} #| message: false library(tidyverse) library(broom) library(snpStats) library(NatParksPalettes) ``` \ \ # Exercises ## Part 1: Independent Tests To begin, let's explore the problem of conducting multiple *independent* hypothesis tests. In the context of genetic data, this might look like testing SNPs that are far away from each other and are thus inherited independently. ### a) Data setup To start, let's generate a small dataset with 165 rows (i.e., 165 people) and 83 columns (i.e., 83 SNPs). (The reasons for these numbers will become apparent later.) For now, we'll assume that all of these SNPs have the same minor allele frequency of 10% and we'll generate each of these SNPs independently from the others. We'll use similar code to generate these data as we did in Lab 1. **Run the code chunk below.** ```{r simulate-data} # function to simulate one genetic variant sim_one_variant <- function(n_ppl, MAF){ snp <- rbinom(n = n_ppl, size = 2, p = MAF) return(snp) } # replicate 83 times to create 83 independent SNPs # use a sample size of 165 and MAF of 10% # set seed so we all get the same results set.seed(494) snps <- replicate(83, sim_one_variant(n_ppl = 165, MAF = 0.1)) ``` \ Take a minute to explore the `snps` object and confirm it looks like what you were expecting: calculate the dimensions, type `View(snps)` in the Console to open the data in a new window, calculate the correlation between each pair of SNPs, etc. **Pause and discuss.** ```{r explore-data} # your code here ``` > YOUR TAKEAWAYS HERE \ \ ### b) What's wrong with 0.05? Suppose we want to conduct a "genome-wide" association study, using marginal regression to separately test the association between each of these 83 SNPs and a trait. Before we work on determining which significance threshold we *should* use in this setting, let's explore what happens when we use the "traditional" p-value threshold of 0.05. Consider the scenario where the null hypothesis is universally true: none of the 83 SNPs are associated with our trait. If we use a p-value threshold of 0.05, how often will we incorrectly state that at least one of these SNPs is associated with the trait (even though we know, in reality, that's not true)? \ To investigate this question, let's start by simulating a quantitative trait under the *null hypothesis*, that is, a trait that does not depend on any of the SNPs. **Run this code chunk.** ```{r simulate-trait} set.seed(494) y <- rnorm(n = 165, mean = 0, sd = 1) # simulate Y from a standard normal distribution ``` \ Next, implement GWAS to investigate the association between each SNP and the simulated null trait. **Run this code chunk.** ```{r implement-GWAS} # set up empty vector to store p-values pvals <- c() # loop through each of the 83 SNPs for(i in 1:83){ mod <- lm(y ~ snps[,i]) # fit model looking at SNP i pvals[i] <- tidy(mod)$p.value[2] # record p-value } ``` Look at the p-values that we got. Are any of them below 0.05? How many of them "should" be below 0.05? **Pause and discuss.** ```{r check-pvals} # your code here ``` > YOUR ANSWER HERE \ \ Maybe we just got "unlucky" in our first simulation replicate. Let's repeat this process many times and see how often we end up with at least one p-value being below 0.05. Let's wrap our code, above, into a function called `do_one_sim` to help us do this. **Run this code chunk.** ```{r helper-simulation-function} # create a custom function that puts all steps above into one function do_one_sim <- function(){ # simulate null trait y <- rnorm(n = 165, mean = 0, sd = 1) # implement GWAS pvals <- c() for(i in 1:83){ mod <- lm(y ~ snps[,i]) pvals[i] <- tidy(mod)$p.value[2] } # check if any pvals are < 0.05 any(pvals < 0.05) } # try it once to confirm we get the same thing as above set.seed(494) do_one_sim() ``` Now, we can use the `replicate` function to run `do_one_sim` 500 times. **Run this code chunk.** ```{r run-simulation} #| cache: true # NOTE: the code chunk option cache: true is helpful here! # repeat simulation 500 times set.seed(494) simres <- replicate(500, do_one_sim()) ``` Across how many of these simulations did we end up with at least one p-value below the 0.05 threshold? ```{r check-simulation-results} # your code here ``` > YOUR ANSWER HERE \ We just calculated the empirical *family-wise error rate* for our simulation study. How does this compare to the mathematical result that we derived together in class? Collectively, what do these results suggest about the appropriateness of using a p-value threshold of 0.05 in this setting? **Pause and discuss.** > YOUR ANSWER HERE \ \ ### c) Bonferroni correction Hopefully it's clear from the last section that a p-value threshold of 0.05 is NOT appropriate in this setting. What should we use instead? One approach we can use to control our family-wise error rate (FWER) at a more reasonable level is the *Bonferroni correction*. According to this approach, what significance threshold should we use in this setting if we want to control the FWER at 0.05? **Pause and discuss.** > YOUR ANSWER HERE \ \ ### d) Simulation-based approach Another way to determine an appropriate significance threshold is to use a simulation-based approach like the one they used in [Pulit et al., 2016](https://onlinelibrary.wiley.com/doi/10.1002/gepi.22032). If we want to control the family-wise error rate at 5%, then the steps of this process look like this: 1. Simulate a null trait (i.e., a trait that is not associated with any of the SNPs) 2. Run GWAS to test the association between this simulated null trait and each of the SNPs in our dataset. 3. Record the smallest p-value from this GWAS. 4. Repeat steps 1--3 many (e.g., 100, 1000) times. 5. Look at the p-values you saved from those simulation replicates. Sort them from smallest to largest and find the lowest 5th percentile (i.e., the point at which 5% of the p-values are smaller than that number). This is your significance threshold! \ Let's implement this simulation-based multiple testing correction approach now. We'll use our code from earlier as a starting point, but instead of checking if any SNPs are smaller than 0.05 for each simulation, we'll record the smallest one. **Run this code chunk.** ```{r implement-sim-approach} #| cache: true # write a function to do one simulation replicate # (should look a lot like our last simulation) do_one_rep <- function(){ # simulate null trait y <- rnorm(n = 165, mean = 0, sd = 1) # implement GWAS pvals <- c() for(i in 1:83){ mod <- lm(y ~ snps[,i]) pvals[i] <- tidy(mod)$p.value[2] } # record the smallest p-value min(pvals) } # then repeat many times set.seed(494) reps <- replicate(500, do_one_rep()) # then use the quantile function to find the lower 5th percentile quantile(reps, probs = c(0.05)) ``` Based on this approach, what significance threshold should we use? How does this compare to the Bonferroni correction? **Pause and discuss.** > YOUR ANSWER HERE \ \ ### e) Comparison Let's convince ourselves that the Bonferroni correction and simulation-based approach are both "working" here. How might we use a simulation study (perhaps similar to one we did earlier) to assess whether the significance thresholds given by these two approaches are actually controlling the FWER at our desired 5% level? **Pause and discuss.** > YOUR ANSWER HERE \ Implement this simulation study and report what you find! *Hint: borrow heavily from the code in Part 1b, above.* ```{r} #| cache: true # your code here ``` \ \ \ ## Part 2: Correlated Tests In the previous example, we assumed that our 83 SNPs were independent of one another, but this is not a realistic assumption in the case of a typical GWAS. In reality, **nearby SNPs are correlated with one another** due to the processes involved in the inheritance of DNA. How does this impact the multiple testing problem? \ ### a) HapMap data setup Let's return to the HapMap data we explored in Lab 2. To make the size of the dataset more manageable for us today, we'll just look at the first 100 SNPs. First, **update the file paths** below to reflect the way you've stored these data on your computer: ```{r file-paths} # paths to PLINK files bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed' bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim' fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam' ``` Then, read in the files, using the `select.snps` argument to select only the first 100. **Run this code chunk.** ```{r read-hapmap-data} # read in first 100 SNPs hapmap <- read.plink(bed, bim, fam, select.snps = 1:100) # confirm we have 100 SNPs only hapmap$genotypes ``` \ \ ### b) Linkage disequilibrium There is a special term that we use in genetics to talk about the correlation between SNPs: *linkage disequilibrium*, or *LD* for short. The `snpStats` package has a function called `ld` that will calculate LD for us. **Run this code chunk.** ```{r calculate-hapmap-ld} # calculate LD hapmap.ld <- ld(hapmap$genotypes, depth = 99, stats = "R.squared", symmetric = TRUE) ``` \ Let's investigate and then plot this LD matrix. What do you notice? **Pause and discuss.** ```{r plot-hapmap-ld} # look at the first 5-by-5 elements: hapmap.ld[1:5, 1:5] # plot LD (grey scale) image(hapmap.ld) # plot LD (fun color palette) color.pal <- natparks.pals("Acadia", 10) image(hapmap.ld, lwd = 0, cuts = 9, col.regions = color.pal, colorkey = TRUE) ``` > YOUR TAKEAWAYS HERE \ \ ### c) Quality control As you hopefully noticed in the last section, nearby SNPs are often highly correlated with one another. This means that our hypothesis tests at nearby SNPs will also be highly correlated. How does this impact our choice of significance threshold? Let's try implementing the simulation-based multiple testing approach on this dataset to find out. First, we'll repeat some of our data pre-processing / "quality control" steps from Lab 2: - convert our genotype data into a matrix of 0's, 1's, and 2's instead of the `snpStats` format it's currently in - remove the monomorphic SNPs **Run this code chunk.** ```{r hapmap-data-cleaning} # convert to matrix hapmap.geno <- as(hapmap$genotypes, "numeric") # calculate MAF maf <- col.summary(hapmap$genotypes)$MAF # find monomorphic SNPs monomorphic <- which(maf == 0) # remove monomorphic SNPs from genotype matrix hapmap.geno <- hapmap.geno[,-monomorphic] # check the dimensions after filtering dim(hapmap.geno) ``` How many SNPs remain after filtering? **Pause and discuss.** > YOUR ANSWER HERE \ \ ### d) Simulation-based approach Now, modify your code from Part 1 to run the simulation-based multiple testing correction approach on the HapMap data. Remember to update the number of individuals and the number of SNPs accordingly. ```{r get-hapmap-threshold} # write a function to do one simulation replicate # then repeat many times # then use the quantile function to find the lower 5th percentile ``` What threshold did you get? How does this compare to the Bonferroni correction? **Pause and discuss.** > YOUR ANSWER HERE \ Reflect on the steps of the simulation-based approach. *Why* does this approach work? **Pause and discuss.** > YOUR ANSWER HERE \ \