Lab 3: The multiple testing problem
STAT 494: Statistical Genetics
Goals
- Implement the Bonferroni and simulation-based multiple testing correction approaches
- Use simulation studies to assess the family-wise error rate of a specified significance threshold and to compare these two multiple testing correction approaches
- Explore the concept of linkage disequilibrium and how it affects the multiple testing problem
Getting Started
- Access the source code for this document (click the `</> Code` button above)
- Copy-paste into a new QMD
- Add your name as the `author` in the YAML header, above
- Install the `NatParksPalettes` package (for fun color palettes inspired by pictures of US and Canadian National Parks – read more here):

```r
install.packages('NatParksPalettes')
```

- Load the packages that we’ll need today:

```r
library(tidyverse)
library(broom)
library(snpStats)
library(NatParksPalettes)
```
Exercises
Part 1: Independent Tests
To begin, let’s explore the problem of conducting multiple independent hypothesis tests. In the context of genetic data, this might look like testing SNPs that are far away from each other and are thus inherited independently.
a) Data setup
To start, let’s generate a small dataset with 165 rows (i.e., 165 people) and 83 columns (i.e., 83 SNPs). (The reasons for these numbers will become apparent later.) For now, we’ll assume that all of these SNPs have the same minor allele frequency of 10% and we’ll generate each of these SNPs independently from the others. We’ll use similar code to generate these data as we did in Lab 1.
Run the code chunk below.
```r
# function to simulate one genetic variant
sim_one_variant <- function(n_ppl, MAF){
  snp <- rbinom(n = n_ppl, size = 2, p = MAF)
  return(snp)
}

# replicate 83 times to create 83 independent SNPs
# use a sample size of 165 and MAF of 10%
# set seed so we all get the same results
set.seed(494)
snps <- replicate(83, sim_one_variant(n_ppl = 165, MAF = 0.1))
```
Take a minute to explore the `snps` object and confirm it looks like what you were expecting: calculate the dimensions, type `View(snps)` in the Console to open the data in a new window, calculate the correlation between each pair of SNPs, etc. Pause and discuss.
# your code here
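If you'd like a starting point, a few commands along these lines can help confirm the structure of `snps` (an illustrative sketch, not the only reasonable set of checks):

```r
# a few quick checks on the simulated genotype matrix (illustrative)
dim(snps)             # expect 165 rows (people) and 83 columns (SNPs)
head(snps[, 1:5])     # peek at the first few SNPs
colMeans(snps) / 2    # empirical minor allele frequencies (should be near 0.1)
cor(snps)[1:5, 1:5]   # pairwise correlations (should be near 0 off the diagonal)
```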
YOUR TAKEAWAYS HERE
b) What’s wrong with 0.05?
Suppose we want to conduct a “genome-wide” association study, using marginal regression to separately test the association between each of these 83 SNPs and a trait. Before we work on determining which significance threshold we should use in this setting, let’s explore what happens when we use the “traditional” p-value threshold of 0.05.
Consider the scenario where the null hypothesis is universally true: none of the 83 SNPs are associated with our trait. If we use a p-value threshold of 0.05, how often will we incorrectly state that at least one of these SNPs is associated with the trait (even though we know, in reality, that’s not true)?
To investigate this question, let’s start by simulating a quantitative trait under the null hypothesis, that is, a trait that does not depend on any of the SNPs. Run this code chunk.
```r
set.seed(494)
y <- rnorm(n = 165, mean = 0, sd = 1) # simulate Y from a standard normal distribution
```
Next, implement GWAS to investigate the association between each SNP and the simulated null trait. Run this code chunk.
```r
# set up empty vector to store p-values
pvals <- c()

# loop through each of the 83 SNPs
for(i in 1:83){
  mod <- lm(y ~ snps[,i])           # fit model looking at SNP i
  pvals[i] <- tidy(mod)$p.value[2]  # record p-value
}
```
Look at the p-values that we got. Are any of them below 0.05? How many of them “should” be below 0.05? Pause and discuss.
# your code here
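One possible way to summarize the results (a sketch; your approach may differ): since each test has a 5% chance of falling below 0.05 under the null, we'd expect roughly 83 × 0.05 ≈ 4 of the p-values to do so by chance alone.

```r
# how many p-values dip below 0.05 in this single GWAS? (illustrative)
sum(pvals < 0.05)     # number of "significant" SNPs at the 0.05 threshold
which(pvals < 0.05)   # which SNPs they are
min(pvals)            # smallest p-value overall
```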
YOUR ANSWER HERE
Maybe we just got “unlucky” in our first simulation replicate. Let’s repeat this process many times and see how often we end up with at least one p-value below 0.05. To help us do this, let’s wrap our code from above into a function called `do_one_sim`. Run this code chunk.
```r
# create a custom function that puts all steps above into one function
do_one_sim <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)

  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }

  # check if any pvals are < 0.05
  any(pvals < 0.05)
}
```
```r
# try it once to confirm we get the same thing as above
set.seed(494)
do_one_sim()
```
[1] TRUE
Now, we can use the `replicate` function to run `do_one_sim` 500 times. Run this code chunk.

```r
# NOTE: the code chunk option cache: true is helpful here!
# repeat simulation 500 times
set.seed(494)
simres <- replicate(500, do_one_sim())
```
Across how many of these simulations did we end up with at least one p-value below the 0.05 threshold?
# your code here
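Since `do_one_sim()` returns TRUE or FALSE, `simres` is a logical vector, so one minimal way to answer this is:

```r
# count and proportion of simulations with at least one p-value below 0.05 (illustrative)
sum(simres)    # number of simulations with at least one p < 0.05
mean(simres)   # proportion of simulations, i.e., the empirical family-wise error rate
```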
YOUR ANSWER HERE
We just calculated the empirical family-wise error rate for our simulation study. How does this compare to the mathematical result that we derived together in class? Collectively, what do these results suggest about the appropriateness of using a p-value threshold of 0.05 in this setting? Pause and discuss.
YOUR ANSWER HERE
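For reference, if the result derived in class was the standard formula for m independent tests, FWER = 1 − (1 − alpha)^m, you could compute it directly and compare it to `mean(simres)`. (This formula is an assumption about what was covered in class; adapt as needed.)

```r
# theoretical FWER for 83 independent tests at alpha = 0.05
# (assumes the independence-based formula from class)
1 - (1 - 0.05)^83   # roughly 0.986
```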
c) Bonferroni correction
Hopefully it’s clear from the last section that a p-value threshold of 0.05 is NOT appropriate in this setting. What should we use instead? One approach we can use to control our family-wise error rate (FWER) at a more reasonable level is the Bonferroni correction.
According to this approach, what significance threshold should we use in this setting if we want to control the FWER at 0.05? Pause and discuss.
YOUR ANSWER HERE
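As a quick check on your answer, recall that the Bonferroni correction divides the desired FWER by the number of tests; a one-line sketch:

```r
# Bonferroni-corrected significance threshold for 83 tests (target FWER = 0.05)
0.05 / 83   # approximately 6.0e-04
```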
d) Simulation-based approach
Another way to determine an appropriate significance threshold is to use a simulation-based approach like the one they used in Pulit et al., 2016. If we want to control the family-wise error rate at 5%, then the steps of this process look like this:
- Simulate a null trait (i.e., a trait that is not associated with any of the SNPs)
- Run GWAS to test the association between this simulated null trait and each of the SNPs in our dataset.
- Record the smallest p-value from this GWAS.
- Repeat steps 1–3 many (e.g., 100, 1000) times.
- Look at the p-values you saved from those simulation replicates. Sort them from smallest to largest and find the lowest 5th percentile (i.e., the point at which 5% of the p-values are smaller than that number). This is your significance threshold!
Let’s implement this simulation-based multiple testing correction approach now. We’ll use our code from earlier as a starting point, but instead of checking if any SNPs are smaller than 0.05 for each simulation, we’ll record the smallest one. Run this code chunk.
```r
# write a function to do one simulation replicate
# (should look a lot like our last simulation)
do_one_rep <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)

  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }

  # record the smallest p-value
  min(pvals)
}

# then repeat many times
set.seed(494)
reps <- replicate(500, do_one_rep())

# then use the quantile function to find the lower 5th percentile
quantile(reps, probs = c(0.05))
```
5%
0.0005228908
Based on this approach, what significance threshold should we use? How does this compare to the Bonferroni correction? Pause and discuss.
YOUR ANSWER HERE
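If it helps, one way to put the two thresholds side by side (an illustrative snippet using objects created above) is:

```r
# compare the simulation-based and Bonferroni thresholds (illustrative)
quantile(reps, probs = 0.05)   # simulation-based threshold (about 0.00052 above)
0.05 / 83                      # Bonferroni threshold (about 0.00060)
```

With independent SNPs, the two thresholds should be in the same ballpark.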
e) Comparison
Let’s convince ourselves that the Bonferroni correction and simulation-based approach are both “working” here. How might we use a simulation study (perhaps similar to one we did earlier) to assess whether the significance thresholds given by these two approaches are actually controlling the FWER at our desired 5% level? Pause and discuss.
YOUR ANSWER HERE
Implement this simulation study and report what you find! Hint: borrow heavily from the code in Part 1b, above.
# your code here
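If you'd like a scaffold, here is one possible sketch, assuming we reuse the simulation setup from Part 1b and simply swap in each candidate threshold; the helper name `check_fwer` and the choice of 500 replicates are illustrative, not required:

```r
# estimate the empirical FWER for a given significance threshold (illustrative sketch)
check_fwer <- function(threshold, n_reps = 500){
  one_sim <- function(){
    y <- rnorm(n = 165, mean = 0, sd = 1)   # simulate a null trait
    pvals <- c()
    for(i in 1:83){
      mod <- lm(y ~ snps[,i])               # GWAS: test SNP i
      pvals[i] <- tidy(mod)$p.value[2]
    }
    any(pvals < threshold)                  # any false positives at this threshold?
  }
  mean(replicate(n_reps, one_sim()))        # proportion of sims with >= 1 false positive
}

set.seed(494)
check_fwer(threshold = 0.05 / 83)                      # Bonferroni threshold
set.seed(494)
check_fwer(threshold = quantile(reps, probs = 0.05))   # simulation-based threshold
```

If the corrections are working as intended, both estimated family-wise error rates should land close to (or a bit below) the target of 0.05.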