Lab 3: The multiple testing problem
STAT 494: Statistical Genetics
Goals
- Implement the Bonferroni and simulation-based multiple testing correction approaches
- Use simulation studies to assess the family-wise error rate of a specified significance threshold and to compare these two multiple testing correction approaches
- Explore the concept of linkage disequilibrium and how it affects the multiple testing problem
Getting Started
- Access the source code for this document (click the </> Code button above)
- Copy-paste into a new QMD
- Add your name as the author in the YAML header, above
- Install the NatParksPalettes package (for fun color palettes inspired by pictures of US and Canadian National Parks – read more here):
install.packages('NatParksPalettes')
- Load the packages that we’ll need today:
library(tidyverse) # for data wrangling tools
library(broom) # for tidy model output
library(snpStats) # for reading in PLINK data
library(NatParksPalettes) # for fun color palettes
Exercises
Part 1: Independent Tests
To begin, let’s explore the problem of conducting multiple independent hypothesis tests. In the context of genetic data, this might look like testing SNPs that are far away from each other and are thus inherited independently.
a) Data setup
To start, let’s generate a small dataset with 165 rows (i.e., 165 people) and 83 columns (i.e., 83 SNPs). (The reasons for these numbers will become apparent later.) For now, we’ll assume that all of these SNPs have the same minor allele frequency of 10% and we’ll generate each of these SNPs independently from the others. We’ll use similar code to generate these data as we did in Lab 1.
Run the code chunk below.
# function to simulate one genetic variant
sim_one_variant <- function(n_ppl, MAF){
  snp <- rbinom(n = n_ppl, size = 2, p = MAF)
  return(snp)
}
# replicate 83 times to create 83 independent SNPs
# use a sample size of 165 and MAF of 10%
# set seed so we all get the same results
set.seed(494)
snps <- replicate(83, sim_one_variant(n_ppl = 165, MAF = 0.1))
Take a minute to explore the snps
object and confirm it looks like what you were expecting: calculate the dimensions, type View(snps)
in the Console to open the data in a new window, calculate the correlation between each pair of SNPs, etc. Pause and discuss.
# check dimensions
dim(snps)
[1] 165 83
#View(snps)
# calculate correlation
corr.snps <- cor(snps)
# plot correlation
image(corr.snps) # heat map
plot(corr.snps) # another option
# look at what's on the off-diagonals
summary(corr.snps[upper.tri(corr.snps)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.2039251 -0.0564661 -0.0025264 0.0006211 0.0508059 0.3552988
We’ve generated a matrix of 0’s, 1’s, and 2’s with 165 rows and 83 columns.
Looking at the correlation between columns in this matrix, we see that there are a few columns that are moderately correlated with one another (correlation ~ 0.3), but for the most part the pairwise correlations are close to zero. This is what we’d expect to see when the columns are generated independently, as they were here.
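If you'd like another view of this, a quick histogram of the off-diagonal entries of corr.snps (reusing the objects created above) makes the "centered near zero" pattern easy to see; this is just one optional way to visualize it:
# histogram of the pairwise (off-diagonal) SNP correlations
hist(corr.snps[upper.tri(corr.snps)],
     breaks = 30,
     xlab = "pairwise correlation",
     main = "Correlations between independently simulated SNPs")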
b) What’s wrong with 0.05?
Suppose we want to conduct a “genome-wide” association study, using marginal regression to separately test the association between each of these 83 SNPs and a trait. Before we work on determining which significance threshold we should use in this setting, let’s explore what happens when we use the “traditional” p-value threshold of 0.05.
Consider the scenario where the null hypothesis is universally true: none of the 83 SNPs are associated with our trait. If we use a p-value threshold of 0.05, how often will we incorrectly state that at least one of these SNPs is associated with the trait (even though we know, in reality, that’s not true)?
To investigate this question, let’s start by simulating a quantitative trait under the null hypothesis, that is, a trait that does not depend on any of the SNPs. Run this code chunk.
# simulate y from a standard normal distribution
# NOTE: y does not depend on genotypes in any way (hence is "null")
set.seed(494)
y <- rnorm(n = 165, mean = 0, sd = 1)
Next, implement GWAS to investigate the association between each SNP and the simulated null trait. Run this code chunk.
# set up empty vector to store p-values
pvals <- c()

# loop through each of the 83 SNPs
for(i in 1:83){
  mod <- lm(y ~ snps[,i]) # fit model looking at SNP i
  pvals[i] <- tidy(mod)$p.value[2] # record p-value
}
Look at the p-values that we got. Are any of them below 0.05? How many of them “should” be below 0.05? Pause and discuss.
# check if any p-values are below 0.05
any(pvals < 0.05)
[1] TRUE
# check how many p-values are below 0.05
sum(pvals < 0.05)
[1] 4
# what proportion does this equate to?
mean(pvals < 0.05)
[1] 0.04819277
Ideally, we would hope not to see any p-values below 0.05 since we know the truth is that none of our 83 SNPs are associated with the trait. However, we ended up with 4 SNPs (out of 83) with a p-value smaller than 0.05, i.e., 4 SNPs for which we incorrectly reject the null hypothesis. Are we comfortable with this?
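If you're curious which SNPs these were, one option (using the pvals vector created above) is to pull out their column indices:
# identify which of the 83 SNPs have p-values below 0.05
which(pvals < 0.05)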
Maybe we just got “unlucky” in our first simulation replicate. Let’s repeat this process many times and see how often we end up with at least one p-value being below 0.05. Let’s wrap our code, above, into a function called do_one_sim
to help us do this. Run this code chunk.
# create a custom function that puts all steps above into one function
do_one_sim <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # check if any pvals are < 0.05
  any(pvals < 0.05)
}
# try it once to confirm we get the same thing as above
set.seed(494)
do_one_sim()
[1] TRUE
Now, we can use the replicate
function to run do_one_sim
500 times. Run this code chunk.
# NOTE: the cache code chunk option is helpful here!
# repeat simulation 500 times
set.seed(494)
simres <- replicate(500, do_one_sim())
Across how many of these simulations did we end up with at least one p-value below the 0.05 threshold?
# look at simulation results
head(simres)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
# how many were TRUE?
sum(simres)
[1] 493
# what proportion were TRUE?
mean(simres)
[1] 0.986
Across our 500 simulations, we had at least one p-value below the 0.05 threshold (i.e., at least one type I error) 98.6 percent of the time.
We just calculated the empirical family-wise error rate for our simulation study. How does this compare to the mathematical result that we derived together in class? Collectively, what do these results suggest about the appropriateness of using a p-value threshold of 0.05 in this setting? Pause and discuss.
Theoretically, the family-wise error rate (the probability of having at least one type I error) should be
\[ \begin{aligned} FWER &= P(\text{at least one T1E}) \\ & = 1 - P(\text{no T1E on any tests}) \\ & = 1 - P(\text{no T1E on test 1 \&} \dots \text{\& no T1E on test 83}) \\ & = 1 - \left[P(\text{no T1E on test 1}) \times \dots \times P(\text{no T1E on test 83})\right], \text{ since the tests are independent} \\ & = 1 - [1 - P(\text{T1E on test 1})] \times \dots \times [1 - P(\text{T1E on test 83})] \\ & = 1 - [1 - 0.05] \times \dots \times [1 - 0.05] \\ & = 1 - [1 - 0.05]^{83} \\ & = 0.9858401 \end{aligned} \]
This is incredibly close to the family-wise error rate that we observed in our simulations!
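As a quick sanity check, we can reproduce this calculation in R:
# theoretical FWER for 83 independent tests at the 0.05 level
1 - (1 - 0.05)^83
[1] 0.9858401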
Collectively, these results suggest that if we use a p-value threshold of 0.05, our family-wise error rate will be very high when we conduct this many tests. In other words, if we conduct 83 hypothesis tests using a significance threshold of 0.05, there is a very high chance that we will incorrectly reject the null hypothesis (i.e., incorrectly conclude that a SNP is associated with the trait, even when it isn't) on at least one of these tests! A stricter (smaller) threshold may be more appropriate.
c) Bonferroni correction
Hopefully it’s clear from the last section that a p-value threshold of 0.05 is NOT appropriate in this setting. What should we use instead? One approach we can use to control our family-wise error rate (FWER) at a more reasonable level is the Bonferroni correction.
According to this approach, what significance threshold should we use in this setting if we want to control the FWER at 0.05? Pause and discuss.
The Bonferroni correction approach says to set our significance threshold equal to our desired FWER divided by the number of tests we are conducting. In this case:
\[ \begin{aligned} \frac{\text{FWER}}{\text{no. tests}} & = \frac{0.05}{83} \\ & \approx 6\times 10^{-4} \end{aligned} \]
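For reference, here is the same calculation in R. As an optional aside (not something this lab requires), base R's p.adjust() function with method = "bonferroni" offers an equivalent way to apply this correction by inflating the p-values instead of shrinking the threshold:
# Bonferroni-corrected significance threshold
0.05 / 83
[1] 0.0006024096
# equivalent: adjust the p-values themselves and compare to 0.05
# (uses the pvals vector from Part 1b)
# any(p.adjust(pvals, method = "bonferroni") < 0.05)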
d) Simulation-based approach
Another way to determine an appropriate significance threshold is to use a simulation-based approach like the one used in Pulit et al. (2016). If we want to control the family-wise error rate at 5%, then the steps of this process look like this:
- Simulate a null trait (i.e., a trait that is not associated with any of the SNPs)
- Run GWAS to test the association between this simulated null trait and each of the SNPs in our dataset.
- Record the smallest p-value from this GWAS.
- Repeat steps 1–3 many (e.g., 100, 1000) times.
- Look at the p-values you saved from those simulation replicates. Sort them from smallest to largest and find the lower 5th percentile (i.e., the value below which 5% of the p-values fall). This is your significance threshold!
Let’s implement this simulation-based multiple testing correction approach now. We’ll use our code from earlier as a starting point, but instead of checking if any SNPs are smaller than 0.05 for each simulation, we’ll record the smallest one. Run this code chunk.
# write a function to do one simulation replicate
# (should look a lot like our last simulation)
do_one_rep <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # record the smallest p-value
  min(pvals)
}
# then repeat many times
set.seed(494)
reps <- replicate(500, do_one_rep())
# then use the quantile function to find the lower 5th percentile
quantile(reps, probs = c(0.05))
5%
0.0005228908
Based on this approach, what significance threshold should we use? How does this compare to the Bonferroni correction? Pause and discuss.
This approach suggests that we should use a significance threshold of \(5.23 \times 10^{-4}\) (0.0005228908). This is very similar to, although slightly smaller than, the Bonferroni threshold of \(0.05/83 \approx 6\times 10^{-4}\).
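To put the two thresholds side by side (assuming the reps vector from the chunk above is still in your workspace):
# compare the Bonferroni and simulation-based thresholds
c(bonferroni = 0.05 / 83,                              # ~ 6.0e-04
  simulation = unname(quantile(reps, probs = 0.05)))   # ~ 5.2e-04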
e) Comparison
Let’s convince ourselves that the Bonferroni correction and simulation-based approach are both “working” here. How might we use a simulation study (perhaps similar to one we did earlier) to assess whether the significance thresholds given by these two approaches are actually controlling the FWER at our desired 5% level? Pause and discuss.
We can simulate data under the null hypothesis, conduct GWAS, and check how often we reject the null using these new significance thresholds. Repeating this many times will give us a sense of the family-wise error rate: across the 500 replicates of our simulation study (or however many simulations we decide to do), how often did we reject the null at least one time? Since we know that our data were generated under the scenario of no SNPs being associated with the trait, this will give us a sense of how often we will see at least one type I error.
Implement this simulation study and report what you find! Hint: borrow heavily from the code in Part 1b, above.
# create a helper function to easily repeat sim with different thresholds
do_one_fwer_sim <- function(thresh){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # check if any pvals are < threshold
  any(pvals < thresh) # swap out 0.05 in sim above for generic threshold
}
# assess 0.05 threshold
# (should be identical to results from 1b)
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.05)) %>%
mean()
[1] 0.986
# assess Bonferroni threshold
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.05/83)) %>%
mean()
[1] 0.058
# assess simulation-based threshold
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.0005228908)) %>%
mean()
[1] 0.05
Using a “traditional” significance threshold of 0.05 led to a nearly 99% family-wise error rate in our simulation study.
However, both Bonferroni and the simulation-based multiple testing correction approaches yield significance thresholds that reduce the family-wise error rate to our desired 5% level!
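If you'd like all three FWER estimates in a single object, one possible wrap-up (a sketch that reuses do_one_fwer_sim from above and resets the seed before each threshold so the numbers match the chunks above) is:
# estimate the empirical FWER for all three thresholds at once
thresholds <- c(naive = 0.05,
                bonferroni = 0.05 / 83,
                simulation = 0.0005228908)
sapply(thresholds, function(t){
  set.seed(494)
  mean(replicate(500, do_one_fwer_sim(thresh = t)))
}) # should return 0.986, 0.058, and 0.05, matching the results above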