Lab 3: The multiple testing problem

STAT 494: Statistical Genetics

Author

Solutions

Published

February 13, 2025



Goals

  • Implement the Bonferroni and simulation-based multiple testing correction approaches
  • Use simulation studies to assess the family-wise error rate of a specified significance threshold and to compare these two multiple testing correction approaches
  • Explore the concept of linkage disequilibrium and how it affects the multiple testing problem



Getting Started

  1. Access the source code for this document (click the </> Code button above)

  2. Copy-paste into a new QMD

  3. Add your name as the author in the YAML header, above

  4. Install the NatParksPalettes package (for fun color palettes inspired by pictures of US and Canadian National Parks – read more here):

install.packages('NatParksPalettes')
  5. Load the packages that we’ll need today:
library(tidyverse) # for data wrangling tools
library(broom) # for tidy model output
library(snpStats) # for reading in PLINK data
library(NatParksPalettes) # for fun color palettes



Exercises

Part 1: Independent Tests

To begin, let’s explore the problem of conducting multiple independent hypothesis tests. In the context of genetic data, this might look like testing SNPs that are far away from each other and are thus inherited independently.

a) Data setup

To start, let’s generate a small dataset with 165 rows (i.e., 165 people) and 83 columns (i.e., 83 SNPs). (The reasons for these numbers will become apparent later.) For now, we’ll assume that all of these SNPs have the same minor allele frequency of 10% and we’ll generate each of these SNPs independently from the others. We’ll use similar code to generate these data as we did in Lab 1.

Run the code chunk below.

# function to simulate one genetic variant
sim_one_variant <- function(n_ppl, MAF){
  snp <- rbinom(n = n_ppl, size = 2, p = MAF)
  return(snp)
}

# replicate 83 times to create 83 independent SNPs
# use a sample size of 165 and MAF of 10%
# set seed so we all get the same results
set.seed(494)
snps <- replicate(83, sim_one_variant(n_ppl = 165, MAF = 0.1))


Take a minute to explore the snps object and confirm it looks like what you were expecting: calculate the dimensions, type View(snps) in the Console to open the data in a new window, calculate the correlation between each pair of SNPs, etc. Pause and discuss.

# check dimensions
dim(snps)
[1] 165  83
#View(snps)

# calculate correlation
corr.snps <- cor(snps)

# plot correlation
image(corr.snps) # heat map

plot(corr.snps) # another option

# look at what's on the off-diagonals
summary(corr.snps[upper.tri(corr.snps)])
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.2039251 -0.0564661 -0.0025264  0.0006211  0.0508059  0.3552988 

We’ve generated a matrix of 0’s, 1’s, and 2’s with 165 rows and 83 columns.

Looking at the correlation between columns in this matrix, we see that there are a few columns that are moderately correlated with one another (correlation ~ 0.3), but for the most part the pairwise correlations are close to zero. This is what we’d expect to see when the columns are generated independently, as they were here.



b) What’s wrong with 0.05?

Suppose we want to conduct a “genome-wide” association study, using marginal regression to separately test the association between each of these 83 SNPs and a trait. Before we work on determining which significance threshold we should use in this setting, let’s explore what happens when we use the “traditional” p-value threshold of 0.05.

Consider the scenario where the null hypothesis is universally true: none of the 83 SNPs are associated with our trait. If we use a p-value threshold of 0.05, how often will we incorrectly state that at least one of these SNPs is associated with the trait (even though we know, in reality, that’s not true)?


To investigate this question, let’s start by simulating a quantitative trait under the null hypothesis, that is, a trait that does not depend on any of the SNPs. Run this code chunk.

# simulate y from a standard normal distribution
# NOTE: y does not depend on genotypes in any way (hence is "null")
set.seed(494) 
y <- rnorm(n = 165, mean = 0, sd = 1) 


Next, implement GWAS to investigate the association between each SNP and the simulated null trait. Run this code chunk.

# set up empty vector to store p-values
pvals <- c()

# loop through each of the 83 SNPs
for(i in 1:83){
  mod <- lm(y ~ snps[,i]) # fit model looking at SNP i
  pvals[i] <- tidy(mod)$p.value[2] # record p-value
}

Look at the p-values that we got. Are any of them below 0.05? How many of them “should” be below 0.05? Pause and discuss.

# check if any p-values are below 0.05
any(pvals < 0.05)
[1] TRUE
# check how many p-values are below 0.05
sum(pvals < 0.05)
[1] 4
# what proportion does this equate to?
mean(pvals < 0.05)
[1] 0.04819277

Ideally, we would hope not to see any p-values below 0.05 since we know the truth is that none of our 83 SNPs are associated with the trait. However, we ended up with 4 SNPs (out of 83) with a p-value smaller than 0.05, i.e., 4 SNPs for which we incorrectly reject the null hypothesis. Are we comfortable with this?



Maybe we just got “unlucky” in our first simulation replicate. Let’s repeat this process many times and see how often we end up with at least one p-value being below 0.05. Let’s wrap our code, above, into a function called do_one_sim to help us do this. Run this code chunk.

# create a custom function that puts all steps above into one function
do_one_sim <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # check if any pvals are < 0.05
  any(pvals < 0.05)
}

# try it once to confirm we get the same thing as above
set.seed(494)
do_one_sim()
[1] TRUE

Now, we can use the replicate function to run do_one_sim 500 times. Run this code chunk.

# NOTE: the cache code chunk option is helpful here!
# repeat simulation 500 times
set.seed(494)
simres <- replicate(500, do_one_sim())

Across how many of these simulations did we end up with at least one p-value below the 0.05 threshold?

# look at simulation results
head(simres)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
# how many were TRUE?
sum(simres)
[1] 493
# what proportion were TRUE?
mean(simres)
[1] 0.986

Across our 500 simulations, we had at least one p-value below the 0.05 threshold (i.e., at least one type I error) 98.6 percent of the time.


We just calculated the empirical family-wise error rate for our simulation study. How does this compare to the mathematical result that we derived together in class? Collectively, what do these results suggest about the appropriateness of using a p-value threshold of 0.05 in this setting? Pause and discuss.

Theoretically, the family-wise error rate (the probability of having at least one type I error) should be

\[ \begin{aligned} FWER &= P(\text{at least one T1E}) \\ & = 1 - P(\text{no T1E on any tests}) \\ & = 1 - P(\text{no T1E on test 1 \&} \dots \text{\& no T1E on test 83}) \\ & = 1 - \left[P(\text{no T1E on test 1}) \times \dots \times P(\text{no T1E on test 83})\right], \text{ since the tests are independent} \\ & = 1 - [1 - P(\text{T1E on test 1})] \times \dots \times [1 - P(\text{T1E on test 83})] \\ & = 1 - [1 - 0.05] \times \dots \times [1 - 0.05] \\ & = 1 - [1 - 0.05]^{83} \\ & = 0.9858401 \end{aligned} \]
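We can double-check this arithmetic directly in R (a quick sanity check, not part of the lab's original code):

# theoretical FWER for 83 independent tests, each using a 0.05 threshold
1 - (1 - 0.05)^83
[1] 0.9858401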

This is incredibly close to the family-wise error rate that we observed in our simulations!

Collectively, these results suggest that if we use a p-value threshold of 0.05, our family-wise error rate will be very high when we conduct this many tests. In other words, if we conduct 83 hypothesis tests using a significance threshold of 0.05, there is a very high chance that we will incorrectly reject the null hypothesis (i.e., incorrectly conclude that a SNP is associated with the trait, even when it isn’t) on at least one of these tests! A stricter (smaller) threshold may be more appropriate.



c) Bonferroni correction

Hopefully it’s clear from the last section that a p-value threshold of 0.05 is NOT appropriate in this setting. What should we use instead? One approach we can use to control our family-wise error rate (FWER) at a more reasonable level is the Bonferroni correction.

According to this approach, what significance threshold should we use in this setting if we want to control the FWER at 0.05? Pause and discuss.

The Bonferroni correction approach says to set our significance threshold equal to our desired FWER divided by the number of tests we are conducting. In this case:

\[ \begin{aligned} \frac{\text{FWER}}{\text{no. tests}} & = \frac{0.05}{83} \\ & \approx 6\times 10^{-4} \end{aligned} \]
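In code, this is a one-liner (a quick check of the arithmetic above):

# Bonferroni-corrected significance threshold: desired FWER / number of tests
0.05 / 83
[1] 0.0006024096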



d) Simulation-based approach

Another way to determine an appropriate significance threshold is to use a simulation-based approach like the one used in Pulit et al., 2016. If we want to control the family-wise error rate at 5%, then the steps of this process look like this:

  1. Simulate a null trait (i.e., a trait that is not associated with any of the SNPs)
  2. Run GWAS to test the association between this simulated null trait and each of the SNPs in our dataset.
  3. Record the smallest p-value from this GWAS.
  4. Repeat steps 1–3 many (e.g., 100, 1000) times.
  5. Look at the p-values you saved from those simulation replicates. Sort them from smallest to largest and find the lower 5th percentile (i.e., the value below which 5% of the p-values fall). This is your significance threshold!


Let’s implement this simulation-based multiple testing correction approach now. We’ll use our code from earlier as a starting point, but instead of checking if any SNPs are smaller than 0.05 for each simulation, we’ll record the smallest one. Run this code chunk.

# write a function to do one simulation replicate 
# (should look a lot like our last simulation)
do_one_rep <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # record the smallest p-value
  min(pvals)
}


# then repeat many times
set.seed(494) 
reps <- replicate(500, do_one_rep())


# then use the quantile function to find the lower 5th percentile
quantile(reps, probs = c(0.05))
          5% 
0.0005228908 

Based on this approach, what significance threshold should we use? How does this compare to the Bonferroni correction? Pause and discuss.

This approach suggests that we should use a significance threshold of 0.0005228908, or approximately \(5.2\times 10^{-4}\). This is quite close to the Bonferroni threshold of \(0.05/83 \approx 6 \times 10^{-4}\), which makes sense here: these SNPs were simulated independently, which is exactly the setting Bonferroni is designed for.
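If it helps to see the two thresholds side by side, here is a small sketch reusing the reps object from above:

# simulation-based threshold vs. Bonferroni threshold
quantile(reps, probs = 0.05) # simulation-based: ~0.00052
0.05 / 83                    # Bonferroni:       ~0.00060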



e) Comparison

Let’s convince ourselves that the Bonferroni correction and simulation-based approach are both “working” here. How might we use a simulation study (perhaps similar to one we did earlier) to assess whether the significance thresholds given by these two approaches are actually controlling the FWER at our desired 5% level? Pause and discuss.

We can simulate data under the null hypothesis, conduct GWAS, and check how often we reject the null using these new significance thresholds. Repeating this many times will give us a sense of the family-wise error rate: across the 500 replicates of our simulation study (or however many simulations we decide to do), how often did we reject the null at least one time? Since we know that our data were generated under the scenario of no SNPs being associated with the trait, this will give us a sense of how often we will see at least one type I error.


Implement this simulation study and report what you find! Hint: borrow heavily from the code in Part 1b, above.

# create a helper function to easily repeat sim with different thresholds
do_one_fwer_sim <- function(thresh){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1)
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ snps[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # check if any pvals are < threshold
  any(pvals < thresh) # swap out 0.05 in sim above for generic threshold
}
# assess 0.05 threshold
# (should be identical to results from 1b)
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.05)) %>%
  mean()
[1] 0.986
# assess Bonferroni threshold
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.05/83)) %>%
  mean()
[1] 0.058
# assess simulation-based threshold
set.seed(494)
replicate(500, do_one_fwer_sim(thresh = 0.0005228908)) %>%
  mean()
[1] 0.05

Using a “traditional” significance threshold of 0.05 led to a nearly 99% family-wise error rate in our simulation study.
However, both the Bonferroni and simulation-based multiple testing correction approaches yield significance thresholds that bring the family-wise error rate down to (approximately) our desired 5% level!




Part 2: Correlated Tests

In the previous example, we assumed that our 83 SNPs were independent of one another, but this is not a realistic assumption in the case of a typical GWAS. In reality, nearby SNPs are correlated with one another due to the processes involved in the inheritance of DNA. How does this impact the multiple testing problem?


a) HapMap data setup

Let’s return to the HapMap data we explored in Lab 2. To make the size of the dataset more manageable for us today, we’ll just look at the first 100 SNPs.

First, update the file paths below to reflect the way you’ve stored these data on your computer:

# paths to PLINK files
bed <- '../data/1_QC_GWAS/HapMap_3_r3_1.bed'
bim <- '../data/1_QC_GWAS/HapMap_3_r3_1.bim'
fam <- '../data/1_QC_GWAS/HapMap_3_r3_1.fam'

Then, read in the files, using the select.snps argument to select only the first 100. Run this code chunk.

# read in first 100 SNPs
hapmap <- read.plink(bed, bim, fam, select.snps = 1:100)

# confirm we have 100 SNPs only
hapmap$genotypes
A SnpMatrix with  165 rows and  100 columns
Row names:  NA06989 ... NA12865 
Col names:  rs2185539 ... rs12757754 

Notice that we only have 100 columns (SNPs) this time. This is because of our select.snps argument above.



b) Linkage disequilibrium

There is a special term that we use in genetics to talk about the correlation between SNPs: linkage disequilibrium, or LD for short.

The snpStats package has a function called ld that will calculate LD for us. Run this code chunk.

# calculate LD
hapmap.ld <- ld(hapmap$genotypes, depth = 99, stats = "R.squared", symmetric = TRUE)


Let’s investigate and then plot this LD matrix. What do you notice? Pause and discuss.

# look at the first 5-by-5 elements:
hapmap.ld[1:5, 1:5]
5 x 5 sparse Matrix of class "dsCMatrix"
           rs2185539  rs11510103 rs11240767   rs3131972   rs3131969
rs2185539          .          NA         NA          NA          NA
rs11510103        NA .                   NA 0.005662224 0.007526042
rs11240767        NA          NA          .          NA          NA
rs3131972         NA 0.005662224         NA .           0.800991691
rs3131969         NA 0.007526042         NA 0.800991691 .          
# plot LD (grey scale)
image(hapmap.ld)

# plot LD (fun color palette)
color.pal <- natparks.pals("Acadia", 10)
image(hapmap.ld, lwd = 0, cuts = 9, col.regions = color.pal, colorkey = TRUE)

Across the first 100 SNPs in the HapMap dataset, we can see that there are a few “blocks” of SNPs that are highly correlated with one another. This differs drastically from the pattern we saw in the toy snps dataset that we generated above, where all SNPs were independent.

What’s up with the white lines in the heat map above? As you may recall from Lab 2, there are some SNPs that are monomorphic in the HapMap dataset (e.g., rs2185539, rs11240767). Suppose we try to calculate the correlation between a monomorphic SNP (x) and another SNP (y):

\[r = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2}}\]

But since the SNP is monomorphic, no one in the dataset has any copies of the minor allele. In other words, \(x_1 = \dots = x_n = 0\). This means that the sample average, \(\bar{x}\), is also 0, and thus \(x_i - \bar{x} = 0 - 0 = 0\) for all individuals \(i = 1, \dots, n\). Then, the correlation looks like this:

\[r = \frac{\sum_{i=1}^n 0 \times (y_i - \bar{y})}{\sqrt{0 \times \sum_{i=1}^n(y_i - \bar{y})^2}} = \frac{0}{0},\]

which is undefined. This explains why we have a bunch of NAs in our LD matrix. If we remove the monomorphic SNPs, our LD matrix looks like this:

# find monomorphic
maf <- col.summary(hapmap$genotypes)$MAF
mono <- which(maf == 0)

# calculate LD on polymorphic SNPs only
hapmap.ld.nomono <- ld(hapmap$genotypes[,-mono], depth = 99-length(mono), stats = "R.squared", symmetric = TRUE)

# plot 
image(hapmap.ld.nomono, lwd = 0, cuts = 9, col.regions = color.pal, colorkey = TRUE)
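As a quick sanity check on the math above, we can see the same behavior from cor() directly. This is just an illustration with made-up vectors (x.mono and y.other are hypothetical, not SNPs from the HapMap data):

# a monomorphic "SNP" (all zeros) has zero variance, so its correlation with anything is undefined
x.mono <- rep(0, 165)                            # hypothetical monomorphic SNP
y.other <- rbinom(n = 165, size = 2, prob = 0.1) # hypothetical polymorphic SNP
cor(x.mono, y.other)                             # returns NA (R warns that the standard deviation is zero)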



c) Quality control

As you hopefully noticed in the last section, nearby SNPs are often highly correlated with one another. This means that our hypothesis tests at nearby SNPs will also be highly correlated. How does this impact our choice of significance threshold? Let’s try implementing the simulation-based multiple testing approach on this dataset to find out.

First, we’ll repeat some of our data pre-processing / “quality control” steps from Lab 2:

  • convert our genotype data into a matrix of 0’s, 1’s, and 2’s instead of the snpStats format it’s currently in
  • remove the monomorphic SNPs

Run this code chunk.

# convert to matrix
hapmap.geno <- as(hapmap$genotypes, "numeric")

# calculate MAF
maf <- col.summary(hapmap$genotypes)$MAF

# find monomorphic SNPs
monomorphic <- which(maf == 0) 

# remove monomorphic SNPs from genotype matrix
hapmap.geno <- hapmap.geno[,-monomorphic]

# check the dimensions after filtering
dim(hapmap.geno)
[1] 165  83

How many SNPs remain after filtering? Pause and discuss.

After filtering, 83 polymorphic SNPs remain. (This means we would use the same Bonferroni threshold in this setting as we did in Part 1.)



d) Simulation-based approach

Now, modify your code from Part 1 to run the simulation-based multiple testing correction approach on the HapMap data. Remember to update the number of individuals and the number of SNPs accordingly.

Add the cache: true code chunk option here!

# write a function to do one simulation replicate
## NOTE:
## -- we can keep n = 165 and p = 83 since the hapmap data have the same dimensions (after QC)
## -- we need to change the dataset from "snps" to "hapmap.geno"
do_one_rep_hapmap <- function(){
  # simulate null trait
  y <- rnorm(n = 165, mean = 0, sd = 1) 
  # implement GWAS
  pvals <- c()
  for(i in 1:83){
    mod <- lm(y ~ hapmap.geno[,i])
    pvals[i] <- tidy(mod)$p.value[2]
  }
  # record the smallest p-value
  min(pvals)
}

# then repeat many times
set.seed(494) 
hapmap.reps <- replicate(500, do_one_rep_hapmap())

# then use the quantile function to find the lower 5th percentile
quantile(hapmap.reps, probs = c(0.05))
         5% 
0.001048875 

What threshold did you get? How does this compare to the Bonferroni correction? Pause and discuss.

I got a threshold of 0.0010489, or approximately \(1 \times 10^{-3}\).

Since we are doing 83 tests, Bonferroni tells us to use a threshold of \(6.0240964 \times 10^{-4}\), or approximately \(6 \times 10^{-4}\).

The Bonferroni threshold is quite a bit smaller (more strict) than the threshold we get from the simulation-based approach! If we used the Bonferroni threshold, this would make it (unnecessarily) more challenging for us to reject the null hypothesis, which could lead to more type II errors (false negatives). Remember that Bonferroni does not take into account the correlation between tests. In the scenario we investigated here, although we are conducting 83 tests, we are not conducting 83 independent tests. Since these tests are, in some cases, very highly correlated, we are effectively conducting fewer tests and thus don’t need as strict of a multiple testing correction.
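One rough, back-of-the-envelope way to put a number on this “effectively fewer tests” idea (an informal calculation, not a formal method from the lab): ask how many independent tests Bonferroni would have to assume in order to arrive at the simulation-based threshold.

# implied "effective number of independent tests" behind the simulation-based threshold
0.05 / quantile(hapmap.reps, probs = 0.05) # roughly 48, noticeably fewer than the 83 SNPs we tested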


Reflect on the steps of the simulation-based approach. Why does this approach work? Pause and discuss.

This approach is cleverly designed to “reverse-engineer” the multiple testing problem and identify, for this particular dataset, the threshold that will control the family-wise error rate at your desired level (e.g., 5%).

Why keep track of minimum p-values from each replicate? Remember that FWER is concerned not with the exact number of type I errors that arise in each association study, but with the chance that “at least one” type I error arises. This is why the simulation approach records just the smallest p-value from each simulation replicate. Let’s say the smallest p-value from the first replicate was 0.008. If I use a significance threshold of 0.05, I know that the smallest p-value for this replicate (0.008) is below that threshold, so I would have “at least one” type I error in that analysis. If, on the other hand, I use a significance threshold of 0.001, I know that I would not have any type I errors for that study (if the smallest p-value is not below 0.001, then none of the other p-values will be either).

Why do we set our threshold equal to the lower 5th percentile of these minimum p-values? This equates to taking the minimum p-value from each simulation replicate, sorting those minimums from smallest to largest, and finding the value below which 5% of them fall (and above which the other 95% fall).
Consider Figure 1D in Pulit et al., 2016:

If we set our significance threshold equal to this 5th percentile of minimum p-values (the blue diamond in the figure above), we know that we will have at least one type I error in exactly 5% of our simulation replicates: in 5% of replicates, there will be at least one p-value below the threshold. In other words, our empirical family-wise error rate will be exactly 5%! (If you want to control your FWER at a different level than 5%, just change which percentile of minimum p-values you pull.)
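To see this numerically, you can check what fraction of the replicate-minimum p-values fall below the chosen threshold (a quick check using the hapmap.reps object from Part 2d):

# empirical FWER at the simulation-based threshold:
# proportion of replicates whose smallest p-value falls below the threshold
thresh.sim <- quantile(hapmap.reps, probs = 0.05)
mean(hapmap.reps < thresh.sim) # approximately 0.05, by construction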


Going Deeper

Why does Bonferroni work?

Our simulations above provide a nice proof-of-concept that Bonferroni can also control the FWER at (or below) 5%. There’s also a nice mathematical proof. Use what you learned in MATH/STAT 354: Probability and see if you can prove this yourself.
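If you’d like a hint, here is a sketch using Boole’s inequality (the union bound); justifying each step is left to you. Here \(m\) is the number of tests and \(\alpha/m\) is the Bonferroni threshold.

\[ \begin{aligned} FWER &= P(\text{at least one T1E}) = P\left(\bigcup_{i=1}^{m} \{\text{T1E on test } i\}\right) \\ & \leq \sum_{i=1}^{m} P(\text{T1E on test } i) \\ & \leq \sum_{i=1}^{m} \frac{\alpha}{m} = m \times \frac{\alpha}{m} = \alpha \end{aligned} \]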

Check your work here.