library(tidyverse)
Journal Club 3
STAT 494: Statistical Genetics
Introduction
To explore why the p-value should be lowered when multiple testing, we’re going to simulate some data. Keep this quote in mind from the first paragraph of the paper:
By testing an increasing number of hypotheses in a single experiment, however, [researchers] must account for the so-called testing burden of the experiment and accordingly adjust the significance threshold to reject the null hypothesis, thereby minimizing the chance of reporting a false positive (type I error).
We’ll start by simulating five different SNPs for one million different people and compare that to a normally distributed height.
<- c("c", "a") #string to sample from later
SNPs <- 1000000 #Number of rows/people to create
n
<- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height with a mean of 66 (in) and a standard deviation of 3
random_height_5 "SNP1" = sample(SNPs, n, replace = TRUE), #randomly samples from SNPs to determine a DNA base
"SNP2" = sample(SNPs, n, replace = TRUE),
"SNP3" = sample(SNPs, n, replace = TRUE),
"SNP4" = sample(SNPs, n, replace = TRUE),
"SNP5" = sample(SNPs, n, replace = TRUE))
<- lm(Height ~ SNP1 + SNP2 + SNP3 + SNP4 + SNP5, data = random_height_5) #creates linear model with height as the outcome
height_model_5 summary(height_model_5)
Did you get any statistically significant results?
Some of you may have gotten one or two p-values below 0.05, but most of you probably didn’t. The code randomly generates DNA bases for each SNP independent of the height.
Keep in mind that we’re only testing 5 different hypotheses in this model. That means we’re only testing if each of the 5 SNPs is a good predictor of height. And the 5 SNPs should not be a good predictor of height since all of the data is random.
More Hypothesis Testing
But we should test more than 5 hypotheses. The paper uses 20,000+ samples.
<- c("c", "a")
SNPs
<- 1000000 #number of rows in the dataset
n <- 20 #number of SNPs generated / number of hypotheses tested
x <- 0.05 #significance threshold
pvalue_threshold
<- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height
random_height_x as_tibble(replicate(x, sample(SNPs, n, replace = TRUE)),
.name_repair = ~paste0("SNP", seq_along(.))) #generates x number of randomly sampled SNPs
)
<- lm(Height ~ ., data = random_height_x) #model with all predictors incorporated
height_model_x
<- summary(height_model_x)$coefficients[-1, 4] #extracts p-values from the model
p_values
sum(p_values < pvalue_threshold) #adds up the number of p-values below the set threshold
min(p_values) #minimum p-value
What’s the top number from your output? How many statistically significant results do you have now? And what is your p-value?
What if you increase the number of SNPs, which also increases the number of hypotheses you are testing? Take a couple minutes to mess around with the variables n, x, and pvalue_threshold. Discuss your results with your table.
Testing different groups
So far, we’ve only looked at individuals with the same height distribution, but we can add a second group. (The paper looked at five different populations, but for simplicity, we’ll only look at two.)
The code chunk below creates two different populations with different means for height. But each population’s SNPs are generated from the same sample. Run this chunk.
<- c("c", "a")
SNPs
<- 500000 #number of rows in the dataset
n <- 20 #number of SNPs generated / number of hypotheses tested
x <- 0.05 #significance threshold
pvalue_threshold set.seed(333)
<- tibble("Group" = "1",
group1 "Height" = rnorm(n, 70, 3), #note the different mean for height here
as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)),
.name_repair = ~paste0("SNP", seq_along(.))))
<- tibble("Group" = "2",
group2 "Height" = rnorm(n, 62, 3), #note the different mean for height here
as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)),
.name_repair = ~paste0("SNP", seq_along(.))))
<- rbind(group1, group2)
combined_groups
%>%
combined_groups ggplot(aes(x = Height, fill = Group)) +
geom_density(alpha = 0.5)
There is obviously some overlap between the distributions, but also a significant difference. Given that the SNPs are the same as in previous examples, and the only change is the outcome, what do you expect to happen? Do you think you will get any statistically significant results?
<- lm(Height ~ . - Group, data = combined_groups)
final_height_model
<- summary(final_height_model)$coefficients[-1, 4]
p_values
min(p_values)
summary(final_height_model)
If everything worked properly, you should get a p-value of 3.78 x 10^-6. I did cheat a bit, there is a specific seed set to get this sort of extreme result.
#Takeaway
Notice that even with purely random data, and even with data that should definitely not be correlated, we can still find unexpected results. This is why when researchers are testing genetic data with many more variables, the p-value needs to be lowered.
Discuss this result for a couple minutes with your table.
The code below is what I used to find the most extreme p-values, if you are interested. But it takes about a half hour to run, so don’t run it right now.
<- c("c", "a")
SNPs
<- 500000 #number of rows in the dataset
n <- 20 #number of SNPs generated / number of hypotheses tested
x <- 0.05 #significance threshold
pvalue_threshold
<- tibble("Seed" = NA,
loop_df "Min p-value" = NA,
"Num Below Threshold" = NA)
for (i in 1:500) {
set.seed(i)
<- tibble("Group" = "1",
group1 "Height" = rnorm(n, 70, 3),
as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)),
.name_repair = ~paste0("SNP", seq_along(.))))
<- tibble("Group" = "2",
group2 "Height" = rnorm(n, 62, 3),
as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)),
.name_repair = ~paste0("SNP", seq_along(.))))
<- rbind(group1, group2)
combined_groups
<- lm(Height ~ . - Group, data = combined_groups)
final_height_model
<- summary(final_height_model)$coefficients[-1, 4]
p_values
1] = i
loop_df[i, 2] = min(p_values)
loop_df[i, 3] = sum(p_values < pvalue_threshold)
loop_df[i,
print(i)
}
%>%
loop_df arrange(`Min p-value`)