Journal Club 3

STAT 494: Statistical Genetics

Author

Journal Club 3 Discussion Leaders

Published

February 11, 2025

library(tidyverse)

Introduction

To explore why the p-value should be lowered when multiple testing, we’re going to simulate some data. Keep this quote in mind from the first paragraph of the paper:

By testing an increasing number of hypotheses in a single experiment, however, [researchers] must account for the so-called testing burden of the experiment and accordingly adjust the significance threshold to reject the null hypothesis, thereby minimizing the chance of reporting a false positive (type I error).

We’ll start by simulating five different SNPs for one million different people and compare that to a normally distributed height.

SNPs <- c("c", "a") #string to sample from later
n <- 1000000 #Number of rows/people to create

random_height_5 <- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height with a mean of 66 (in) and a standard deviation of 3
                        "SNP1" = sample(SNPs, n, replace = TRUE), #randomly samples from SNPs to determine a DNA base
                        "SNP2" = sample(SNPs, n, replace = TRUE),
                        "SNP3" = sample(SNPs, n, replace = TRUE),
                        "SNP4" = sample(SNPs, n, replace = TRUE),
                        "SNP5" = sample(SNPs, n, replace = TRUE))

height_model_5 <- lm(Height ~ SNP1 + SNP2 + SNP3 + SNP4 + SNP5, data = random_height_5) #creates linear model with height as the outcome
summary(height_model_5)

Did you get any statistically significant results?

Some of you may have gotten one or two p-values below 0.05, but most of you probably didn’t. The code randomly generates DNA bases for each SNP independent of the height.

Keep in mind that we’re only testing 5 different hypotheses in this model. That means we’re only testing if each of the 5 SNPs is a good predictor of height. And the 5 SNPs should not be a good predictor of height since all of the data is random.

More Hypothesis Testing

But we should test more than 5 hypotheses. The paper uses 20,000+ samples.

SNPs <- c("c", "a")

n <- 1000000 #number of rows in the dataset
x <- 20 #number of SNPs generated / number of hypotheses tested
pvalue_threshold <- 0.05 #significance threshold

random_height_x <- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height
                           as_tibble(replicate(x, sample(SNPs, n, replace = TRUE)), 
                                     .name_repair = ~paste0("SNP", seq_along(.))) #generates x number of randomly sampled SNPs
)

height_model_x <- lm(Height ~ ., data = random_height_x) #model with all predictors incorporated

p_values <- summary(height_model_x)$coefficients[-1, 4] #extracts p-values from the model

sum(p_values < pvalue_threshold) #adds up the number of p-values below the set threshold
min(p_values) #minimum p-value

What’s the top number from your output? How many statistically significant results do you have now? And what is your p-value?

What if you increase the number of SNPs, which also increases the number of hypotheses you are testing? Take a couple minutes to mess around with the variables n, x, and pvalue_threshold. Discuss your results with your table.

Testing different groups

So far, we’ve only looked at individuals with the same height distribution, but we can add a second group. (The paper looked at five different populations, but for simplicity, we’ll only look at two.)

The code chunk below creates two different populations with different means for height. But each population’s SNPs are generated from the same sample. Run this chunk.

SNPs <- c("c", "a")

n <- 500000 #number of rows in the dataset
x <- 20 #number of SNPs generated / number of hypotheses tested
pvalue_threshold <- 0.05 #significance threshold
set.seed(333)

group1 <- tibble("Group" = "1",
                 "Height" = rnorm(n, 70, 3), #note the different mean for height here
                 as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), 
                                     .name_repair = ~paste0("SNP", seq_along(.))))

group2 <- tibble("Group" = "2",
                 "Height" = rnorm(n, 62, 3), #note the different mean for height here
                 as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), 
                                     .name_repair = ~paste0("SNP", seq_along(.))))

combined_groups <- rbind(group1, group2)

combined_groups %>%
  ggplot(aes(x = Height, fill = Group)) +
  geom_density(alpha = 0.5)

There is obviously some overlap between the distributions, but also a significant difference. Given that the SNPs are the same as in previous examples, and the only change is the outcome, what do you expect to happen? Do you think you will get any statistically significant results?

final_height_model <- lm(Height ~ . - Group, data = combined_groups)

p_values <- summary(final_height_model)$coefficients[-1, 4]

min(p_values)

summary(final_height_model)

If everything worked properly, you should get a p-value of 3.78 x 10^-6. I did cheat a bit, there is a specific seed set to get this sort of extreme result.

#Takeaway

Notice that even with purely random data, and even with data that should definitely not be correlated, we can still find unexpected results. This is why when researchers are testing genetic data with many more variables, the p-value needs to be lowered.

Discuss this result for a couple minutes with your table.

The code below is what I used to find the most extreme p-values, if you are interested. But it takes about a half hour to run, so don’t run it right now.

SNPs <- c("c", "a")

n <- 500000 #number of rows in the dataset
x <- 20 #number of SNPs generated / number of hypotheses tested
pvalue_threshold <- 0.05 #significance threshold

loop_df <- tibble("Seed" = NA,
                  "Min p-value" = NA,
                  "Num Below Threshold" = NA)

for (i in 1:500) {
  
  set.seed(i)
  
  group1 <- tibble("Group" = "1",
                   "Height" = rnorm(n, 70, 3),
                   as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), 
                                       .name_repair = ~paste0("SNP", seq_along(.))))
  
  group2 <- tibble("Group" = "2",
                   "Height" = rnorm(n, 62, 3),
                   as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), 
                                       .name_repair = ~paste0("SNP", seq_along(.))))
  
  combined_groups <- rbind(group1, group2)
  
  final_height_model <- lm(Height ~ . - Group, data = combined_groups)
  
  p_values <- summary(final_height_model)$coefficients[-1, 4]
  
  loop_df[i, 1] = i
  loop_df[i, 2] = min(p_values)
  loop_df[i, 3] = sum(p_values < pvalue_threshold)
  
  print(i)
}

loop_df %>% 
  arrange(`Min p-value`)

--- title: "Journal Club 3" subtitle: "STAT 494: Statistical Genetics" date: "February 11, 2025" author: "Journal Club 3 Discussion Leaders" format: html: toc: true toc-depth: 2 embed-resources: true code-tools: true --- ```{r} #| echo: false # set code chunk options to echo: true and eval: false by default # for students: you may want to change eval to true knitr::opts_chunk$set(echo=TRUE, eval=FALSE) ``` ```{r} library(tidyverse) ``` # Introduction To explore why the p-value should be lowered when **multiple testing**, we're going to simulate some data. Keep this quote in mind from the first paragraph of the paper: *By testing an increasing number of hypotheses in a single experiment, however, [researchers] must account for the so-called testing burden of the experiment and accordingly adjust the significance threshold to reject the null hypothesis, thereby minimizing the chance of reporting a false positive (type I error).* We'll start by simulating five different SNPs for one million different people and compare that to a normally distributed height. ```{r} SNPs <- c("c", "a") #string to sample from later n <- 1000000 #Number of rows/people to create random_height_5 <- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height with a mean of 66 (in) and a standard deviation of 3 "SNP1" = sample(SNPs, n, replace = TRUE), #randomly samples from SNPs to determine a DNA base "SNP2" = sample(SNPs, n, replace = TRUE), "SNP3" = sample(SNPs, n, replace = TRUE), "SNP4" = sample(SNPs, n, replace = TRUE), "SNP5" = sample(SNPs, n, replace = TRUE)) height_model_5 <- lm(Height ~ SNP1 + SNP2 + SNP3 + SNP4 + SNP5, data = random_height_5) #creates linear model with height as the outcome summary(height_model_5) ``` Did you get any statistically significant results? Some of you may have gotten one or two p-values below 0.05, but most of you probably didn't. The code randomly generates DNA bases for each SNP independent of the height. Keep in mind that we're only testing 5 different hypotheses in this model. That means we're only testing if each of the 5 SNPs is a good predictor of height. And the 5 SNPs should not be a good predictor of height since all of the data is random. # More Hypothesis Testing But we should test more than 5 hypotheses. The paper uses 20,000+ samples. ```{r} SNPs <- c("c", "a") n <- 1000000 #number of rows in the dataset x <- 20 #number of SNPs generated / number of hypotheses tested pvalue_threshold <- 0.05 #significance threshold random_height_x <- tibble("Height" = rnorm(n, 66, 3), #normal distribution of height as_tibble(replicate(x, sample(SNPs, n, replace = TRUE)), .name_repair = ~paste0("SNP", seq_along(.))) #generates x number of randomly sampled SNPs ) height_model_x <- lm(Height ~ ., data = random_height_x) #model with all predictors incorporated p_values <- summary(height_model_x)$coefficients[-1, 4] #extracts p-values from the model sum(p_values < pvalue_threshold) #adds up the number of p-values below the set threshold min(p_values) #minimum p-value ``` What's the top number from your output? How many statistically significant results do you have now? And what is your p-value? What if you increase the number of SNPs, which also increases the number of hypotheses you are testing? Take a couple minutes to mess around with the variables n, x, and pvalue_threshold. Discuss your results with your table. # Testing different groups So far, we've only looked at individuals with the same height distribution, but we can add a second group. (The paper looked at five different populations, but for simplicity, we'll only look at two.) The code chunk below creates two different populations with different means for height. But each population's SNPs are generated from the same sample. Run this chunk. ```{r} SNPs <- c("c", "a") n <- 500000 #number of rows in the dataset x <- 20 #number of SNPs generated / number of hypotheses tested pvalue_threshold <- 0.05 #significance threshold set.seed(333) group1 <- tibble("Group" = "1", "Height" = rnorm(n, 70, 3), #note the different mean for height here as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), .name_repair = ~paste0("SNP", seq_along(.)))) group2 <- tibble("Group" = "2", "Height" = rnorm(n, 62, 3), #note the different mean for height here as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), .name_repair = ~paste0("SNP", seq_along(.)))) combined_groups <- rbind(group1, group2) combined_groups %>% ggplot(aes(x = Height, fill = Group)) + geom_density(alpha = 0.5) ``` There is obviously some overlap between the distributions, but also a significant difference. Given that the SNPs are the same as in previous examples, and the only change is the outcome, what do you expect to happen? Do you think you will get any statistically significant results? ```{r test-label} final_height_model <- lm(Height ~ . - Group, data = combined_groups) p_values <- summary(final_height_model)$coefficients[-1, 4] min(p_values) summary(final_height_model) ``` If everything worked properly, you should get a p-value of 3.78 x 10^-6. I did cheat a bit, there is a specific seed set to get this sort of extreme result. #Takeaway Notice that even with purely random data, and even with data that should definitely not be correlated, we can still find unexpected results. This is why when researchers are testing genetic data with many more variables, the p-value needs to be lowered. Discuss this result for a couple minutes with your table. The code below is what I used to find the most extreme p-values, if you are interested. But it takes about a half hour to run, so don't run it right now. ```{r} #| eval: false SNPs <- c("c", "a") n <- 500000 #number of rows in the dataset x <- 20 #number of SNPs generated / number of hypotheses tested pvalue_threshold <- 0.05 #significance threshold loop_df <- tibble("Seed" = NA, "Min p-value" = NA, "Num Below Threshold" = NA) for (i in 1:500) { set.seed(i) group1 <- tibble("Group" = "1", "Height" = rnorm(n, 70, 3), as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), .name_repair = ~paste0("SNP", seq_along(.)))) group2 <- tibble("Group" = "2", "Height" = rnorm(n, 62, 3), as_tibble(replicate(20, sample(SNPs, n, replace = TRUE)), .name_repair = ~paste0("SNP", seq_along(.)))) combined_groups <- rbind(group1, group2) final_height_model <- lm(Height ~ . - Group, data = combined_groups) p_values <- summary(final_height_model)$coefficients[-1, 4] loop_df[i, 1] = i loop_df[i, 2] = min(p_values) loop_df[i, 3] = sum(p_values < pvalue_threshold) print(i) } loop_df %>% arrange(`Min p-value`) ```