Motivation

A few months back, in a side skirmish during the great p-curve controversy, Richard McElreath mentioned that p-values under the null hypothesis are not always uniformly distributed, as is sometimes claimed. This prompted me to check out the phenomenon. I'll admit I had in my head the basic idea that p-values are indeed uniformly distributed if the null hypothesis is true. It turns out this is only 'often' the case, not always.

As is commonly the case for this blog, the main motivation was to make sure I understood something myself, so there's nothing particularly new for the world in this post. But it might be interesting for some. There were a few odd fishhooks.

I wanted to check the case where we are comparing the number of "successes" in a relatively small sample of binary success/failure outcomes, split into two groups. The null hypothesis is that the underlying probability of success is the same in each group. I wanted to see the distribution of the p-value for a test of this null hypothesis with sample sizes of 10, 100 or 1000 observations in each group; for different values of the underlying probability of success; and for when each group's sample size is random with a mean of 100.

Calculating the p value

The fishhooks were in how to calculate the p value. I actually went for three different ways:

- My lazy approximation is to estimate the variance of the difference between the two sample proportions under the null hypothesis and rely on asymptotic normality to get the probability of a value as extreme as the one actually observed. The disadvantage is that this is known to be inaccurate, particularly when the sample is small or the probability of success is close to 0 or 1. The advantage is that I didn't need to look anything up and it was easy to vectorise. This method is called pval_hand in the plot below.
- A better method is to use the Fisher exact test, as per the famous lady-tasting-tea analysis, but the out-of-the-box implementation I was using (fisher.pval() from the corpora R package by Stephanie Evert) doesn't work when the observed number of successes is zero in both samples. This method is called pval_fisher in the plot below.
- The most out-of-the-box method of all is to use the prop.test() function from the stats package built into R. The disadvantage here was the bother of vectorising this function to run efficiently over large numbers of simulations. This method is called pval_proptest in the plot below.

Because I was nervous that these methods might give materially different results, I started by comparing the results they gave for different sample sizes and underlying parameters, with just 1,000 repetitions for each combination of sample size and parameter. This gives this comparison:

So we can see that the pval_fisher and pval_proptest methods give effectively the same results, whereas my hand-made method has a significant number of discrepancies. Because of this I decided to stick with Evert's corpora::fisher.pval.
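As a quick illustration of the interface (this snippet is mine, not from the original post), fisher.pval() takes the two success counts and the two sample sizes directly and is vectorised over them:

```r
library(corpora)

# Fisher exact test p-values comparing 3/10 successes with 9/10,
# and 5/100 successes with 9/100
fisher.pval(k1 = c(3, 5), n1 = c(10, 100), k2 = c(9, 9), n2 = c(10, 100))
```

The default alternative is a two-sided test.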
I just hardened it up with a wrapper function that defines the p value (the chance of seeing data as extreme as this, if the null hypothesis is true) to be 1 if the observed successes are 0 in both samples:

```r
library(tidyverse)
library(corpora)
library(GGally)
library(glue)
library(scales)
library(frs) # for svg_png()

#' Version of fisher.pval that won't break if k1 and k2 both 0
tough_fisher <- function(k1, n1, k2, n2, ...){
  # body reconstructed from the description above: define the p value as 1
  # when both observed success counts are zero, otherwise use fisher.pval()
  out <- rep(1, length(k1))
  ok <- !(k1 == 0 & k2 == 0)
  out[ok] <- fisher.pval(k1[ok], n1[ok], k2[ok], n2[ok], ...)
  out
}
```
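For completeness, the hand-rolled normal approximation labelled pval_hand in the comparison above could be sketched roughly like this. This is my own reconstruction of the idea described earlier, not the original code:

```r
# A sketch (not the original pval_hand) of the hand-rolled approximation:
# estimate the variance of the difference in sample proportions under the
# null using the pooled proportion, then rely on asymptotic normality.
# Vectorised over all four arguments; returns NaN if both counts are 0 or
# both equal their sample sizes, because the estimated standard error is 0.
pval_hand <- function(k1, n1, k2, n2){
  p_pooled <- (k1 + k2) / (n1 + n2)
  se_diff  <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
  z <- (k1 / n1 - k2 / n2) / se_diff
  # two-sided p-value from the standard normal approximation
  2 * pnorm(-abs(z))
}
```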