Power and ‘fragile’ p-values by @ellis2013nz

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers.]

Do 'fragile' p values tell us anything?

I was interested recently to see this article on p values in the psychology literature float across my social media feed. Paul C Bogdan makes the case that the severity of the replication crisis in science can be judged in part by the proportion of p values that are 'fragile', which he defines as between 0.01 and 0.05.

Of course, concern about the proportion of p values that are 'significant, but only just' has been a constant feature of the replication crisis. One standing concern with science is that researchers use questionable research practices to nudge p values down to just below the threshold deemed to be "significant" evidence. Another standing concern is that researchers who do not use those practices in the analysis itself may still not publish, or not be able to publish, their null results, leaving a bias towards positive results in the published literature (the "file-drawer" problem).

Bogdan argues that for studies with 80% power (power being one minus the probability of accepting the null hypothesis when there is in fact a real effect in the data), 26% of the p values that are significant should fall in this "fragile" range, based on simulations.

The research Bogdan describes in the article linked above is a clever data-processing exercise on the published psychology literature, to see what proportion of p values are in fact "fragile" and how this changes over time. He finds that "From before the replication crisis (2004–2011) to today (2024), the overall percentage of significant p values in the fragile range has dropped from 32% to nearly 26%". As 26% is about what we'd expect if all the studies had 80% power, this is seen as good news. Is the replication crisis over? (To be fair, I don't think Bogdan claims this last point.)

One of Bogdan's own citations is this piece by Daniel Lakens, which is itself a critique of an earlier, similar attempt at this. Lakens argues that "the changes in the ratio of fractions of p-values between 0.041–0.049 over the years are better explained by assuming the average power has decreased over time" rather than by changes in questionable research practices. I think I agree with Lakens on this. I just don't think the expectation that 26% of significant p values will be 'fragile' is a solid enough benchmark to judge research practices against.

Anyway, all this intrigued me enough when it was discussed first in Science (as "a big win") and then on Bluesky for me to want to do my own simulations, to see how changes in effect sizes and sample sizes would change that 26%. My hunch was that the 26% rested on assumptions that all studies have 80% power and (given power has to be calculated for some assumed but unobserved true effect size) that the actual difference in the real world is close to the difference assumed in making the power calculation. Both these assumptions are obviously extremely brittle, but what is the impact if they are wrong?

From my rough playing out below, the impact is pretty material.
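Before getting to those simulations, it is worth seeing where the 26% benchmark itself comes from. Here is a minimal sketch (my own back-of-envelope calculation, not Bogdan's code) using a simple normal approximation for a two-sided test with 80% power at alpha = 0.05:

# noncentrality parameter giving roughly 80% power at alpha = 0.05, two-sided
# (normal approximation)
ncp <- qnorm(0.975) + qnorm(0.80)

# probability of a significant result at a given alpha, for that noncentrality
p_sig <- function(alpha) {
  pnorm(qnorm(1 - alpha / 2) - ncp, lower.tail = FALSE) +
    pnorm(qnorm(alpha / 2) - ncp)
}

# proportion of significant p values that are 'fragile', i.e. between 0.01 and 0.05
(p_sig(0.05) - p_sig(0.01)) / p_sig(0.05)
# comes out at roughly 0.26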
We shouldn't think that changes in the proportion of significant p values that are between 0.01 and 0.05 tell us much about questionable research practices, because there is just too much else confounding the whole thing: pre-calculated power, how well the power calculations (and indeed the choice of research) reflect reality, the size of the differences we're looking for, and sample sizes.

Do your own research simulations

To do this, I wrote a simple function, experiment, which draws two independent samples from two populations, with all observations normally distributed. For my purposes the two sample sizes are going to be the same and the standard deviations the same in both populations; only the means differ by population. But the function is set up for a more general exploration if I'm ever motivated.

The ideal situation – researcher's power calculation matches the real world

With this function I first played around a bit to get a situation where the power is very close to 80%. I got this with sample sizes of 53 each and a difference of 0.55 between the means of the two populations (remembering that both populations have a standard deviation of 1).

I then checked this with a published power package: Bulus, M. (2023), pwrss: Statistical Power and Sample Size Calculation Tools, R package version 0.3.1, https://CRAN.R-project.org/package=pwrss. I've never used it before and just downloaded it to check I hadn't made mistakes in my own calculations; later I will use it to speed up some stuff.

library(pwrss)
library(tidyverse)

experiment
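A minimal sketch of the kind of experiment function described above might look like the following (the argument names, and the choice to return the p value from a two-sample t-test, are my assumptions for illustration rather than the original code):

# illustrative sketch only: draw two independent normal samples and return
# the p value from a two-sample t-test comparing their means
experiment <- function(n1, n2 = n1, mean1 = 0, mean2 = 0, sd1 = 1, sd2 = sd1) {
  x1 <- rnorm(n1, mean = mean1, sd = sd1)
  x2 <- rnorm(n2, mean = mean2, sd = sd2)
  t.test(x1, x2)$p.value
}

# for example, one simulated study with 53 observations per group and a
# true difference in means of 0.55
experiment(n1 = 53, mean2 = 0.55)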
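For the power check itself, base R's power.t.test() confirms that 53 observations per group and a true difference of 0.55 standard deviations gives roughly 80% power. The pwrss call below is my best understanding of that package's interface for the same check, not code taken from the post:

# base R: power of a two-sample t-test with n = 53 per group, delta = 0.55, sd = 1
power.t.test(n = 53, delta = 0.55, sd = 1, sig.level = 0.05)

# the same check with pwrss (argument names assumed from pwrss 0.3.1)
library(pwrss)
pwrss.t.2means(mu1 = 0.55, mu2 = 0, sd1 = 1, sd2 = 1, n2 = 53)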