Nerdy Debate About p-Values

By Brian Resnick 7/31/17 @ 12:00pm EDT; Original article here

The article explains the findings of Revised Standards For Statistical Evidence by Valen E. Johnson (see 2013 pdf). Keep in mind that this is only a proposal, something to spark debate. To the author’s knowledge, journals are not rushing to change their editorial standards.

The debate centers on one question: “What counts as solid evidence?”

When researchers calculate a p-value, they’re putting to the test what’s known as the null hypothesis: a statement that there is no difference in the results measured between test and control groups, when observing the effects of varying a single variable.

For a long time, scientists and the journals that publish their work have thought p < .05 represented something rare. Many still do, but new work in statistics shows that it does not. The biggest change Johnson’s paper advocates is that results that currently meet the .05 level should be called “suggestive,” and only those that reach the stricter standard of .005 should be called “statistically significant.”

Historians of science are always quick to point out that Ronald Fisher, the UK statistician who invented the p-value, never intended it to be the final word on scientific evidence. To Fisher, “statistical significance” meant the hypothesis is worthy of a follow-up investigation. “In a way, we’re proposing to return to his original vision of what statistical significance means,” Daniel Benjamin, a behavioral economist at the University of Southern California, says.

Under the proposal, publishing “statistically significant” results would become much harder: labs would need to increase the number of participants in their studies by about 70 percent. “The change essentially requires six times stronger evidence,” Benjamin says.
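A figure close to 70 percent can be reproduced with a standard power calculation. The sketch below is only one plausible route to that number: it assumes a two-sided z-test, a fixed true effect size, and 80 percent power, none of which is stated in the article. The function name and its parameters are illustrative.

```python
from statistics import NormalDist


def sample_size_ratio(alpha_old=0.05, alpha_new=0.005, power=0.80):
    """Ratio of sample sizes needed at two significance thresholds.

    Assumes a two-sided z-test with the same effect size and the same
    power at both thresholds (assumptions made for this sketch).
    """
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    z_power = z(power)
    # Required sample size is proportional to (z_alpha/2 + z_power)^2.
    old = z(1 - alpha_old / 2) + z_power
    new = z(1 - alpha_new / 2) + z_power
    return (new / old) ** 2


ratio = sample_size_ratio()
print(f"Participants needed: {ratio:.2f}x, i.e. about {(ratio - 1) * 100:.0f}% more")
```

Under these assumptions the ratio comes out near 1.7. Changing the assumed power shifts the result somewhat, so the 70 percent figure is best read as a rule of thumb.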

Placing p-values under the microscope

If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. So when p is small, researchers start to think the null hypothesis looks improbable, and they take a leap to conclude that their [experimental] data are pretty unlikely to be due to random chance.
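To make "would rarely occur by chance alone" concrete, here is a minimal sketch of a permutation test, one of several ways to compute a p-value. It is not the method used in Johnson's paper. The measurements are invented for illustration, and the function name is hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(test, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the pooled data and count how often a random split produces a
    difference at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(fmean(test) - fmean(control))
    pooled = list(test) + list(control)
    n_test = len(test)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_test]) - fmean(pooled[n_test:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Invented measurements, for illustration only.
test_group = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7, 6.4, 6.0]
control_group = [5.0, 5.2, 4.9, 5.6, 5.3, 5.1, 4.8, 5.4]
print(f"p = {permutation_p_value(test_group, control_group):.4f}")
```

A small result here says only that random relabeling rarely produces a gap this large. It does not say how likely the null hypothesis itself is.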

Researchers can never completely rule out the null (just like jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold where they feel pretty confident that they can reject the null. For many disciplines, that’s now set at less than .05.

A p-value of .05 means that if you ran the experiment 100 times (again, assuming the null hypothesis is true), you’d expect to see these same numbers (or more extreme results) about five times.
A p-value less than .05 does not mean there’s less than a 5 percent chance your experimental results are due to random chance. It does not mean there’s only a 5 percent chance you’ve landed on a false positive.
A p-value of less than .05 means that there is less than a 5 percent chance of seeing these results (or more extreme results) in a world where the null hypothesis is true.
Studies that yielded highly significant results at p < .01 are more likely to reproduce than those that are significant at the p < .05 level.
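The distinction drawn above, between the chance of the data given the null and the chance that a significant finding is a false positive, can be illustrated with a short calculation. This is a sketch under loudly stated assumptions: the share of tested hypotheses for which the null is true (90 percent) and the power (80 percent, held fixed at both thresholds) are invented for illustration and do not come from the article.

```python
def false_positive_share(alpha, prior_null, power):
    """Share of 'significant' results that are false positives.

    alpha      -- significance threshold (chance of p < alpha when the null is true)
    prior_null -- assumed fraction of tested hypotheses where the null is true
    power      -- assumed chance of p < alpha when the null is false
    """
    false_pos = alpha * prior_null          # true nulls that cross the threshold
    true_pos = power * (1 - prior_null)     # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs: 90% of tested nulls are true, power is 80%.
for alpha in (0.05, 0.005):
    share = false_positive_share(alpha, prior_null=0.9, power=0.8)
    print(f"threshold {alpha}: {share:.0%} of significant results are false positives")
```

With these made-up inputs, roughly a third of results significant at .05 would be false positives, compared with about one in twenty at .005. The exact figures depend entirely on the assumed prior and power, which is why the threshold alone cannot be read as a false-positive rate.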