Biofinysics: Why it burns when you P.

The smaller the p-value, the better the result. True or false?

If you answered "True", then that is why it burns when you use and interpret p-values. It hurts us all. Stop it.

The p-value seems to have become the judge, jury, and executioner for scientific results. As I go further into my PhD, I see just how in love we are with p-values, along with all the signs of an abusive relationship. No matter how many horrors the p-value wreaks upon us, we just know it won't do it again. Plus, no matter how much we misuse and manipulate the p-value, it just won't leave us. It's here to stay and that's love … right?! Love or hate, it is pretty hard to imagine a world without p-values, but such a world did exist: p-values have been around for less than a century. And before that: nothing was statistically significant! Yet scientists, including Darwin, Mendel, Newton, and Einstein, were somehow able to test hypotheses, come to conclusions, and develop theories anyway.

Google Ngram snapshot Feb 17, 2014.
Side note: It is not strictly true that the world was completely devoid of significance before the p-value. For example, John Arbuthnot seems to have stumbled upon a significant result in 1710, and the term "statistically significant" seems to have come up around the year 1600 according to Google Ngram (above). However, "p-values", "statistical significance", and "statistically significant results" in their modern incarnations are a product of the 1920s.

So who invented the p-value, what was it originally used for, and how is it used in modern data analysis? Could science and data analysis exist without p-values once again? Should they? Many of you (i.e. my readers, i.e. referrer spam bots) might have felt a sense of shock at the thought that the almighty p-value ever had an inventor. What mere mortal could have created the mighty metric for "significance" itself? Rather than tell you, I encourage you, Spam Bot, to read "The Lady Tasting Tea" by David Salsburg. One result of inventing the p-value that is almost certainly statistically significant is the number of scientific papers that have included the p-value since. More interesting is that the p-value has become so omnipresent, so omnipotent, so mythological and magical that few, if any, cite its mere mortal inventor when they use it. If everyone did cite R. A. Fisher, he would quite possibly be the most cited scientist ever by no small margin (okay, I told you, Spam Bot ... but I still think you should read that book).

The p-value has been a source of controversy ever since its invention. Unfortunately, the controversy has mostly gone on in the coolest statistics circles, which it is statistically unlikely that you or any of your ancestors were ever in. Good news! Due to increasingly bad science -- perhaps a consequence of the rise, abuse, misuse, and generally poor understanding of p-values -- the controversy has reached the foreground.

There are already plenty of people that have provided prose preaching about the promiscuous rise and predicted plummet of the p-value's popularity and that promote posterior probabilities and other alternatives such as reporting effect sizes and confidence intervals. So I do not want to do my own version of that. See the end for recommended books and articles covering these issues. Rather, I want to ask you a few questions (feel free to answer in the comments section). This is not a comprehensive set of questions, but they are questions that I think any scientist/data analyst should have their own thoughts on. In fact, it would be amazing if average Joe Citizen had thoughts on these questions about the almighty p-value and related topics such as the False Discovery Rate, reproducibility, and validity. All of us are blasted in the face with statistically significant results on a daily basis. Popular views on whether eggs and coffee are good or bad for your health change all the time because of statistical significance. So, which is it: good or bad?!?! Can statistical significance ever answer those health questions definitively?

Take a moment to think about each of the following questions. There are some starting points to explore these topics afterward, but mostly I expect that you know how to use Google. For the uninitiated, remember that p-values get smaller with "higher significance" -- hence the tempting, but horribly wrong conclusion of "the smaller the p-value, the better the result":

1. What is a p-value?

2. What does a p-value mean to you?

3. What does a p-value mean to your colleagues?

4. Are results better when the p-value is smaller (i.e. "more significant")?

5. What factors make a p-value tiny anyway?

6. When you are scanning a paper or report and see "highly significant" (tiny) p-values, how does that affect how you perceive the results?

7. When should you or anyone else report a p-value?

8. Is it always necessary to include a p-value?

9. What is an effect size?

10. Does a tiny p-value guarantee a large effect size?

11. Is it possible to get a highly significant (tiny) p-value with a tiny effect size?

12. How does "n" (number of data points) affect the p-value?

13. What is a null hypothesis? or what is the null distribution?

14. Is there more than one possible null distribution to use for a given test?

15. How do you pick a null distribution?

16. Is there a correct null distribution? or multiple correct null distributions ("nulls")? or mostly incorrect nulls? or do nulls have no inherent correctness at all?

17. What does the null distribution you use say about your data?

18. If your p-value is "highly significant" and lets you conclude that your data almost certainly did not come from your null distribution, a negative assertion, does that give you any positive assertions to make about your data?

19. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that mean the null hypothesis is therefore true?

20. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that imply that your alternative hypothesis is therefore false?

21. If your p-value is "significant", would it come out statistically significant again if you repeated the experiment? What is the probability it would be significant again? Is that knowable?

22. Is it possible to get a highly significant (tiny) p-value when your data have indeed come from the null distribution?

23. When you do 100 tests with a cut-off for significant p-values at 0.05, if all the tests are in fact from the null distribution, how many do you expect to be counted as significant anyway? -- i.e. how many false positive tests will there be out of the 100 when the significance cutoff is 0.05? If you do 2.5 million tests with a significance cutoff of 0.00001, how many false positives would you expect to come through?

24. When doing multiple tests in general, how can you limit the number of false positives that come through? How will that affect the number of true positives that come through?

25. What is the Bonferroni correction?

26. What is the False Discovery Rate (FDR)? …and what are q-values?

27. Does the False Discovery Rate control all types of false discoveries?

28. What types of false positives are not accounted for when controlling the FDR?

29. When reading a paper or report, if the FDR is tiny, how does that affect how you perceive the results?

30. What is Bayes' Rule?

31. What is a prior probability? a likelihood? a posterior probability?

32. What is a prior distribution?

33. What is a posterior distribution?

34. Compare null distributions to prior distributions.

35. Compare p-values and posterior probabilities.

36. Compare confidence intervals to credible intervals.

37. What is reproducibility? reliability? accuracy? validity?

38. What is the Irreproducible Discovery Rate (IDR)?

39. If a result is highly reproducible, does this ensure it is valid?

40. If a result is not reproducible, does this guarantee it is not valid?

41. If results are both reproducible and valid, does this guarantee the interpretation is correct?

42. Are the non-reproducible elements across replicates always "noise"? Or could they be signal?

43. Is it possible that the most reproducible elements across replicates are in fact the noise?
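
For anyone who would rather poke at questions 10 to 12 and 22 to 23 with code than with Google, here is a minimal Python sketch. It is an illustration under stated assumptions, not an answer key: it assumes a one-sample z-test with a known standard deviation, normally distributed data, and p-values that are uniformly distributed when the null is true. The function names (expected_false_positives, two_sided_z_pvalue, count_null_hits) are made up for this example.

```python
import math
import random


def expected_false_positives(n_tests, alpha):
    """Expected count of null tests with p < alpha.

    Assumes every test is truly null, so each p-value is uniform on [0, 1].
    """
    return n_tests * alpha


def two_sided_z_pvalue(sample_mean, sd, n):
    """Two-sided p-value for a one-sample z-test of 'true mean = 0'.

    Assumes the standard deviation sd is known (an illustrative simplification).
    """
    z = sample_mean / (sd / math.sqrt(n))
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def count_null_hits(n_tests, alpha, n=20, seed=1):
    """Run n_tests z-tests on pure noise and count how many have p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if two_sided_z_pvalue(sample_mean, 1.0, n) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    # Question 23: expected false positives when every test is null.
    print(expected_false_positives(100, 0.05))
    print(expected_false_positives(2_500_000, 0.00001))

    # Questions 10-12: the same tiny effect (0.01 sd) at growing sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(n, two_sided_z_pvalue(0.01, 1.0, n))

    # Question 22: "significant" results from data that really are null.
    print(count_null_hits(2000, 0.05), "of 2000 null tests had p < 0.05")
```

Under these assumptions, the expected counts work out to about 5 and about 25, and the p-value for the same tiny effect should shrink from roughly 0.9 at n = 100 to far below any conventional cutoff at n = 1,000,000. The pure-noise simulation should land somewhere near 5% of tests. What that says about "the smaller the p-value, the better the result" is left for you to decide.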

Books:

The Lady Tasting Tea: How Statistics Revolutionized Science In The Twentieth Century.

-- David Salsburg

-- This was a good read - a brief history of statistics, the birth and rise of statistical ideas, and the characters involved

In Pursuit of the Gene: From Darwin to DNA

-- James Schwartz

-- This is also a great history of statistics though not strictly relevant to the p-value topic. Think of it as the pre-P history. I mainly included it here because when I was done reading this, I read "The Lady Tasting Tea" and thought that in some ways it picked up where this book left off. Many of the great statisticians were interested in genetics - that is still true even today.

Mathematical Statistics

-- Wackerly, Mendenhall, Scheaffer

Articles (chronicles) concerning the p-value battles:

The Case Against Statistical Significance Testing -- Carver, 1978

Statistical ritual in clinical journals: is there a cure? -- Mainland, 1984

Significance Questing -- Rothman, 1986

Statistical Analysis and the Illusion of Objectivity -- Berger and Berry, 1988

The Earth Is Round (p < .05) -- Cohen, 1994

In Praise of the Null Hypothesis Significance Test -- Hagen, 1997

If Statistical Significance Tests Are Broken/Misused, What Practices Should Supplement or Replace Them? -- Thompson, 1999

Eight common but false objections to the discontinuation of significance testing in the analysis of research data. -- Schmidt, 1997

What If There Were No More Bickering About Statistical Significance Tests? -- Levin, 1998

The significance of "nonsignificance" -- Berry et al, 1998

P-Values and Confidence Intervals: Two Sides of the Same Unsatisfactory Coin -- Feinstein, 1998

On the Past and Future of Null Hypothesis Significance Testing -- Robinson and Wainer, 2001

Tests of significance considered as evidence -- Berkson, 2003

The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant -- Gelman and Stern, 2006

A Dirty Dozen: Twelve P-Value Misconceptions -- Goodman, 2008

What is the probability of replicating a statistically significant effect -- Miller, 2009

Weak statistical standards implicated in scientific irreproducibility -- Hayden, 2013

Scientific method: Statistical errors -- P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. -- Nuzzo, 2014

Other articles (some of which concern p-value battles -- e.g. motivation for Bayesian Inference):

Bayesian Statistical Inference for Psychological Research -- Edwards et al, 1964

Bayesian statistics in genetics: a guide for the uninitiated -- Shoemaker et al, 1999

The Bayesian Revolution in Genetics -- Beaumont and Rannala, 2004

Effect size, confidence interval and statistical significance: a practical guide for biologists -- Nakagawa and Cuthill, 2007

What to believe: Bayesian methods for data analysis -- Kruschke, 2010

A biologist's guide to statistical thinking and analysis -- Fay and Gerow, 2013

What is Bayesian statistics and why everything else is wrong -- Levine

Wiki pages:

Bayesian Inference

Frequentist Inference

p-value

multiple comparisons problem

posterior probability

sample size determination

the problem of "n" being too big (large sample sizes and small p-values)

effect size

Bonferroni correction

p-rep

statistical power

defining significance in terms of sigma

Other blogs, classes, etc that get into the topic of p-values -- for and against:

Everything Wrong With P-Values Under One Roof

On the scalability of statistical procedures: why the p-value bashers just don’t get it.

New book: Come on people, stop using p-values already

Are all significant p-values created equal?

The p-value is not . . .

Misunderstanding the p-value

Hypothesis Testing -- slides by Vladimir Janis

The Statistical Significance Testing Controversy: A Critical Analysis <-- R. Chris Fraley's syllabus for his class of that title. At a bare minimum, this is a very extensive set of references on these topics, organized into points such as why to abandon significance tests, why to keep them, effect size, statistical power, etc.

IDR - a very new statistical idea:

ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.

Measuring reproducibility of High-Throughput Experiments

Tuesday, February 18, 2014
