P-Values are not Error Probabilities (2003) [pdf] (www.uv.es)
132 points by gwern | 2015-04-06 | 36 comments




Thanks for sharing this. In general, it's a real shame how few people know how to interpret and use P values correctly. We work with a lot of businesses that ask us to compare populations (e.g., through A/B tests), and people either (a) don't care about the significance of population differences, or (b) are irrationally attached to P values.

Case in point #1: debating whether a P value of 0.051 versus 0.049 is a major difference in the significance test.

Case in point #2: a P value of <0.001 but with extremely low differences in means between populations. With enough data, everything is significant!
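
To make that concrete, here's a quick simulation sketch (illustrative numbers only, assuming numpy and scipy): with a million observations per group, a mean difference of about 1% of a standard deviation still comes out wildly "significant".

    # Illustrative sketch: a practically negligible difference in means
    # reaches p < 0.001 once the samples are large enough.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(loc=100.0, scale=15.0, size=1_000_000)
    b = rng.normal(loc=100.15, scale=15.0, size=1_000_000)  # true difference: 0.15, i.e. 1% of one SD

    t, p = stats.ttest_ind(a, b)
    print(f"difference in means: {b.mean() - a.mean():.3f}, p-value: {p:.2e}")
    # p lands far below 0.001 even though the effect is practically meaningless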

End rant. :)


Amen to #2! It's like the entire field of biology is an endless quest for low p-values, regardless of whether the hypothesis is even interesting. It's as if people aren't interested in thinking; they just want to publish something significantly different.

haha, Sociology as well -- especially now that the web provides huge amounts of behavioral data.

I much prefer how machine learning folks tend to focus on predictive accuracy, though I guess that's not quite the same as understanding relationships between specific variables while controlling for others.


There is a notion of 'feature importance,' which comes up especially with decision trees and random forests, giving a sense of how much a particular feature contributes to the overall prediction. It seems like combining predictive power with feature importance would be an interesting alternate route to demonstrating important correlations. (For example, maybe a model predicts lung cancer with 90% precision, and 'is_smoker' has an 80% feature importance.) Of course, these importances depend a lot on the other features used by the model! If you include a lot of junk features and/or exclude other important features, the importance of your pet feature will shoot up.
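
As a rough sketch of what that combination might look like (entirely made-up data and feature names, using scikit-learn's impurity-based importances):

    # Rough sketch with synthetic data: report predictive accuracy alongside
    # per-feature importances from a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    n = 5000
    is_smoker = rng.integers(0, 2, size=n)
    age = rng.normal(55, 10, size=n)
    junk = rng.normal(size=(n, 5))                       # irrelevant noise features
    risk = 0.05 + 0.25 * is_smoker + 0.002 * (age - 55)  # made-up risk model
    y = rng.random(n) < risk

    X = np.column_stack([is_smoker, age, junk])
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    names = ["is_smoker", "age"] + [f"junk_{i}" for i in range(5)]
    print("test accuracy:", round(clf.score(X_test, y_test), 3))
    print("importances:", dict(zip(names, clf.feature_importances_.round(3))))

Impurity-based importances have their own quirks (they tend to favor high-cardinality features), so they are a rough guide rather than a drop-in replacement for a p-value.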

Hmm, interesting -- I never considered the idea of including junk features to bias a model's preferences toward whatever theoretically ambiguous idea you're trying to promote. That's actually brilliant.

Shameless plug: I'm a co-author of a method that adds artificial junk features and iteratively removes original ones that are likely nonsense, to approximate the set of all features relevant to the problem (rather than the standard "build the best model" approach, which can be pretty deceiving). https://m2.icm.edu.pl/boruta
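
For anyone who wants to try it from Python, here's a minimal usage sketch of the separate BorutaPy port (the link above is to the original implementation); X and y below are just placeholders.

    # Minimal usage sketch of the BorutaPy port; X and y are placeholder arrays.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    rng = np.random.default_rng(0)
    X = rng.random((500, 10))                        # placeholder data, 10 candidate features
    y = (X[:, 0] + 0.5 * X[:, 1] > 1.0).astype(int)  # only the first two features matter

    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    selector = BorutaPy(rf, n_estimators='auto', random_state=1)
    selector.fit(X, y)                               # compares real features against shuffled "shadow" copies

    print("confirmed relevant:", np.where(selector.support_)[0])
    print("tentative:", np.where(selector.support_weak_)[0])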

Shoes reduce cholesterol.

And aren't ice cream sales dependent on murder rates?

Did you just discredit Freakonomics?

> they just want to publish something

What's going on in academia is a classic example of perverse incentives. If your career depends on churning out soul-crushingly bad or insignificant papers that fulfill some superficial criteria of "scienceness", then that's what you'll do. It's dangerously close to cargo cult science.


> are irrationally attached to P values

This is actually a terrible problem. Most people, I think, do this because of nervous cluelessness: they are sure there is some theory that says what it means, but they aren't sure they understand it, so they insist on following some strict rules, whether they make sense or not.

I think the idea is that following a strict rule given in a textbook (even if it's wrong or inapplicable) makes people feel safer than trying to figure the rules out for themselves. At least if they follow a rule, they aren't personally responsible for the outcome.

It can sometimes be quite difficult to explain why a certain rule should not be applied, because the same reasons they didn't understand the rule in the first place also make it hard for them to follow explanations given by someone else.


> so they insist on following some strict rules, whether they make sense or not.

I did my bachelor's in business, and we were always told to make rational business decisions. That is, informed decisions based on data.

As for actual statistics, we were told two rules of thumb: you need at least 30 people in a group before you can do meaningful statistics, and it is significant if P < 0.005.


Unfortunately that advice is hugely misleading. The reason is that the p-value essentially tells you how repeatable your analysis will be, but doesn't say anything about how beneficial making the decision might be. Second, there is a relationship between how large the difference between outcomes is and how many people you need. So even if taking your conversion rate from 10% to 15% would be really helpful to your business, you might still need, say, thousands of people to notice that difference statistically.
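
For a rough sense of the numbers (a sketch only, assuming statsmodels; plug in your own rates and power):

    # Back-of-the-envelope sample size for detecting a 10% -> 15% conversion lift
    # at alpha = 0.05 with 80% power.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    effect = proportion_effectsize(0.15, 0.10)       # Cohen's h for the two conversion rates
    n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                               power=0.8, alternative='two-sided')
    print(round(n_per_group))                        # roughly 680 people per group

Shrink the lift to 10% -> 11% and the requirement grows to over ten thousand per group, which is the effect-size-versus-sample-size relationship above.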

> The reason is that the p-value essentially tells you how repeatable your analysis will be

Your statement contradicts the introductory paragraph of the article:

> "...the outcomes of these tests are mistakenly believed to yield the following information:... the probability that an initial finding will replicate; ..."

The p-value is the probability of seeing an effect at least as extreme as the one observed if chance alone were at work; it is not a measure of the repeatability of a result.


Ah yes, you're right! I made a statement about likelihood!

This was meant as an example of how crazily statistics is taught and why people end up just following strict rules.

One of my favourite go-to phrases in health care is: "Statistically significant, but a clinically insignificant difference."

I love it! That's actually a really good way of looking at it, right? Unfortunately for researchers, I'd argue that "Statistically significant, but a clinically insignificant difference." is equivalent to "Theoretically useless, and technically irrelevant." :)

In quant finance, my colleagues and I say, "OK, it's statistically significant, but is it economically significant?" I love that we have parallel sayings in totally different fields.

Right, you're describing statistical significance as it relates to practical significance.

Addressing that first point is "The Difference Between 'Significant' and 'Not Significant' is not Itself Statistically Significant" by Andrew Gelman and Hal Stern.

PDF link: http://www.stat.columbia.edu/~gelman/research/published/sign...


Nitpick: while that paper does mention the basic idea that there's no significant difference between 0.051 and 0.049, it's more about another common mistake: testing the difference between two groups by performing two separate statistical tests and then comparing the outcomes (a <> b), rather than the correct approach which is to do a single test for the difference between groups (a - b <> 0).
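
A small made-up example of the distinction (a sketch, assuming scipy):

    # Test the difference between the groups directly, rather than comparing
    # two separate per-group test outcomes.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    a = rng.normal(0.30, 1.0, 100)          # group A: "significant" vs. 0 on its own
    b = rng.normal(0.15, 1.0, 100)          # group B: "not significant" vs. 0 on its own

    p_a = stats.ttest_1samp(a, 0).pvalue
    p_b = stats.ttest_1samp(b, 0).pvalue
    p_diff = stats.ttest_ind(a, b).pvalue   # the test that actually answers "do A and B differ?"

    print(f"A vs 0: p={p_a:.3f}, B vs 0: p={p_b:.3f}, A vs B: p={p_diff:.3f}")
    # A can be "significant" and B "not significant" while A vs B is nowhere near significant.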

TLDR:

"p’s and a’s are not the same thing; they measure different concepts"


https://www.youtube.com/watch?v=5OL1RqHrZQ8 is a good demonstration of this.

"I use pictures from the ESCI software to give a brief, easy account of the Dance of the p Values. The simulation illustrates how enormously and disastrously variable the p value is, simply because of sampling variability. Never trust a p value!"


A good read. I remember asking my stats professor in undergrad about the "whys" of the various hypothesis testing schemes we use, and he eventually just told me that I should take a math-stats class in graduate school. I did that (and it was pretty enjoyable), but it certainly raised just as many questions as it answered, when it comes to the null testing --> hypothesis rejection ritual that scientific disciplines have converged on!

This is exactly why, when working with my team on A/B testing, I'm always careful to use the phrase _meaningful_ difference (or not). With internet-based tests the volume of participants can be huge, so finding statistical significance is a hell of a lot easier than finding a meaningful difference. I like to say, "As N tends to infinity, we are guaranteed to find significance."

The sad thing is that there are more testing platforms than I can count on one hand that introduce this p-value fallacy to users. They encourage things like repeatedly checking tests. I don't believe users realize that the running p-value oscillates around alpha: if your test isn't significant at the moment, just check a few hours later (it likely will be). Even with Bayesian methods, I've learned the hard way that you really have to be patient and let certainty accrue. More and more I'm led towards bandit methods for this reason. When sampling real-time data, you're in effect treating your historical sample as one part of a population whose other part still lies in the future; that's a pretty dangerous assumption.

My solution has been to guide my team towards large tests and towards working to _really_ move the needle (and so have high statistical power). This is where statistical analysis has the best chance of providing certainty. In general, with web experiments it's best to proceed with a healthy amount of humility.
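
The "just check again later" problem is easy to see in a simulation (a sketch only, assuming scipy; batch size and number of checks are illustrative):

    # Under the null (no real difference), checking the test repeatedly as data
    # arrives and stopping at the first p < 0.05 inflates the false positive rate.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n_experiments, checks, batch = 1000, 20, 100
    false_positives = 0

    for _ in range(n_experiments):
        a, b = np.empty(0), np.empty(0)
        for _ in range(checks):
            a = np.append(a, rng.normal(0, 1, batch))   # both arms identical: the null is true
            b = np.append(b, rng.normal(0, 1, batch))
            if stats.ttest_ind(a, b).pvalue < 0.05:
                false_positives += 1
                break

    print("false positive rate with peeking:", false_positives / n_experiments)
    # Lands well above the nominal 0.05 (roughly 0.2-0.25 with 20 looks).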

I found this paper to be quite interesting, but I have two issues with it.

1. At least at the beginning, it focuses excessively on the historical aspects of statistics. For example, it says that "most applied researchers are unmindful of the historical development of methods of statistical inference, and of the conflation of Fisherian and Neyman–Pearson ideas." To me, statisticians shouldn't /have/ to understand the history at all. For example, as a physicist, there is absolutely no need for me to understand the evolution of Ampère's theories, Faraday's theories, Maxwell's theories, etc. to apply the laws of electricity and magnetism correctly.

2. The difference between p and alpha is central to the paper, but it doesn't seem to have a cogent explanation of what that difference is. (It's very clear who advocated for one and who advocated for the other, but that's not why the difference is important.)


1. I would argue that this is a paper that intentionally goes into detail with regard to the history of statistics. After all, the main problem here is not how to compute p-values, nor the choice of threshold (well, that may very well be a problem, but it's a different one), but rather the interpretation, and history is often important for interpretation, even in physics. If one wants a really compact way to convince people that the way they are thinking about p-values is wrong, I'd go with the point raised by Steven Goodman in "A Dirty Dozen: Twelve P-Value Misconceptions": since the p-value is computed under the assumption that the null hypothesis is true, it cannot be the probability that the null hypothesis is true, by definition.

2. As far as what the difference is, I have to admit I have not found a memorable phrase to explain it. Both Goodman in "Toward Evidence-Based Medical Statistics" and this paper give a similar wording of the difference, which I would rephrase as follows.

The p-value works by inference, taking the data and assigning a probability to that data, not to a hypothesis. It is to be used informally, to corroborate our disbelief in the null. To use inference to test hypotheses one must use Bayes' theorem, i.e. Bayes factors, and introduce prior probabilities. Hypothesis testing à la Neyman and Pearson is a deductive process in which one does not assign probabilities to the hypotheses, paying the price of only being able to minimize the errors committed, not to draw inferences.

One should also confront "In Fisher’s approach the researcher sets up a null hypothesis that a sample comes from a hypothetical infinite population with a known sampling distribution." with "Neyman–Pearson results are predicated on the assumption of repeated random sampling from a defined population."

Given the natural predisposition of students to work in an inferential manner, it may be wise to bite the bullet and teach Bayes factors instead of the p-value cargo cult (this being a pedagogical choice, not an assessment of frequentism vs. Bayesianism or a critique of p-values as conceived by Fisher).
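
As a toy illustration of the pedagogical point (my example, not from the paper or from Goodman): 60 heads in 100 flips, testing H0: theta = 0.5 against H1: theta uniform on (0, 1), where the Bayes factor has a closed form (assuming a recent scipy).

    # Toy example: put a p-value next to an analytic Bayes factor.
    from scipy import stats

    n, k = 100, 60
    p_value = stats.binomtest(k, n, 0.5).pvalue   # two-sided p, about 0.057

    m_h0 = stats.binom.pmf(k, n, 0.5)             # marginal likelihood under H0: theta = 0.5
    m_h1 = 1.0 / (n + 1)                          # Binom(k | n, theta) integrated over a Uniform(0, 1) prior
    bf_01 = m_h0 / m_h1

    print(f"p-value: {p_value:.3f}, BF_01: {bf_01:.2f}")
    # The p-value hovers near the 0.05 threshold, while the Bayes factor (~1.1)
    # says the data barely discriminate between the two hypotheses.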


The difference between p values and alpha levels is a bit subtle, and when I first read this paper (while preparing my book, Statistics Done Wrong) it took me a while to figure out.

Here's the idea. If you set alpha = 0.05, you will declare statistically significant any result that gets a p value of 0.05 or less. When there is no statistically significant difference to be found, you will have a 5% chance of falsely detecting one.

But crucially, this applies on average to all tests you conduct with this alpha level. Even if an individual test gets p = 0.000001 or p = 0.04, the overall false positive rate will be 5%.

More succinctly, it doesn't make sense to ask for the false positive rate of a single test. What does that even mean? You can only ask for the false positive rate of a procedure you use many times. So you can't get p = 0.01 and declare this means you have a false positive rate of 1%.
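
A simulation sketch of that point (assuming scipy):

    # Alpha is a property of the procedure, not of any single test. Under the null,
    # ~5% of tests cross alpha = 0.05, and the p-values within that 5% range from
    # just under 0.05 down to tiny values.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    p_values = np.array([
        stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
        for _ in range(10_000)                  # the null is true in every experiment
    ])

    print("fraction declared significant:", (p_values < 0.05).mean())   # ~0.05
    print("smallest p among false positives:", p_values[p_values < 0.05].min())
    # Getting p = 0.01 in one particular test does not make that test's false
    # positive rate 1%; the 5% describes the procedure.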


Possibly worth clarifying: the false positive rate (https://en.wikipedia.org/wiki/False_positive_rate) is "probability that a test will return positive, conditional on the hypothesis being false". It's the rate of false positives within the set of negatives, not the rate of false positives within all tests.

I just finished a pretty interesting book on this topic:

"The Cult Of Statistical Significance":

http://www.amazon.com/Cult-Statistical-Significance-Economic...

It basically goes through a bunch of examples, mostly in economics but also in medicine (Vioxx), where statistical significance has failed us and people have died for it. As someone who works with statistics for a living, I found the book interesting, but it was pretty depressing to find out that most scientists are using t-tests and p-values because that's the status quo and the easiest way to get published. The authors suggest a few alternatives: publishing the size of your coefficients and using a loss function. In the end, they make the point that statistical significance is different from economic significance, political significance, etc.


Here's an interesting paper on the prevalence of these misconceptions in both students and teachers (at least in Germany).

http://myweb.brooklyn.liu.edu/cortiz/PDF%20Files/Misinterpre...

TLDR: 80% of methodology instructors have a misconception about significance. Scientific psychologists and students perform even worse.



Perhaps the FDA (or NIH?) should employ statisticians to evaluate claims in medical journals where the stakes are potentially higher.

That's why significance and hypothesis testing should die out and be replaced by Bayesian inference.
