Jim Delaney points to this tutorial by F. Perry Wilson on why the use of a “p less than 0.05” threshold does not imply a false positive rate of 5%, even if all the assumptions of the model are true. This is standard stuff but it’s always good to see it one more time.

Delaney writes:

I [Delaney] think it is interesting. While it seems to utilize an over-simplified, idealized experimental setting where the null and alternative are two mutually exclusive and exhaustive points and the test statistic is also Bernoulli, I think the illustration is very useful. Two important Bayesian concepts are illustrated: (1) we have priors and we should use them; (2) that “yesterday’s posterior is today’s prior” offers a coherent scheme for practicing “cumulative” science.

Anyhow, I probably have some minor quibbles with the piece, e.g. it seems like the author refers to both P(X=1|H0) and P(H0|X=1) as “false positive rates,” and there is no explanation of why .05 is still considered a meaningful threshold for P(H0|X=1). But most importantly, there isn’t any acknowledgement that the setting described is highly idealized and that the use of Bayes’ Rule isn’t going to be nearly as simple and straightforward in a realistic setting. It should also be acknowledged that it is in those more realistic settings that the problems rife in the reporting of p-values will invalidate their use in such a calculation.

But overall, I think this is an interesting illustration that seems effective for helping the intended audience gain exposure to some important concepts. Hopefully, then they will recognize that the setting is idealized, and know to consult a statistician to help them deal with the more realistic settings.

I really don’t like the whole false-positive, false-negative framework. That said, given that people are thinking that way, it’s good to be clear, so sure, ok.

Does “consult a statistician” fix the problems? Did Cuddy / Bem / Baell just skip the pro-statistician step? If so, all the journals need do is employ staff statisticians.

The profession internally seems in a state of flux / indecision.

I don’t think giving statistics a more central role is a good thing, this seems to lead to minimally informative studies about whether changing the conditions changes the result. Researchers should just stop doing those types of studies. Instead they should look for consistent patterns, guess a few explanations for those patterns, deduce other predictions from those guesses, and see which explanation most accurately, precisely, and simply explains what is going on. The role of statistics is to quantify the uncertainty we have about the observations and parameters of the model.

The “consult a statistician” struck me that way as well. I am not a statistician (and realize every day how deficient I am as a result), but the idea that only professionally trained statisticians should do statistical work I believe to be false. As important as statistical training and understanding is – and it is growing – the world is leaving statisticians behind in some important ways. For example, the whole discussion of p values misses the fact that much of the data analysis being conducted today (by non-statisticians mostly) uses techniques where p values are never mentioned (neural networks, random forests, boosted trees, etc.). This does not mean statistics is not necessary or not useful, but many people are bypassing the step of “consulting a statistician.”

There are all sorts of reasons for this – mostly bad reasons, probably. But I think it is wrong to assume that it is the world at fault and not statisticians themselves. Something is amiss if so many people find it expedient to bypass that step. Part of the reason is that there are too many uses of data and too few statisticians you can trust to really help with what you need. Part of it is because statisticians often focus on issues that are not that relevant for practicing data analysts. I fully acknowledge that part of their irrelevance is due to practicing data analysts not wanting to be bogged down by worrying about things they should be worried about – like getting work published, in the media, etc. But the fault is not all on one party here – in my view, statistics (speaking generally now) has not done a very good job of making clear what really matters and what may not. Add in the complexity of the subject and it is easy to see why people might not choose to consult a statistician.

The history of academic psychology casts doubt on this view. Paul Meehl explained with wonderful clarity, and based on substantive expertise in the field, the statistical issues that “really mattered,” and that were dragging psychology down, 30 years ago — and no one listened. We periodically see defenders of NHST psychology engage in the now-decades-old debates with Andrew at this blog to this day. It seems as though it is human nature to really, really not want to “be bogged down” on the way to exciting discoveries by actually hearing what the statisticians are saying.

“It seems as though it is human nature to really, really not want to “be bogged down” on the way to exciting discoveries by actually hearing what the statisticians are saying.”

+1

Kyle:

Yes, but in the old days the classical statisticians were dominant and they could treat the Meehls of the world as eccentric cranks. Now the classical statisticians are on the run and they realize that researchers outside of the Gladwell-Cuddy-Gilbert axis don’t trust p-values anymore.

Andrew,

I didn’t realize Meehl was not in step with the mainstream statisticians of his day, that’s very interesting. I thought he was just begging psychologists to take conventional statistics seriously. In hindsight his philosophical articles read like conventional wisdom.

>”I thought he was just begging psychologists to take conventional statistics seriously.”

Can you provide a quote that gave you this impression? In his writings Meehl equates (somewhat unfairly imo) conventional statistics with “Fisherian” statistics/reasoning. This is just one example of a theme repeated for basically his entire career:

“I suggest to you that Sir Ronald has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories in the soft areas is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology.”

http://ww3.haverford.edu/psychology/ddavis/psych212h/meehl.1978.html

Well I’m a layperson, an outsider. But, for example, in passages like this —

“… Even though it is stated in all good elementary statistics texts, including the excellent and most widely used one by Hays (1973, 415-17), it still does not seem generally recognized that the null hypothesis in the life sciences is almost always false—if taken literally—in designs that involve any sort of self-selection or correlations found in the organisms as they come, that is, where perfect randomization of treatments by the experimenter does not exhaust the manipulations. Hence even “experimental” (rather than statistical or file data) research will exhibit this …. Consequently, whether or not the null hypothesis is rejected is simply and solely a function of statistical power.

“Now this is a mathematical point; it does not hinge upon your preferences in philosophy of science or your belief in this or that kind of theory or instrument…. [T]he region of the independent variable hyperspace in which the levels of a factor are chosen is something **Fisher didn’t have to worry much about in agronomy, for obvious reasons**; but most psychologists have not paid enough attention to Brunswik on representative design….”

I thought he was making the Gelmanian point that the social sciences involve irreducible variation and confounding correlations (and questions for which there is no “true” point estimate) that Fisher simply never anticipated in agronomic research.

Sorry, the quote is from this:

http://www.psych.umn.edu/people/meehlp/WebNEW/PUBLICATIONS/128SocScientistsDontUnderstand.pdf

Right above your quote there is this:

“Thesis: Owing to the abusive reliance upon significance testing—rather than point or interval estimation, curve shape, or ordination—in the social sciences, the usual article summarizing the state of the evidence on a theory (such as appears in the Psychological Bulletin) is nearly useless…Colleagues think I exaggerate in putting it this way. That’s because they can’t stand to face the scary implications of the thesis taken literally, which is how I mean it.”

He pretty clearly states he thinks application of conventional statistics (ie NHST) has rendered journal articles in his field nearly useless. I’m not sure how that could be interpreted as “begging psychologists to take conventional statistics seriously”.

I don’t have access to the Hays text, but searching for the reference I found that Gigerenzer et al[1] seem to say the book mentioned a “null range” in an appendix, and Amazon reviewers claim the 5th edition is riddled with typos[2]. I guess it is possible Meehl was being sarcastic, but that isn’t really his style… Without the text I don’t know what he was referring to exactly; it is possible there is just little about NHST in that book. Brunswik doesn’t seem to be a statistician[3], but I am also unfamiliar with his work.

[1] http://faculty.washington.edu/gloftus/Downloads/CPChance.pdf

[2] http://www.amazon.com/Statistics-William-Hays/dp/0030744679

[3] https://en.wikipedia.org/wiki/Egon_Brunswik

And even in the paper you linked, Meehl writes, “I believe [it] is **generally recognized by statisticians today** and by thoughtful social scientists [that] the null hypothesis, taken literally, is always false. I shall not attempt to document this here, because among sophisticated persons it is taken for granted.”

I’ll respond to Anoneuoid here because I can’t nest any further:

I construed EVERY passage we’ve both quoted to mean, “Just go ask a stats prof if NHST makes sense in psychology. They will say it does not. Statisticians know this. Psychology researchers need to learn it.” Even the passage about “abusive reliance on significance testing,” I thought meant abuse by the psychology field AGAINST the tenets of the statistics field.

Apparently I was wrong, and a lot of stats profs would have said that what the psych profs were doing was just fine. I did not know that. But I don’t think there’s any great mystery about how I could reach that conclusion, since Meehl continually cited standard stats texts to support his critique. If he was unduly mean to Fisher, I’m sorry, but I didn’t think Fisher was still the sum and substance of mainstream statistics in the 1980s.

Kyle:

My colleagues at Berkeley back in the 1990s were definitely mainstream and they bought into all that null hypothesis testing stuff. But forget about them, just consider almost every intro statistics textbook sold today: Null hypothesis significance testing is right there. Today’s textbooks represent yesterday’s consensus.

Andrew,

Not to beat the proverbial horse, but I had the impression that the “real stats” profs taught their students about all the limitations of NHST that we never heard about in Stats for Sociologists (which was, in the 1980s, essentially a course in p hacking). Sad to hear that’s not the case. Thanks for interacting.

Kyle:

At a local university, in the intro stats course given by the statistics dept, the answer key for the final exam gave “the p-value is the probability of the null hypothesis being true” as the correct answer, as did the textbook for the course, though only after stating it less wrongly.

Re Keith’s comment: Sad, sad, sad.

Keith: Here’s my reaction.

A company has developed a new pill that floats in the stomach with some clever technique, and so creates a slower release of the drug. Planning the study Bayesian style:

Stat: What is the prior assumption about the floating time of your pill compared to the control? How likely do you think it is that your pill floats longer than control, and in what range?

Company: We are 90% sure that floating time will be a factor of 1.5 to 2.5 higher.

Stat: That’s an unfair advantage for your product.

Company: Yes, but we are sure. We have tested the product in water, and it works.

How would you argue? We do not have millions of data on female/male ratio or voting patterns in other counties.

Not being an expert in the area, I would want to know more information:

1. How similar is the decomposition time of the substrate being used in water versus stomach contents?

2. Is there a better simulated environment than water?

In the absence of an answer to 1 that is “Very similar” and to 2 that is “There is no better simulated environment than water,” together with substantial evidence to support these claims, I would argue for the next step to be finding a simulated environment that is of higher fidelity.


For a start you could try distinguishing evidence and belief. This is, e.g., Mike Evans’ approach (though the mistakes are all mine).

E.g., the evidence provided by Y0 for H0:

P(H0|Y0)/P(H0)

Similarly for H1

P(H1|Y0)/P(H1)

Evidence from a given experiment is thus related to *change* in belief and your prior ‘divides out’. Your resultant belief will of course depend on your prior but again that is to be distinguished from how much this particular experiment influenced that resultant belief.

So you could say ‘sure you might have strong reasons to favour your product, but they aren’t warranted by this experiment alone’.

I think we often gloss over some important steps when we immediately focus the discussion on the conditional probabilities. There seems to be a lot of misunderstanding around the connections between the following:

1. part of the world under study

2. method for extracting data about it

3. data produced

4. transformations and adjustments applied to data

5. analysis methods applied to identify patterns

6. description of information about the world provided by analytic results

7. integration and differentiation with alternative explanations of results

Totally agree. I was just trying to clarify the formal question about priors and evidence, but in general systematic modelling, data analysis and domain knowledge will likely trump choice of specific tools.

Ideally (to me) these will be formalised as much as possible too though, to allow systematic criticism. Hierarchical modelling and formalised EDA approaches have a lot of appeal for me in this respect. Just like one can hide behind informal (and incorrect) use of priors, one can hide behind informal (and imprecise) theories of the world, data processing procedures etc etc.

Note, though, that how much “evidence” is provided by Y0 (i.e. how much your belief changes) depends very much on what prior you have. The prior therefore does not really “divide out.” Indeed, there is no Bayesian concept of evidence that is prior-independent (aside from the Bayes factor in the case of simple statistical hypotheses, perhaps).

I should be more explicit since it differs from the usual case in subtle ways.

P(H0|Y0)/P(H0)

= [P(Y0|H0)P(H0)/P(Y0)]/P(H0)

= P(Y0|H0)/P(Y0).

i.e. it is essentially a more Bayesian way of formulating likelihoodism. The key difference is retaining the need for P(Y0) for comparison at H0 rather than using the relative comparison of H0 and H1 in which P(Y0) also cancels and leads to likelihoodism.
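As a quick numeric sanity check of that identity, here is a minimal sketch in a two-point setting; the priors and likelihoods are purely illustrative numbers, not from any real problem:

```python
# Two-hypothesis toy setting: verify P(H0|Y0)/P(H0) = P(Y0|H0)/P(Y0).
p_h0, p_h1 = 0.7, 0.3            # illustrative priors
p_y_h0, p_y_h1 = 0.10, 0.60      # likelihoods of the observed data Y0

p_y = p_y_h0 * p_h0 + p_y_h1 * p_h1   # P(Y0) by total probability
p_h0_y = p_y_h0 * p_h0 / p_y          # posterior P(H0|Y0) via Bayes' rule

ev_belief = p_h0_y / p_h0             # change in belief about H0
ev_likelihood = p_y_h0 / p_y          # likelihood-style form
print(ev_belief, ev_likelihood)       # both are about 0.4: Y0 is evidence against H0
```

Note that the two quantities agree for any choice of priors, but P(Y0) itself still depends on the prior through the total-probability sum, which is the point Olav raises below.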

I’m not sure I understand what you mean. P(Y0|H0)/P(Y0) still depends on the prior probability of Y0. And in order to calculate the prior probability of Y0, you will typically need to have priors over H0 and H1 as well. So P(Y0|H0)/P(Y0) is a prior-dependent quantification of the strength of evidence provided by Y0.

Yes, that’s true. I suppose I misleadingly implied that the priors have no effect whatsoever.

It certainly requires a prior over Y0 (which is often given by the predictive distribution using priors over H0 and H1, but doesn’t always need to be).

But this will be the same for both H0 and H1 and hence the relative comparison of evidence is just the likelihood ratio. The absolute evidence evaluation simply depends on how likely your observed data is (and of course this may follow from averaging over H0 and H1 models).

I’m sure you could bias things if you really tried. But the point is that it’s not as easy to bias the *evidence* (*change* in belief or relative *change* in belief) for or against H0 and H1 as it is to bias the absolute belief (posteriors) themselves.

See Mike Evans’ book/papers for better discussion than I can provide.

(also, this has what seems to me to be the interesting and natural implication that checks of bias and/or model checks in general are most naturally related to prior and posterior *predictive* checks a la Box, Gelman etc).

Bias in estimation is very different from bias in method. It is bias in method that is not typically encoded in a model, and it is bias in method that results in finding random patterns in noise and calling them a signal.

Curious:

Sure.

Again I’m trying to keep distinct my attempt to clarify the formal question about priors and evidence that I think Dieter was raising – e.g. can we just slap a prior on to get whatever conclusion we want – and broader methodological issues.

Wrt the latter – “in general systematic modelling, data analysis and domain knowledge will likely trump choice of specific tools” as above.

Sure, if you quantify the strength of the evidence as the ratio of posterior to prior, then this quantity will in some sense and in many cases be less sensitive to “bias” than the posterior will be. Note, however, that there are lots of different ways of quantifying strength of evidence as a difference between the posterior and the prior, and the measure used by Evans is just one of the many available options. There is a rather large literature on this topic in Bayesian philosophy. The different measures each have their strengths and weaknesses.

Olav – I think we agree :-)

Again my main point is directed towards questions like Dieter’s which seem to imply that we can get whatever conclusions we want by daring to use priors.

I think that

“if you quantify the strength of the evidence as the ratio of posterior to prior, then this quantity will in some sense and in many cases be less sensitive to “bias” than the posterior will be”

is a useful lesson in this particular regard, regardless of whether it solves all the remaining problems of inference.

Just for your information, Mike Evans has reviewed that rather large literature on this topic in Bayesian philosophy as well as methods to assess biases – a short overview is given here http://www.sciencedirect.com/science/article/pii/S2001037015000549 and a fuller account given in his book https://measuringstatisticalevidence.wordpress.com/

Dieter:

First, I’d ask the person for the evidence that he is 90% sure that floating time will be a factor of 1.5 to 2.5 higher. This evidence may come, for example, from a confidence interval from a published study, in which case I’d advise that it’s an overestimate because of the statistical significance filter.

Second, I’d lay out the costs and benefits. What happens if a study is designed and outcome X happens? If the [1.5, 2.5] estimate truly is reasonable, that implies some decision recommendations. It’s not about being fair or unfair to the product, it’s about wanting to use resources efficiently.

There were a few results from previous studies, which were used for an old-style power calculation. But the argument of the company was: We know our product is better; it is “incomparable” to the old stuff. So why should we pre-shrink our product in the protocol (that’s where I wanted to have it stated)?

In the end, the product turned out to be well within the range of controls with old-style confidence intervals. And I am sure that the company never would have used the prior-shrunken range had they marginally missed the magic 5%.

Dieter:

Yup, that’s how people learn. The unrealistic prior is more likely to lead to poor decisions. That’s how I’d like to frame it, not in terms of fairness to the product but in terms of the organization wanting to make the best decisions. If you really *do* know that the new product is better than what came before, then, sure, use that information in your decision making. But if that “knowledge” is really a false assumption, you’ll pay—which is what happens in any decision problem.

>”How would you argue?”

That is a meaningless number, because it is in water without telling us the temperature or anything else. What about the range of conditions actually expected in the stomach (pH, temperature, salinity, presence of a food bolus, etc.)? What type of errors may arise during manufacturing that affect this value? What is the effect of storage at the various temperatures/humidities that may be found in practice? Can you devise any conditions where your pill does not consistently float that much longer? What conditions (either manufacturing, storage, or patient characteristics) may lead to excessive float time?

Even given all these nuances, in real settings I rarely see experts agree anywhere close to a consensus about the prior. Estimates are all over the place.

The disagreement is not a problem, it accurately reflects that our understanding of such things is rudimentary. Keep doing studies designed to find out “if X changes does Y also change under specific conditions a,b,c” and this situation will continue for as long as society keeps feeding money into the NHST black hole.

Experts don’t have to agree. See, for example, the book Uncertain Judgements by O’Hagan et al:

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470029994.html

The problem is that typically we do not have neutral experts involved, but rather highly biased researchers who want to make their product “significant.”

Asking an expert is not just a matter of saying: “what do you think?” One has to engage in a highly structured elicitation process. The elicitor him-/herself needs to be well-trained in the process. See the SHELF framework (the Sheffield Elicitation Framework) for an example of how this is done.

And I doubt very much that most experts would immediately try to game the elicitation process to get their result to come out “significant”. I don’t know many (actually, I can’t think of any) people who do science maliciously in this manner; the mistakes they make are out of genuine ignorance. Even people like Amy Cuddy are not actively gaming the system; she just doesn’t know what she’s doing (and who can blame her, given the bizarre education one gets in statistics).

I am not talking about science in a purely academic environment, where the outcome, for better or worse, is some New York Times article because it sounds sensational (= far away from the prior).

Life as the statistical cheese in the hamburger of marketing folks, scientifically oriented industry researchers, and academic research is full of tasty surprises.

But that stuff about the elicitation framework seems pretty academic. Are there real studies that use it?

In practice people use priors where there’s no good reason to use this over that. Is there reason to believe results are not sensitive to the different priors from different experts?

Rahul wrote:

“But that stuff about the elicitation framework seems pretty academic. Are there real studies that use it?

In practice people use priors where there’s no good reason to use this over that. Is there reason to believe results are not sensitive to the different priors from different experts?”

Sure they do elicitation in practice, see:

Turner, R. M., Spiegelhalter, D. J., Smith, G., & Thompson, S. G. (2009). Bias modelling in evidence synthesis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 21-47.

Is there reason to believe results are not sensitive to priors? That would depend on how much data you have. When you’re stuck with a problem with little data, experts’ judgements become very important, and in those situations different experts’ priors will probably lead to different results. But that’s just quantifying your uncertainty, through sensitivity analyses, taking into account what you know. That seems a whole lot better than the alternative.

Dieter:

In this case you may have a principal-agent problem, and one improvement could be to connect the people who would pay the costs of a bad decision, to the people whose job is to make these modeling assumptions.

My “water” argument was a stupid simplification, sorry. The tests they made were much more elaborate and well founded from a technical point of view. The outcome of the study was nevertheless compatible with no effect.

For the purely formal question of how priors and evidence interact, see my answer above. Plug in some numbers if you want.

Who is in that picture? Seems out of place in an article about p-values. Lastly, why doesn’t Andy just advocate for a wholesale abandonment of NHST?

1. Galton was a hero to most.

2. I *do* advocate for a wholesale abandonment of NHST! I thought I’d made that clear.

I got the original reference but the Galton one…had to Google that!

False positive rates are routinely being used in opposed ways these days, thanks to some flawed reinterpretations pushed by those who wish to formulate statistical hypothesis appraisal in terms of diagnostic testing based on the prior prevalence of true nulls. This is causing conflicting definitions in official glossaries offered to guide the public, e.g., http://errorstatistics.com/2016/01/19/high-error-rates-in-discussions-of-error-rates-i/. If you’re going to use a diagnostic-testing model for statistical inference (a mistake in my judgment), at least stick to Ioannidis’ term: positive predictive value (PPV).

The false positive rate (conditional on p<0.05) of 36% arrived at in the article is entirely dependent on the choice of a background rate of 90,000 false hypotheses, 10,000 true hypotheses, and a test power of 80%. If you make the background rates equal, the false positive rate (conditional on p<0.05) becomes 0.0588. If you make the background rate for true hypotheses much higher than for false hypotheses, the false positive rate (conditional on p<0.05) drops to near zero. So, it all depends on one’s assumptions. If I were a medical researcher, I would hope that my superior intellect and years of study would give me a much higher background rate (for true hypotheses) than 10%. A rate of 10% would not earn me a place in the history of medicine.

I’m not defending the p-value here; I’m simply pointing out a potential flaw in the article.
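The arithmetic behind those figures is easy to reproduce. A minimal sketch, using only the counts and rates quoted above:

```python
# False positives among "significant" results, using the comment's numbers.
n_false, n_true = 90_000, 10_000   # background: false vs. true hypotheses tested
alpha, power = 0.05, 0.80          # significance threshold and test power

false_pos = n_false * alpha        # 4,500 true nulls rejected by chance
true_pos = n_true * power          # 8,000 real effects detected
fdr = false_pos / (false_pos + true_pos)
print(round(fdr, 2))               # 0.36, the article's 36%

# Equal background rates (50,000 each) give roughly the naive 5%:
fp_eq, tp_eq = 50_000 * alpha, 50_000 * power
print(round(fp_eq / (fp_eq + tp_eq), 4))   # 0.0588
```

Changing the background rates is all it takes to move this number anywhere between near zero and near one, which is exactly the sensitivity the comment describes.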

Peter:

As I wrote above, I don’t like the false-positive, false-negative framework. In most cases I think it’s meaningless to talk about these hypotheses as true or false. Treatments have effects that vary: a new treatment can be better for some people and worse for others. The discrete idea of hypotheses being true or false just leads to confusion, I think.

I think there’s always some confusion for the public (and others) because people talk about false positive and false negative in the context of research results but also in the context of an individual (for example) pre-employment drug screen or an HIV test. I don’t even think it makes sense to speak of the former as a false positive. For the latter it makes sense.

Elin:

I agree.

I do agree, but I’ve found it useful (following John Ioannidis’s 2005 paper) to use this as an illustration of why prior probabilities matter even (or especially) if you’re using a NHST framework. People can understand it because they are used to thinking of results in a binary way (treatment works/doesn’t work), flawed though that is.

Agree, but there are downsides to using a rather wrong model that helps folks understand from where they currently find themselves, conceptually inadequate for appreciating the less wrong models.

These helpful ladders, once climbed up, _need_ to be kicked aside.

But I agree, the downsides may be worth it, especially if folks are gently warned about them.

My experience (in the clinical trials world) is that the usual expectation is that the study will come out with a yes/no decision – a treatment “works” or it doesn’t. Using a single “primary outcome” and p<0.05 threshold makes it easy to dichotomise results in this way, and feeds the habit. Last week, I had an argument with a Professor of Medical Statistics on this very point. The trial in question plans to evaluate an intervention that can clearly have effects on several different and important outcomes, but my colleague still advocated picking a single primary outcome, so that it would be possible for the results to be easily classified as positive or negative.

” …but my colleague still advocated picking a single primary outcome, so that it would be possible for the results to be easily classified as positive or negative.”

Aargh! Not good from the point of view of a medical consumer!! Consumers and physicians need better information than a single “yes/no” decision, so we/they can make informed decisions in individual cases.