This article is a discussion of a paper by Greg Francis for a special issue, edited by E. J. Wagenmakers, of the Journal of Mathematical Psychology. Here’s what I wrote:

Much of statistical practice is an effort to reduce or deny variation and uncertainty. The reduction is done through standardization, replication, and other practices of experimental design, with the idea being to isolate and stabilize the quantity being estimated and then average over many cases. Even so, however, uncertainty persists, and statistical hypothesis testing is in many ways an endeavor to deny this, by reporting binary accept/reject decisions.

Classical statistical methods produce binary statements, but there is no reason to assume that the world works that way. Expressions such as Type 1 error, Type 2 error, false positive, and so on, are based on a model in which the world is divided into real and non-real effects. To put it another way, I understand the general scientific distinction of real vs. non-real effects but I do not think this maps well into the mathematical distinction of θ=0 vs. θ≠0. Yes, there are some unambiguously true effects and some that are arguably zero, but I would guess that the challenge in most current research in psychology is not that effects are zero but that they vary from person to person and in different contexts.

But if we do not want to characterize science as the search for true positives, how should we statistically model the process of scientific publication and discovery? An empirical approach is to identify scientific truth with replicability; hence, the goal of an experimental or observational scientist is to discover effects that replicate in future studies.

The replicability standard seems to be reasonable. Unfortunately, as Francis (in press) and Simmons, Nelson, and Simonsohn (2011) have pointed out, researchers in psychology (and, presumably, in other fields as well) seem to have no problem replicating and getting statistical significance, over and over again, even in the absence of any real effects of the size claimed by the researchers.

. . .

As a student many years ago, I heard about opportunistic stopping rules, the file drawer problem, and other reasons why nominal p-values do not actually represent the true probability that observed data are more extreme than what would be expected by chance. My impression was that these problems represented a minor adjustment and not a major reappraisal of the scientific process. After all, given what we know about scientists’ desire to communicate their efforts, it was hard to imagine that there were file drawers bulging with unpublished results.

More recently, though, there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors (consider, for example, the generally positive reaction to the paper of Ioannidis, 2005). In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.

. . .

Although I do not know how useful Francis’s particular method is, overall I am supportive of his work as it draws attention to a serious problem in published research.

Finally, this is not the main point of the present discussion but I think that my anti-hypothesis-testing stance is stronger than that of Francis (in press). I disagree with the following statement from that article:

For both confirmatory and exploratory research, a hypothesis test is appropriate if the outcome drives a specific course of action. Hypothesis tests provide a way to make a decision based on data, and such decisions are useful for choosing an action. If a doctor has to determine whether to treat a patient with drugs or surgery, a hypothesis test might provide useful information to guide the action. Likewise, if an interface designer has to decide whether to replace a blue notification light with a green notification light in a cockpit, a hypothesis test can provide guidance on whether an observed difference in reaction time is different from chance and thereby influence the designer’s choice.

I have no expertise on drugs, surgery, or human factors design and so cannot address these particular examples—but, speaking in general terms, I think Francis is getting things backward here. When making a decision, I think it is necessary to consider effect sizes (not merely the possible existence of a nonzero effect) as well as costs. Here I speak not of the cost of hypothetical false positives or false negatives but of the direct costs and benefits of the decision. An observed difference can be relevant to a decision whether or not that difference is statistically significant.

I’m more sympathetic to Francis’s point of view. Hypothesis testing seems to me reasonable for problem settings in which a binary yes/no decision is required. One should of course use one’s brain to interpret the results and be flexible relating to the cutoff values, and always pay attention to model assumptions. Effect size is important as you said; it is incorporated into the hypothesis testing mechanism since significance is a function of effect size. What researchers need to do better is to explain the reasoning around the minimum detectable effect size that is baked into the testing procedure.
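To make the “minimum detectable effect size that is baked into the testing procedure” explicit, here is a minimal sketch. It assumes a two-sided, two-sample z-test with a known common standard deviation; the function name and the example numbers are illustrative and not taken from the discussion above.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, sigma, alpha=0.05, power=0.80):
    """Smallest true mean difference that a two-sided two-sample z-test
    detects with the requested power (normal approximation, known sigma)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    standard_error = sigma * sqrt(2 / n_per_group)
    return (z_alpha + z_power) * standard_error


# Hypothetical numbers: outcome SD of 15, three different group sizes.
for n in (10, 100, 1000):
    print(n, round(minimum_detectable_effect(n, sigma=15), 2))
```

Reporting this number next to the test result tells the reader which effect sizes the design could realistically have picked up, rather than leaving it implicit in the choice of alpha and sample size.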

Social science studies seeking to explain behavior are totally different, and I agree that testing does not make much sense there.

Replicability is ultimately what we all want. Even in the binary setting, and even assuming that the same decision would never be repeated, we are aiming for replicability in the imaginative sense. If we were able to do it again, would it have given us similar results?

IMO, whether hypothesis testing is warranted is determined by whether the null distribution is well-defined under the study design, not simply whether the decision is binary or not. I think where inferences go off the rails is when the study design hasn’t eliminated “practical nulls” from explaining deviations from the null distribution that’s applied.

For observational data, unless one can make a strong argument for a natural experiment, I think it’s almost always impossible to eliminate all explanations except for a single one. Therefore, hypothesis testing is rarely warranted in observational settings. At most, it could be used in a descriptive sense – “we’ve eliminated at least one explanation for what we observe, but that’s it”. Unfortunately, the public health literature seems to be plagued with observational p-values.

Kaiser: This seems very close to Peirce’s allegory of playing cards with the devil, where losing leads to eternal damnation, no replay for anyone and no individual solace for any soul that reasonably played the odds. He roughly concluded it was good for the community as a whole and in turn for the community of science if they used intervals for unknowns with good odds of including the truth.

Sometimes one has to play with the devil, I guess.

Yes, the analogy is appropriate. In the business setting, we might run a small-scale test. If “successful”, we may decide to make a big investment, which would require signing contracts with vendors, building out infrastructure, hiring teams, etc., which are not 100% irreversible but close to irreversible.

If the end decision is binary, then even if I provide say a posterior distribution, we will need to use a cutoff in order to arrive at a decision. Any such cutoff would be arbitrary because the cutoff is really determined by people’s risk preference, or more likely, people’s selfish motives. It gets even more complicated because typically a group of people from different departments with different and conflicting objectives (i.e. prefer different cutoffs) have to agree on the cutoff. It is very difficult to get consensus when you walk in with a distribution, as opposed to a “recommended” cutoff (albeit based on some kind of arbitrary alpha).

The other characteristic of this type of decision is that the effect sizes should be large. If the effect sizes are tiny, there is almost never a reason to invest large sums of money. Therefore, the precision accorded by the full distribution isn’t marginally useful.

A concise summary: conventional hypothesis testing cutoffs can serve as Schelling points when stakeholders can’t trust each other not to advance their own individual interests at the expense of the group goal. I like it!

Possible solution: get people to agree to the cutoff before the data (and hence posterior distribution) are in hand. Then they’re agreeing to the terms of a bet rather than fighting for resources.

Good summary. I always like the focal point concept. The problem with the ex ante agreement is what the enforcement mechanism is, as there is no penalty for reneging after the fact. Also, the debate may then shift to questioning the data.

On thinking about this more, I realized that this post is about tests versus estimates and in that sense, I am in the estimates camp, i.e. I look at intervals and intervals have the analogous concept of confidence.

The biotech industry is like playing cards with the devil, though I have heard raunchier metaphors. At any point, ‘not being able to reject the null’ on efficacy or rejecting the null on a safety outcome can literally shut you down as investors shut off the cash spigot.

But most of business research is more incremental. Experiments are incremental and hypothesis testing in combination with good experimental design is a great way to remove unwarranted hunches and focus on factors that are of importance to the product users. Re: cutoffs. There is a lot more equivalence and non-inferiority testing going on now that uses CIs in combination with pre-planned cutoffs. This is not Bayesian but the presumptions of a range require some prior knowledge.

That said we have created a culture where scientists think ‘significance’ is magic and when you ask them what effect size is of value, they say a “significant one”.

It is amazing how every time you read the words psychology, p-value, and hypothesis test in the same place, insurmountable amounts of bullshit are going to get you soaking wet. And I just wonder… are these bullshit storms seasonal waves? Are we in the summer-season wave? And what’s wrong with psychologists anyhow… Too much Freud?

But before (I have the feeling I might be late, though) the usual suspects begin singing “Aye! That’s quite a cutlass ye got thar, what ye need is a good scabbard!”, let me address Mr. Greg Francis’s main point which, by the way, if I am not mistaken, Mr. Andrew Gelman also shares with him, and which is:

So to answer this comment, let me tell you a little story from back in the days when my Inference and Decision course professor (who, by the way, just like R. A. Fisher, was a biologist and a mathematician) was explaining what the power of a test was and mentioned that the power usually chosen by researchers and industry was around 80%.

Then I raised my hand (yeah, I am one of those who raise their hand a lot to compensate for those who never do) and asked him something along the lines of “Why on Earth don’t we run experiments with 99.99% power? Surely that would yield much more reliable results and we would have a much more accurate measure of the effect size.” Nailed it!… or so I thought.

His answer was along the lines of “Sure, and who is going to pay the bills, pal?” Well, I wish his answer had been more elaborate but, on the other hand, he went straight to the point, which is that we do not always have resources for a gazillion experiments and, nonetheless, we still would like to make inferences with what we have.

There are many situations where either the resources are not there or it is simply immoral to design experiments large enough to measure effect sizes accurately. Sure, this information is important for science, but in many situations it comes second to moral standards and resource availability.

So, my psychologist friends, maybe you want to consult one of those statisticians whose “methodology cannot be named” before you write another paper on p-values; it might be enlightening… Unless, of course, the true purpose of those papers is to drive crazy “those that cannot be named”, in which case you are having moderate success with me so far, which is a dilemma since my future shrink might be one of you and then I would only have religion left as a last resort to comfort my soul.

If it turns out that in your field of research you have resources for gazillions of experiments with no associated moral problems then, by all means, do it! But have you thought that maybe your colleagues don’t have resources for very high power tests? That maybe 10 volunteers is all they could get, and that this might not be enough for an accurate measure of the effect size? Are you saying these experiments should not be published?

p-values and hypothesis tests are fantastic inference tools, and just like many other things in life, it’s not always size that matters, but how you use it.

Out of a completely morbid sense of curiosity: whose methodology is it that can’t be named? Voldemort?

Also, are you aware that there are plenty of prominent Frequentists who similarly believe that increasing power doesn’t solve the basic problems and that it’s better to concentrate on effect sizes whatever data you have?

What “basic problems”?

Oh! In whatever data you have? Really? Okay… let’s imagine a biologist develops a pill for pregnant women that makes babies smarter but, since human experimentation is highly controversial, especially when babies are involved, he agrees with his wife that she will take the pill after getting pregnant. So the “whatever data you have” in this case is N=1 baby being born after the mother took the smart pill.

So the baby grows up healthy with no problems, and when he is old enough they check his IQ and it happens to be 125. Since the 95% CI for the IQ population is 100±30, that would give him a p-value higher than 0.05, which does not amount to great evidence in favor of the hypothesis that the pill works.
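The arithmetic behind that p-value can be sketched in a few lines. This is only an illustration under stated assumptions: untreated IQ is taken to be normal with mean 100, and “100±30” is read as its central 95% range, which fixes the standard deviation. The variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Assumption: untreated IQ is normal with mean 100, and "100 +/- 30"
# is its central 95% range, which pins down the standard deviation.
mean_iq = 100.0
half_width = 30.0
sigma = half_width / std_normal.inv_cdf(0.975)

# The single observation in the example.
observed_iq = 125.0
z = (observed_iq - mean_iq) / sigma
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

print(f"sigma = {sigma:.1f}, z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
```

Under these assumptions the two-sided p-value comes out at around 0.1, in line with the claim that one child scoring 125 is weak evidence on its own.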

But what happens if we go with your “concentrate on effect sizes whatever data you have” approach? Well, the effect size estimation is of 25, wow! really? 25? so big? is that truly the only thing you want to concentrate on? Because if that is the only thing you are going to pay attention whatever the data you have then I have a bridge in Brooklyn I want to sell you… Or a baby smart pill.

When you have experiments so large that you can measure effect sizes with high accuracy, the p-value is simply a first validation step, since it is going to be virtually zero; then you focus on the effect size. But when you have very small experiments, the effect size estimate might be meaningless; then you focus on the p-value.

So the p-value comes first, and the effect size comes next, and only if it makes sense.

> So the baby grows up healthy with no problems, and when he is old enough they check his IQ and it happens to be 125. Since the 95% CI for the IQ population is 100±30, that would give him a p-value higher than 0.05, which does not amount to great evidence in favor of the hypothesis that the pill works.

(Worse, actually, since being a genius biologist to make a germ-line modifying pill, you would expect his descendants to be way higher than 100, and closer to 120ish depending on his wife and details, so he’ll be even further inside the CI.)

I’m pretty sure any statistician considering your case would use more than a point estimate at 25 to summarize the effect size, whether they were frequentist or bayesian.

One baby, and only one, is all you have in this problem and, by the way, N=1 is all you have in some very important problems out there.

Concentrating on effect sizes doesn’t necessarily mean “point estimate”. In this case you’d want a range estimate for the effect on this one child. You could say something like: “because of the nature of IQ and genetics it’s very unlikely the child would normally have had an IQ as low as 60 or as high as 250. So the effect of the pill was somewhere between (-125,70)”. It’s not saying much, but then you didn’t have much to go on.

So in your Bayesian universe, having a result of 125 when the average is 100 gives you an effect “somewhere between (-125,70)”? So more room for a negative effect than for a positive one… All right, I don’t even know how you did that, but God bless Bayes.

It’s a tool called “subtraction”. You may not have encountered it in your studies.

I was illustrating the point that even when almost no information is there you can still get range estimates, and you’d want to do so because point estimates are highly misleading. If you don’t like the example, feel free to substitute a more realistic frequentist version. You should have no trouble doing so after you’ve conquered that whole “subtraction” thing.

Nope, I never found the “subtraction” technique for estimating intervals in my studies; that must be a Bayesian thingy.

Not to like your example? I LOVE IT! I truly thought you were going to backpedal, but I am delighted and so happy to see you keep your Bayesian (-125,70) interval with a straight face! Thanks!

Your answer is a fantastic example of what Bayesian reasoning can do to a bright person such as yourself, so may this serve as a cautionary tale to any Bayesian-wannabe psychologists reading this. This is what happens when you only have a hammer; everything looks like a nail to you.

But yeah, you ask for the frequentist version. The maximum likelihood estimator for the true mean of the pill effect will be the only result we have, that is 125. Without more data we cannot estimate the variance of the effect, but if we assume that the pill would only change the expected value of the IQ distribution and not its variance, then we can estimate a 95% CI of 125±30, that is, a 95% chance that the interval (95,155) contains the true mean for the pill effect.

So the Bayesian (-125,70) vs. the Frequentist 95% CI-based (-5,50)… Yep, I should permalink this one for future reference. :D

Fran, this is completely trivial, and if you’re really having trouble understanding it then you should probably consider a different profession.

I was just computing upper and lower bounds for the effect size to illustrate a point. It’s incredibly rare that IQs are lower than, say, 50 or higher than 220. So if the baby’s IQ is 125 then that bounds the possible size of the effect. At most either:

raised IQ from 50->125

lowered IQ from 220->125

So the effect has to be a change in IQ somewhere in the interval (-105,75). Bounds computed to illustrate a point so stop being an idiot.

And you did illustrate the point very well, especially now that you have updated your Bayesian mandatory limits so that your interval does not look so ridiculous… still does anyhow.

About you calling me names… let’s make a deal: you don’t tell lies about me and I don’t tell truths about you. Okay? Over.

Fran, I’m financially set now and so don’t worry about these kinds of things, but do you realize that employers Google potential hires nowadays?

These comments amount to a public record proving that whenever you can’t understand incredibly simple and controversial points you’re in the habit of making a complete ass out of yourself. Hopefully you’ve got a plan B if this statistics gig doesn’t work out. I don’t know any outfit in the US that would touch a stats guy like yourself.

To all commenters:

Please be polite and avoid personal attacks (even if you think they are appropriate).

Please do not troll (even if you think the point you are making is clever).

Thank you.

Entsophy:

Good for you. I’m unemployed in a financially bankrupt country spiraling into Balkanisation, other than that I’m great too.

If something is simple why would it be controversial? But what do I know, you are the genius here.

Really? How about if I promise to learn how to calculate 16th century math intervals based on my opinion for its bounds? Also, I would like my possible US employers to know that I can sing La Cucaracha with a perfect fake Mexican accent in the coffee breaks and that I LOL when told a Spanish woman mustache joke.

My plan B is to be happy, but I’m going to hang on to plan A for a while, just like most humans do. Thanks.

Andrew:

I’m sorry you think I’m trolling, but I guess that irony might be seen as somehow inflammatory, especially when someone does not agree with one’s views so, and only in this sense, you might be right. In any case this is your blog and I don’t want you to feel I am trolling it so I’ll move on… though you’re sometimes naughty and flame-bait a bit too, right? ;) But fair enough, nice blog though.

Dear Fran,

The point was you can get an interval estimate even when there’s little data. I illustrated it by calculating the upper and lower bounds for the effect size. There surely are improved interval estimates which illustrate the point even better, but I was doing a back-of-the-envelope calculation to illustrate a point.

Neither the point nor calculation were bayesian, complicated, or controversial in the slightest. You failed to understand any of it and went on a bizarre randomly bold faced anti-bayesian diatribe.

Clearly you’ve got some issues to work through and don’t need me poking at you, so I’ll stop. There’s some great stuff on your blog and I wish you the best of luck.

Using a sample size of N=1 certainly seems like a bad idea! I’ll be sure to keep that in mind. Have you considered explaining all this to the frequentists who hold these views? Larry Wasserman for example: (Notice what the false discovery rate goes to as power -> 1)

http://normaldeviate.wordpress.com/2013/04/27/the-perils-of-hypothesis-testing-again

Or even take it up with Deborah Mayo. She believes that “Severity” needs to be added to the Frequentist arsenal in part because statistical power doesn’t quite get the job done. I don’t think either of them has stumbled across this N=1 problem before and they’d appreciate the heads up.

In the example I gave you, N=1 is neither a good nor a bad idea; it is simply all you have to make inference. It is you, the one claiming “there are plenty of prominent Frequentists who similarly believe … it’s better to concentrate on effect sizes whatever data you have”. You said that, not Larry nor Deborah.

About Larry holding “these views”: what are “these views” you are talking about?

Larry’s post simply states the obvious about the obvious Ioannidis, that is, if you have a very low prior probability for studies to be true you will have a high rate of false positives. So what?

And about the false discovery rate you want me to notice: Larry is using in his calculations a fixed alpha of 0.05 regardless of the size N of the experiment. That is okay for making a point about Mr. Ioannidis’s obvious remarks but, in a real experiment, if you increase N when you already have power at or close to 1, you need to lower the value of alpha which, in turn, lowers the rate of false discoveries.

And what do your interpretations of Mayo’s remarks on statistical power have to do with anything we are talking about? Where are the “whatever data you have” quotations? I think you have bought so much into the idea that anything Frequentist is horrible that you cannot see anything else; even prominent “Frequentists” agree with you now… go figure.
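The dependence of the false discovery rate on the prior, alpha, and power that is being argued over here can be written down directly. The sketch below uses the standard textbook relationship; it is not claimed to be the exact calculation in the linked post, and the 5% share of real effects is a made-up number for illustration.

```python
def false_discovery_rate(prior_true, alpha, power):
    """Expected share of 'significant' results that are false positives,
    given the share of tested hypotheses that are true (prior_true)."""
    false_positives = alpha * (1 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: only 5% of tested effects are real.
prior_true = 0.05
for alpha in (0.05, 0.005):
    for power in (0.2, 0.8, 1.0):
        fdr = false_discovery_rate(prior_true, alpha, power)
        print(f"alpha={alpha:<5} power={power:<3} FDR={fdr:.2f}")
```

With alpha held fixed, the rate stays well above zero even when power reaches 1, because the false positives among the many null effects do not go away; only a smaller alpha or a higher share of real effects pushes it down.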

I can’t believe I’m about to explain this, but here goes:

Everyone knows more data is better than less. Everyone also knows that at some point you stop collecting data and that’s all you have. When Wasserman, or anyone else, talks about what methods they like to use, they’re obviously talking about what methods they like to use after they’ve stopped collecting as much data as they can. If you don’t like it that he’d rather look at the size of effects than hypothesis testing, then take it up with him.

If you think there’s no issue with classical Power then take it up with Mayo. She’s the Frequentist advocating “severity” which is similar, but different, from power and claims that it illuminates all the problems with classical hypothesis testing. You should let her know that you’ve discovered there aren’t any problems with hypothesis testing as long as you use high powered tests. She’ll probably be glad for all the trouble you’ve saved her.

I can’t believe it either, but there you go…

Fran:

In this example I think we have a pretty strong prior that the effect of the pill is much less than 25. I’d recommend using this prior information, whether or not the observation is statistically significant at the 5% level.

And what would that strong prior be? I mean, for all we know that pill might cause severe retardation in children as much as make them have IQs off the chart.

Honestly, how do you know “the effect of the pill is much less than 25” from the information given by the problem, and just by the problem?

Do you get the prior from reasoning similar to what gwern said (June 4, 2013 at 5:08 pm)?:

If so, we can change the problem to one where a random husband with a random wife steals the pill from a research facility… or where the biologist’s wife was knocked up by a bodybuilder in the gym, whichever you like the most. The main point of the example was to show that effect size is not always something relevant, or even meaningful; e.g., what sense could we make of the effect sizes of an ANOVA with 100 variables without the p-value?

The use of prior information is an entirely different matter… maybe for the next flame bait? :)

Fran:

I got the information that the effect of the pill is much less than 25 from your statement: “Well, the effect size estimation is of 25, wow! really? 25? so big? is that truly the only thing you want to concentrate on? Because if that is the only thing you are going to pay attention whatever the data you have then I have a bridge in Brooklyn I want to sell you… Or a baby smart pill.” This statement indicates to me that you have external knowledge not in the data saying that an effect size of 25 is not very plausible.

This is a point we discuss in one of the early sections of chapter 6 of BDA, that if you look at your posterior inference and it seems to make no sense, that you implicitly have external information that you were not using in your calculations. This sort of thing happens all the time (enough so that we mention it in our book!).

Andrew,

As a matter of fact, 25 is the most plausible estimate, but with so much uncertainty that betting money on that effect size amounts to gambling. So, in short, we should not base our inference (or bet our money) on such uncertain effect size estimates; 25 might sound really promising, but the p-value indicates the result is not really that special.

This is how I think experiments should be carried out under scarcity of resources.

1 – Do an experiment with a very small N, even of one. At this stage we don’t care about effect size, only about whether there is something worthwhile to be pursued.

2 – Calculate the p-value for your test.

3 – If the p-value is high, consider whether your idea is worth your money/time/resources…

4 – If the p-value is low, and the N is small, this is evidence for a large effect size. Present this result to ask for more resources and plan an experiment that allows you to measure the effect size accurately in the next experiment.

So the next experiment won’t be so much about whether there is an effect or not, but about how large the effect suggested by our first experiment is… Anyhow, this is me.

One question, Andrew: which of your books is a good entry point for non-Bayesians with a fairly good knowledge of statistics?

Fran:

I’m sorry to say but your steps 1-4 won’t in general work well if you’re studying small effects with small sample size. See my paper with Weakliem for a detailed discussion of one such example. Statistical methods which might work well for one purpose (the estimation and quantification of large effects) don’t necessarily work well when studying small effects.

If you’re interested in learning more about Bayesian methods, I’d recommend Bayesian Data Analysis, you can start with chapter 1. But wait a few months until the new third edition comes out.

Andrew, although you’re right that his method 1-4 won’t work in general for small effects, I think the point he’s trying to make is valid: in the presence of limited resources we should really be looking for things that have big (good) effects and leave the smaller effects until later. Of course, this is an idealization, but it does make some sense.

Dan:

I agree that we should look for big effects, but I don’t think the way to identify big effects is to look for statistical significance in small datasets. I think that’s a good way to find small effects that appear to be big.
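A small simulation illustrates why filtering on statistical significance in small datasets makes small effects look big. This is a sketch under assumed numbers (a true effect of 1 measured with a standard error of 5, normal sampling error); the function name and parameters are hypothetical.

```python
import random
from statistics import NormalDist, mean


def exaggeration_ratio(true_effect, standard_error, n_sims=20000, alpha=0.05, seed=1):
    """Simulate many noisy estimates of the same small effect and return
    (share significant, mean |estimate| among significant / true effect)."""
    rng = random.Random(seed)
    # An estimate is 'significant' when it lies beyond this many units from zero.
    critical = NormalDist().inv_cdf(1 - alpha / 2) * standard_error
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > critical]
    share = len(significant) / n_sims
    ratio = mean(abs(e) for e in significant) / true_effect
    return share, ratio


share, ratio = exaggeration_ratio(true_effect=1.0, standard_error=5.0)
print(f"significant in {share:.1%} of studies; exaggeration factor {ratio:.1f}")
```

Every estimate that clears the significance threshold here must be at least about twice the standard error in magnitude, so the “discoveries” overstate the assumed true effect by roughly an order of magnitude.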

I think it’s an interesting question that could be partially answered by simulation. Suppose that we have a hypothesis generating mechanism, say some kind of drug discovery process. Suppose that it generates drugs whose effects in terms of some decision theoretic tradeoffs (good effects, side effects, cost to produce, etc) are distributed according to some density p(e), what is the best strategy to identify as many good drugs as possible given a fixed budget? The answer will depend of course on the form of p(e) and the cost of running exploratory and confirmatory studies. I don’t think there’s one “answer” here, but framing the question this way could lead to some interesting simulation studies and policy recommendations.

In the presence of heavy tails, where some small number of big payoffs are possible, the strategy will likely be much different than if p(e) has light tails and we really only expect incremental improvements.
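Here is one way such a simulation could be set up, as a rough sketch rather than a serious model. Everything in it is an assumption made for illustration: p(e) is either standard normal (light tails) or Student-t with 2 degrees of freedom (heavy tails), a fixed budget of observations is split evenly across the candidates, the candidate with the best noisy estimate is adopted, and the payoff is its true effect. Costs, side effects, and any confirmatory stage are ignored.

```python
import random
from statistics import mean


def draw_effect(rng, heavy_tailed):
    """Draw one true effect from an assumed p(e)."""
    if heavy_tailed:
        # Student-t with 2 df: standard normal over sqrt(chi-square_2 / 2),
        # where chi-square_2 / 2 is a unit-rate exponential.
        scale = max(rng.expovariate(1.0), 1e-12) ** 0.5
        return rng.gauss(0, 1) / scale
    return rng.gauss(0, 1)


def expected_payoff(n_candidates, budget, heavy_tailed,
                    noise_sd=5.0, n_sims=2000, seed=0):
    """Average true effect of the candidate with the best noisy estimate
    when the budget is split evenly across n_candidates small studies."""
    rng = random.Random(seed)
    n_per_study = budget // n_candidates
    standard_error = noise_sd / n_per_study ** 0.5
    payoffs = []
    for _ in range(n_sims):
        effects = [draw_effect(rng, heavy_tailed) for _ in range(n_candidates)]
        estimates = [e + rng.gauss(0, standard_error) for e in effects]
        best = max(range(n_candidates), key=estimates.__getitem__)
        payoffs.append(effects[best])
    return mean(payoffs)


budget = 1000  # total observations available (hypothetical)
for heavy in (False, True):
    for k in (1, 10, 100):
        payoff = expected_payoff(k, budget, heavy)
        print(f"heavy_tailed={heavy} candidates={k} mean payoff={payoff:.2f}")
```

Swapping in a different p(e), a cost per study, or a two-stage screen-then-confirm design changes the ranking of strategies, which is the point of framing the question this way. With heavy tails the averages are themselves noisy, so many more simulation runs would be needed before reading much into them.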

Daniel: My first job as a statistician was on this guy’s http://en.wikipedia.org/wiki/Allan_S._Detsky NIH grant doing a cost benefit analysis of this question for clinical trials.

The _obvious_ answer (there) was doing big ones (adopting treatments that are less effective is really costly).

For some reason his paper (initially entitled clinical trial are cheap) was never published.

He now produces Broadway plays, and in 4 days we get to see if two of them win Tony awards.

Clinical trials are basically confirmatory, in other words, they are pretty late in the discovery process. It seems to me that the “obvious” answer in the early discovery process has been to do enormous numbers of small studies, what they call high-throughput screening. You generate 10k drugs and test them on cell culture models, and follow up on some of them with a few mouse studies, or whatever. Only once you have a decent candidate do you even consider clinical trials, and at that point you definitely want good high powered studies to confirm what you think you’ve discovered, and to identify rare but extremely harmful side effects etc.

> early discovery process has been to do enormous numbers of small studies, what they call high-throughput screening

Peirce argued otherwise, but admitted at the time it was just a superstition that humans’ hypotheses (priors?) are more productive than that noisy data-driven search stuff.

My back of the envelope calculations years ago suggested he was right, but your assumptions may imply otherwise.

Appreciate your and Entsophy’s comments.

I think the priors have been useful to direct the shotgun: sure, they generate 10k compounds, but they do it based on modifications to some family of compounds that is chosen based on prior knowledge of the types of things that are likely to interact with pathways that are known to be involved in the disease process, etc. So it’s possible for Peirce to have his cake and for me to eat it too ;-)

Andrew:

I find your replicability standard intriguing. But that, too, seems to lead into a binary world of “successful replication” vs “unsuccessful replication”.

How do you judge when a replication is successful? Can you code your decision process explicitly, so anyone using your code will make the exact same decisions when faced with the same replication results?

Andrew: You seem to have undergone a gestalt switch from the Gelman of a short time ago–the one who embraced significance tests.

http://www.rmm-journal.de/downloads/Article_Gelman.pdf

Mayo:

I believed, and still believe, in checking the fit of a model by comparing data to hypothetical replications. This is not the same as significance testing in which a p-value is used to decide whether to reject a model or whether to believe that a finding is true.

Gelman: I don’t know that significance tests are used to decide that a finding is true, and I’m surprised to see you endorsing/spreading the hackneyed and much lampooned view of significance tests, p-values, etc. despite so many of us trying to correct the record. And statistical hypothesis testing denies uncertainty? Where in the world do you get this? (I know it’s not because they don’t use posterior probabilities…)

But never mind, let me ask: when you check the fit of a model using p-value assessments, are you not inferring the adequacy/inadequacy of the model? Tell me what you are doing if not. I don’t particularly like calling it a decision, neither do many people, and I like viewing the output as “whether to believe” even less. But I don’t know what your output is supposed to be…

Mayo:

1. I don’t think hypothesis testing inherently denies uncertainty. But I do think that it is used by many researchers as a way of avoiding uncertainty: it’s all too common for “significant” to be interpreted as “true” and “non-significant” to be interpreted as “zero.” Consider, for example, all the trash science we’ve been discussing on this blog recently, studies that may have some scientific content but which get ruined by their authors’ deterministic interpretations.

2. When I check the fit of a model, I’m assessing its adequacy for some purpose. This is not the same as looking for p < .05 or p < .01 in order to go around saying that some theory is now true.

Andrew: I fail to see how a deterministic interpretation could go hand in hand with error probabilities; and I never hear even the worst test abusers declare a theory is not true, give me a break…

So when you assess adequacy for a purpose, what does this mean? Adequate vs inadequate for a purpose is pretty dichotomous. Do you assess how adequate? I’m unclear as to where the uncertainty enters for you, because as I understand it, it is not in terms of a posterior probability.

Mayo:

Here’s a quote from a researcher, I posted it on the blog a few days ago: “Our results demonstrate that physically weak males are more reluctant than physically strong males to assert their self-interest…”

Here’s another quote: “Ovulation led single women to become more liberal, less religious, and more likely to vote for Barack Obama. In contrast, ovulation led married women to become more conservative, more religious, and more likely to vote for Mitt Romney.”

These are deterministic statements based on nothing more than p-values that happen to be statistically significant. Researchers make these sorts of statements all the time. It’s not your fault, I’m not saying you would do this, but it’s a serious problem.

Along similar lines, we’ll see claims that a treatment has an effect on men and not on women, when really what is happening is that p < .05 for the men in the study and p > .05 for the women.
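To make the arithmetic concrete, here is a minimal sketch with made-up numbers (the estimates and standard errors below are hypothetical, not taken from any study discussed here). It shows how one subgroup can clear p < .05 while the other does not, even though the difference between the two subgroups is itself far from statistically significant.

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value for a normal z-test of estimate/se against zero."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))

# Hypothetical subgroup estimates (illustrative assumptions only).
men_est, men_se = 0.25, 0.10      # z = 2.5
women_est, women_se = 0.10, 0.10  # z = 1.0

p_men = two_sided_p(men_est, men_se)
p_women = two_sided_p(women_est, women_se)

# The comparison that matters: the difference between the subgroups,
# assuming the two estimates are independent.
diff = men_est - women_est
diff_se = math.sqrt(men_se ** 2 + women_se ** 2)
p_diff = two_sided_p(diff, diff_se)

print(f"men:        p = {p_men:.3f}")
print(f"women:      p = {p_women:.3f}")
print(f"difference: p = {p_diff:.3f}")
```

With these inputs the men’s estimate is “significant” and the women’s is not, yet the data give little evidence that the two effects differ.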

In addition to brushing away uncertainty, people also seem to want to brush away variation, thus talking about “the effect” as if it is a constant across all groups and all people. A recent example featured on this blog was a study primarily of male college students which was referred to repeatedly (by its authors, not just by reporters and public relations people) as a study of “men” with no qualifications.

P.S. Bayesians do this too; indeed, there’s a whole industry (which I hate) of Bayesian methods for getting the posterior probability that a null hypothesis is true. Bayesians use different methods but often have the same misguided goal as other statisticians: to deny uncertainty and variation.

These moves from observed associations, and even correlations, to causal claims are poorly warranted, but these are classic fallacies that go beyond tests to reading all manner of “explanations” into the data. I find it very odd to view this as a denial of uncertainty by significance tests. Even if they got their statistics right, the link from stat to substantive causal claim would exist. I just find it odd to regard the statistical vs substantive and correlation vs cause fallacies, which every child knows, as some kind of shortcoming of significance tests. Any method or no method can commit these fallacies, especially from observational studies. But when you berate the tests as somehow responsible, you misleadingly suggest that other methods are better, rather than worse. At least error statistical methods can identify the flaws at 3 levels (data, statistical inference, stat -> substantive causal claim) in a systematic way. We can spot the flaws a mile off…

I still don’t know where you want the uncertainty to show up; I’ve indicated how I do.

Mayo:

You write, “I still don’t know where you want the uncertainty to show up;” I want the uncertainty to show up in a posterior distribution for continuous parameters, as described in my books.

> the posterior probability that a null hypothesis is true

J Kadane assured me that no Bayesian would actually mean _the_ posterior probability [but rather just a posterior] :-(

And from a (former) university-based statistics department faculty member who has taught statistics (in a confidential report):

Was not significant, thus no difference – was significant thus differences.

I think you nailed the denial of uncertainty, which is rampant even in the statistics discipline (especially in teaching), very well.

Andrew (couldn’t post under your comment). You write, “I want the uncertainty to show up in a posterior distribution for continuous parameters”. Let’s see if I have this right. You would report the posterior probabilities that a model was adequate for a goal. Yes? Now you have also said you are a falsificationist. So is your falsification rule to move from a low enough posterior probability in the adequacy of a model to the falsity of a claim that the model is adequate (for the goal)? And would a high enough posterior in the adequacy of a model translate into something like not being able to falsify its adequacy, or perhaps accepting it as adequate (the latter would not be falsificationist, but might be more sensible than the former)? Or are you no longer falsificationist-leaning?

Mayo:

No, I would not “report the posterior probabilities that a model was adequate for a goal.” That makes no sense to me. I would report the posterior distribution of parameters and make probabilistic predictions within a model.

Andrew: Well if you’re going to falsify as a result, you need a rule from these posteriors to infer the predictions are met satisfactorily or not. Else there is no warrant for rejecting/improving the model. That’s the kind of thing significance tests can do. But specifically, with respect to the misleading interpretations of data that you were just listing, it isn’t obvious how they are avoided by you. The data may fit these hypotheses swimmingly.

Anyhow, this is not the place to discuss this further. In signing off, I just want to record my objection to (mis)portraying statistical tests and other error statistical methods as flawed because of some blatant, age-old misuses or misleading language, like “demonstrate” (flaws that are at least detectable and self-correctable by these same methods, whereas they might remain hidden by other methods now in use). [Those examples should not even be regarded as seeking evidence but at best colorful and often pseudoscientific interpretations.] When the Higgs particle physicists found their 2 and 3 standard deviation effects were disappearing with new data—just to mention a recent example from my blog—they did not say the flaw was with the p-values! They tightened up their analyses and made them more demanding. They didn’t report posterior distributions for the properties of the Higgs, but they were able to make inferences about their values, and identify gaps for further analysis.

http://errorstatistics.com/2013/03/17/update-on-higgs-data-analysis-statistical-flukes-1/

Once you have a fitted Bayesian model, with posterior distribution for the parameters, then you have a generative model for the data, and can generate distributions for the data under the model hypotheses. You can in essence generate fake data. Statistical tests can be used to determine if the actual data is consistent with the distribution of data predicted by the model. This is my understanding of the essence of “posterior predictive checks” as defined by Gelman.

It’s not absolutely a requirement of Bayesian analysis that the model have an interpretation as random sampling under repeated experiments though. We are never going to repeat the last Presidential election yet Nate Silver’s predictions had a Bayesian nature to them. There are aspects of repeatability in most Bayesian models though, such as if we use previous polls and current news items and economic statistics to predict the next set of polls even if we’re not going to repeat the election.

One thing that is pretty clear, though, is that there is not some obvious single null hypothesis to consider when doing posterior predictive checks. There are many ways in which the actual data could be considered unusual under the generative model, some of which may be irrelevant to rejecting the model, and some of which may be critical. It seems to me that a posterior predictive check has more of the character of a set of random number generator tests, such as Marsaglia’s DIEHARD, where the interest is in testing the uniform randomness of a stream. In posterior predictive checks the interest is in testing goodness of fit between observed data and posterior predictions, which could be defined in many different ways.
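As a rough illustration of the idea, here is a minimal sketch of a posterior predictive check run with several test statistics at once. Everything in it is an assumption chosen for simplicity: a normal model with known standard deviation, a flat prior on the mean, hand-made data with one planted outlier, and a hypothetical helper name `posterior_predictive_pvalues`. It is not the procedure from any particular book or package.

```python
import random
import statistics

def posterior_predictive_pvalues(y, stats, n_rep=2000, sigma=1.0, seed=0):
    """Share of replicated data sets whose statistic is >= the observed one.

    Assumes y_i ~ Normal(mu, sigma) with sigma known and a flat prior on mu,
    so the posterior for mu is Normal(mean(y), sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    post_sd = sigma / n ** 0.5
    observed = {name: f(y) for name, f in stats.items()}
    exceed = {name: 0 for name in stats}
    for _ in range(n_rep):
        mu = rng.gauss(ybar, post_sd)                     # draw from the posterior
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]  # fake data set
        for name, f in stats.items():
            if f(y_rep) >= observed[name]:
                exceed[name] += 1
    return {name: exceed[name] / n_rep for name in stats}

# Made-up data: 19 unremarkable values plus one planted outlier (6.0).
y = [-1.2, 0.3, 0.8, -0.5, 1.1, -0.1, 0.4, -0.9, 0.2, 0.6,
     -0.3, 1.4, -0.7, 0.0, 0.5, -1.0, 0.9, -0.4, 0.1, 6.0]

checks = {"mean": statistics.fmean, "max": max, "sd": statistics.stdev}
pvals = posterior_predictive_pvalues(y, checks)
for name, p in pvals.items():
    print(f"{name}: {p:.3f}")
```

Under these assumptions the mean check should look unremarkable (the model is fit to the mean, so the value lands near one half), while the max and sd checks come out extreme because the model cannot reproduce the outlier. Which of those discrepancies matters depends on what the model is for, which is the point made above.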

Very nicely stated!

“An empirical approach is to identify scientific truth with replicability; hence, the goal of an experimental or observational scientist is to discover effects that replicate in future studies.”

I would like to take issue with that statement.

Scientific truth does not rest on replication, scientific truth rests (as much as anything) on ‘extensibility’. What I mean is that the result that you publish provides an intellectual foundation that allows you to learn something about a similar problem. Best is when it was not even clear that the problem you solve and the problem solved by the person who read your paper were related. The result tells you something about the world that you did not previously know. That insight, that truth, can be applied to other problems productively.

Interestingly, it is when the reproducibility fails, that you (often) have the greatest opportunity to learn.

I used to teach immunology to undergraduates and I would start each semester with two very short papers by Pasteur published in Science. (The papers are transcriptions from an address he gave and are very readable and enjoyable.) They describe his ability to culture microbes from chickens that died from chicken cholera. Injection of those cultured microbes into healthy chickens resulted in the injected chickens dying from cholera. This result was reproducible. Thus was born the microbial theory of disease*.

At some point in his experiments, however, when his experimentalist injected the cultures into chickens, the chickens survived! Failure to reproduce the phenomenon, all is lost. Scientific untruth, give up.

Pasteur did not give up. What he found was that the chickens that survived being injected with the cultured microbes were now resistant to chicken cholera. Somehow having been injected in the thigh with the somehow weakened chicken cholera culture, enabled the whole animal (not just the thigh) to be resistant to future cholera infections. (This strongly supported the idea that the disease caused by injecting the cultured microbes was indeed chicken cholera and not something unrelated.) Thus was born immunology*.

This result was reproduced on a huge scale. Paris, at that time, would suffer huge losses of poultry from cholera epidemics. A vaccination program using Pasteur’s methods stopped the epidemics and Parisians ate well.

To me what is really the most convincing is not that he was able to reproduce the original finding. Nor that when the experiments started to fail, he was able to do more experiments that proved an arguably bigger point. What was really convincing was that he did analogous experiments with cow anthrax. The experiment was not reproduced with anthrax, but the principles learned were applied to this new problem. The results were analogous, not identical. Different disease, different host organism, different microbe, different culturing conditions.

What is amazing to the modern reader is how incredibly weird some of his ideas are about what is happening to make an infectious agent become a vaccinating agent. His writing almost sounds like he is thinking about phlogiston as the attenuating agent*. Having extraordinary insight, finding two Scientific Truths, and creating two huge new disciplines does not necessarily make your current understanding correct. Science is humbling.

Sometimes we can get so lost in our analysis that we lose sight of the purpose of the scientific endeavor. Science does not rest on one result. Science is a living thing. All interpretations are contingent on the next experiment.

Policy is different. I am not a policy maker, so I’ll leave that to y’all. But I think it’s really important to remember what a scientific result is and what a result for the purposes of setting policy is.

* Ok, that’s an oversimplification of the history, but accurate enough for a blog comment.
