The American Statistical Association just released a committee report on the use of p-values. I was one of the members of the committee but I did not write the report.

We were also given the opportunity to add our comments. Here’s what I sent:

**The problems with p-values are not just with p-values**

The ASA’s statement on p-values says, “Valid scientific conclusions based on p-values and related statistics cannot be drawn without at least knowing how many and which analyses were conducted.” I agree, but knowledge of how many analyses were conducted, etc., is not enough. The whole point of the “garden of forking paths” (Gelman and Loken, 2014) is that to compute a valid p-value you need to know what analyses *would have been done* had the data been different. Even if the researchers only did a single analysis of the data at hand, they well could’ve done other analyses had the data been different. Remember that “analysis” here also includes rules for data coding, data exclusion, etc.

When I was sent an earlier version of the ASA’s statement, I suggested changing the sentence to, “Valid p-values cannot be drawn without knowing, not just what was done with the existing data, but what the choices in data coding, exclusion, and analysis would have been, had the data been different. This ‘what would have been done under other possible datasets’ is central to the definition of p-value.” The concern is not just multiple comparisons, it is multiple *potential* comparisons.

Even experienced users of statistics often have the naive belief that if they did not engage in “cherry-picking . . . data dredging, significance chasing, significance questing, selective inference and p-hacking” (to use the words of the ASA’s statement), and if they clearly state how many and which analyses were conducted, then they’re ok. In practice, though, as Simmons, Nelson, and Simonsohn (2011) have noted, researcher degrees of freedom (including data-exclusion rules; decisions of whether to average groups, compare them, or analyze them separately; choices of regression predictors and interactions; and so on) can be and are exercised after seeing the data.

A *scientific* hypothesis in a field such as psychology, economics, or medicine can correspond to any number of *statistical* hypotheses, and if the ASA is going to issue a statement warning about p-values, I think it necessary to emphasize that researcher degrees of freedom—the garden of forking paths—can and do come into play even without people realizing what they are doing. A researcher will see the data and make a series of reasonable, theory-respecting choices, ending up with an apparently successful—that is, “statistically significant”—finding, without realizing that the nominal p-value obtained is meaningless.

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B (see Gelman, 2014). Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.

What, then, can and should be done? I agree with the ASA statement’s final paragraph, which emphasizes the importance of design, understanding, and context—and I would also add measurement to that list.

What went wrong? How is it that we know that design, data collection, and interpretation of results in context are so important—and yet the practice of statistics is so associated with p-values, a typically misused and misunderstood data summary that is problematic even in the rare cases where it can be mathematically interpreted?

I put much of the blame on statistical education, for two reasons.

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. Even when we discuss the design of surveys and experiments, we typically focus on the choice of sample size, not on the importance of valid and reliable measurements. The result is often an attitude that any measurement will do, and a blind quest for statistical significance.

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. Just try publishing a result with p = 0.20. If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted (Gelman and Carlin, 2014).

In summary, I agree with most of the ASA’s statement on p-values but I feel that the problems are deeper, and that the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.

**P.S.** You probably don’t need to be reminded of this, but these debates have real effects, as naive (or calculating) researchers really do make strong claims based on p-values, claims that can fall apart under theoretical and empirical scrutiny; see for example here and here.

Couldn’t you level the same criticism upon almost any form of inference and almost any dataset?

Ben:

What I’m criticizing is “null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B. Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.”

So, yes, any form of inference can have that problem with any dataset. You read me correctly on that.

But it’s not necessary to do null hypothesis significance testing at all!

As a theoretician recently making forays into more data-based science, what do you recommend in place of null hypothesis significance testing?

Model comparison approaches are your best bet. Comparing multiple plausible models allows for more in-depth analysis of strengths and weaknesses and avoids the straw man problem. There are Bayesian, information theoretic, machine learning and even frequentist tools for this.

Can you clarify? Don’t model comparison approaches still involve null hypothesis testing (e.g., that the fit of one model is surprisingly better than the fit of another, under the null)? How else are the models evaluated with respect to one another?

So, in a frequentist framework, you’re still comparing models by setting one as the “null hypothesis” and then computing a p-value. But there are lots of other conceptual frameworks for this.

Bayesian: assign a prior probability to each model, combine that with the model’s likelihood to get a probability of each model being the true model. You can use Bayes factors as p-value analogues.

Information-theoretic: compute AIC or BIC for each model (likelihood with a complexity penalty) to get an estimate of which model is “best” (either for prediction (AIC) or as an approximation to the Bayesian approach (BIC)). Use *IC weights to get weights for each model.
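To make the information-theoretic route concrete, here is a minimal sketch in Python with NumPy (the simulated data, candidate models, and helper names are all invented for the example): it fits three nested polynomial models by least squares, computes AIC and BIC from the Gaussian log-likelihood, and converts the AICs into model weights.

```python
import numpy as np

def gaussian_loglik(y, yhat):
    """Maximized Gaussian log-likelihood, plugging in the MLE of sigma^2."""
    n = len(y)
    sigma2 = np.mean((y - yhat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def ic_weights(ics):
    """Turn a vector of AIC (or BIC) values into normalized model weights."""
    d = np.asarray(ics, dtype=float)
    d = d - d.min()
    w = np.exp(-0.5 * d)
    return w / w.sum()

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # the true relationship is linear

results = {}
for name, degree in [("intercept-only", 0), ("linear", 1), ("quadratic", 2)]:
    X = np.vander(x, degree + 1)                   # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    ll = gaussian_loglik(y, X @ beta)
    k = degree + 2                                 # coefficients + error variance
    results[name] = {"aic": 2 * k - 2 * ll, "bic": k * np.log(n) - 2 * ll}

weights = dict(zip(results, ic_weights([m["aic"] for m in results.values()])))
# the badly misspecified intercept-only model should get essentially zero weight
```

Note that the weights are only as good as the candidate set: they compare the models you thought to write down, not all the models you might have fit had the data looked different.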

Machine learning: use cross-validation or a held out data set and check each model’s out of sample predictive accuracy.
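A minimal sketch of the held-out-data idea, again in Python/NumPy with invented data and candidates: fit each model on a training split and score it only on data it never saw.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = x + rng.normal(size=n)                       # linear signal plus noise

train, held_out = slice(0, 200), slice(200, n)   # a simple fixed split

def fit_poly(degree):
    """Least-squares polynomial fit on the training split only."""
    X = np.vander(x[train], degree + 1)
    beta, *_ = np.linalg.lstsq(X, y[train], rcond=None)
    return beta

def held_out_mse(degree, beta):
    """Mean squared prediction error on data the model never saw."""
    X = np.vander(x[held_out], degree + 1)
    return np.mean((y[held_out] - X @ beta) ** 2)

# intercept-only, linear, and a wiggly degree-8 polynomial
mse = {d: held_out_mse(d, fit_poly(d)) for d in (0, 1, 8)}
# the linear model should beat the intercept-only model out of sample
```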

L1-regularization and sparse priors: design your analysis so that coefficients of exactly zero are favored, performing model selection automatically. Use cross-validation to determine an appropriate level of regularization.
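To make the sparsity idea concrete, here is a minimal coordinate-descent lasso sketch in Python/NumPy (the data, the penalty grid, and the helper names are all invented; in practice one would likely use a library implementation): the L1 penalty sets small coefficients to exactly zero, and a held-out split picks the penalty strength.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [3.0, -2.0]                  # only two real effects
y = X @ true_beta + rng.normal(size=n)

# choose the penalty on a held-out split, then refit on all the data
tr, va = slice(0, 150), slice(150, n)
grid = [0.01, 0.05, 0.1, 0.3, 1.0]
lam = min(grid, key=lambda l: np.mean((y[va] - X[va] @ lasso(X[tr], y[tr], l)) ** 2))
beta_cv = lasso(X, y, lam)
beta_strong = lasso(X, y, 0.3)   # a heavier penalty zeroes all eight null coefficients
```

The forking-paths caveat applies here too: the held-out split guards against overfitting the coefficients, not against the choices that produced X and y in the first place.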

The unifying theme is that you need some way to determine the appropriate level of complexity by balancing number of free parameters and goodness of fit. Frequentist tests usually try to get at this by a) really discouraging exploratory and post-hoc analyses and b) setting the decision rule such that the simpler model is the null and proving a more complex hypothesis is hard.

Alice:

What you describe as “Bayesian” is not the way I like to do Bayesian statistics. See chapter 7 of BDA3 or my 1995 paper with Rubin for more on this point.

One thing that may help bridge the gap is if statisticians stop talking in terms of the “assumptions” of statistical tests. The assumptions (eg independence) incorporated into the usual null hypothesis are just as important as the numerical value being tested.

This is the same as the auxiliary theories used in testing a research hypothesis (eg our histological stain S for protein X actually detects protein X and not protein Y). No one would seriously defend a claim that “treatment increases levels of protein X” based on mere assumption that stain S detects protein X.

It gets worse because many studies are *designed* to deviate from the statistical hypothesis. A common example would be testing a hypothesis that the data is independently drawn from some distribution and then designing a study where more samples are added if the results are “near significant” (ie the later samples are dependent upon the earlier). This is just as idiotic as if someone used stain Q, known not to detect protein X, instead of stain S.

Perhaps more clearly defining the null hypothesis like this will help lead the mind to considering its plausibility, and the logical relationship between it and the research hypothesis. Also, seeing promising results and collecting more data (or making many comparisons, etc.) is a GOOD THING that scientists SHOULD be doing; that is why they do it. The problem is that the statistical procedure being taught/used is incompatible with good scientific practice. The solution is to get rid of the incompatible procedure, not to avoid looking too closely at the data.
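The optional-stopping point above is easy to demonstrate by simulation. Here is a sketch in Python/NumPy (the sample sizes, the “near significant” window, and the number of extra batches are all arbitrary choices for the illustration): when the null is true, adding data whenever the result looks “near significant” inflates the rate of p < .05 above the nominal 5%.

```python
import numpy as np
from statistics import NormalDist

def p_value(x):
    """Large-sample two-sided p-value for H0: mean = 0."""
    z = x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = np.random.default_rng(6)
n_sims = 2000
fixed_hits, chasing_hits = 0, 0
for _ in range(n_sims):
    x = rng.normal(size=50)          # the null really is true
    p = p_value(x)
    fixed_hits += p < 0.05           # honest fixed-n analysis
    for _ in range(5):               # "near significant"? collect 10 more...
        if not (0.05 <= p <= 0.30):
            break
        x = np.concatenate([x, rng.normal(size=10)])
        p = p_value(x)
    chasing_hits += p < 0.05

fixed_rate = fixed_hits / n_sims      # close to the nominal 0.05
chasing_rate = chasing_hits / n_sims  # reliably larger
```

Which supports the comment’s point: the fix is not to forbid collecting more data, but to use procedures whose interpretation does not break when you do.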

Questions to readers:

Do you know of any Stats 101 graduate course over the past 5 years that teaches _research practice_ or how to design, implement, analyze, and report findings from a scientific study?

If yes, does the course discuss things like pre-registration, protocols, checklists for reporting in protocols and final analyses, lab-book content and practices, data integrity, storage and handling, and reproducibility, among others? Or is it mostly probability, tests, regression, and so-called “credibility revolution” methods (e.g., regression discontinuity)?

It was more than 5 years ago, but my training largely separated the two. There was a research methods course which covered the kind of stuff in your list (although not pre-registration or reproducibility; again, it was a bit further back in time) and a tiny bit of statistics, and then you also took statistics courses.

The stats books I have talked about experimental design but their focus (at least through the filter of the teacher/class) was more about what tests would be appropriate for different designs with some discussion of which designs might be better under different circumstances (e.g. cross-sectional versus longitudinal for different kinds of claims). It wasn’t about the actual process of running a study.

Interesting question. Though it was more than five years ago, the graduate course I gave at the Duke Statistics Department on meta-analysis had that type of material – but the students did not seem to appreciate it, with comments like “it’s just words”. On the other hand, the graduate student who had to mark an assignment I gave to undergrads that had that type of material actually thanked me for what they learned from it (they were heading on to a statistical consulting position and said they had no idea about these issues).

Meta-analysis more broadly (I called it auditing scientific practice in one talk) is an often-omitted topic, included only fairly recently in statistics departments, I think because it’s a nice Bayesian topic (e.g., the eight-schools example). In epidemiology, at least in courses using the Rothman, Greenland, Lash book, I think it usually gets omitted, as the meta-analysis chapter is the last chapter in the book.

I like this quote from Don Berry in the ASA supplement (which I am glad he made, because I think it needs to be heeded here):

“Irreproducible research is a huge problem in science and medicine. Statisticians are well positioned to teach other scientists about reproducibility of research, or lack thereof. However, most statisticians are as naïve in this regard as the scientists themselves.”

I’ve been thinking a lot about what a “model” means. It’s easy to think of “a model” as a particular procedure or formula, along with some parameters for that procedure or formula. But the fact is, the data that went into training the model is also part of the model. And the decisions that went into making the training data are also part of the model: which variables to include, which to exclude, which to log or square, which to impute — and whether imputation is going to be done at all — and so on. (After all, these same procedures will be done to the test data and will be done to future data against which the model is deployed.)

If we can think of the model as including all the training data and all of the manipulations and choices about that training data, it clarifies things like train/test/validation sets (or CV or Bootstrap), etc. Maybe this is totally obvious to everyone but me, but I think it’s a part of the Forking-Paths discussion.

Model actually has a technical definition. I think what you’re thinking about is what “a statistical procedure” means.

Z, you’re right! I had been using the word in the sense that I’ve heard it used, as in:

a) “we have developed a model for detecting fraudulent claims”, which in my experience generally means a Statistical Model (per the technical definition you mentioned), plus the actual values of the coefficients. This is often embodied in an object in, say, R, which can then be used to score (predict) new data. In many cases, this is just a formula with variables but with coefficients filled in by the training.

b) “then we trained our model”, the result being a trained model, which is pretty much the same thing as (a).

It doesn’t sound like a “procedure” to me, though I may also not be aware of a technical definition of “procedure”. It’s not “statistical model” under the technical definition you mentioned.

The bottom line is I’m talking about the information that one needs in order to take an additional sample and make a prediction. I think it’s common for people to think of formula + coefficients (or algorithm + parameter structure, or something like that) as defining this entity, and I want to make the case that this entity needs to be broadened to include any part of the process that determined what data to use, how to manipulate those data, etc.

If you are using training data, then I agree that “all of the manipulations and choices about that training data” are part of the Garden of Forking Paths.

However, it sounds like you are using “model” in more than one way, and this might be causing some confusion in your thinking. My suggestion: Be aware that the word “model” is used in more than one way, and be very careful in interpreting the word in any specific context. (Example: “Normal distribution” is a model in one sense of the word “model,” but “Normal distribution with mean 2.5 and variance 4” is a more specific use of model.)

Thanks! You and Z were sensing the same imprecision in my thinking. See if my reply to Z makes any sense.

“The bottom line is I’m talking about the information that one needs in order to take an additional sample and make a prediction. I think it’s common for people to think of formula + coefficients (or algorithm + parameter structure, or something like that) as defining this entity, and I want to make the case that this entity needs to be broadened to include any part of the process that determined what data to use, how to manipulate those data, etc.”

This still doesn’t seem to get the point. Here’s a try at something that makes more sense to me: One needs to pay attention to the details of how a model was developed in order to sensibly interpret any predictions or other conclusions made using the model.

Andrew, in your writings you’ve articulated the problem with multiple *potential* comparisons (“the garden of forking paths”). However, it’s not clear to me what the solution to the problem is. Mainly, this is because I’m not sure what the root cause is.

Is the root cause simply broad-based popular ignorance of the basic principle that evidence against the null hypothesis is not evidence for any specific alternative hypothesis being proposed, especially when there are many potential alternatives? Or is it a technical issue, in that when multiple *potential* comparisons exist, the degrees of freedom used in computing a standard p-value are incorrect and need to be adjusted (either in some way well known by statisticians — but not by other scientists — or some way as yet to be understood)? Or is it a cultural issue surrounding academia and the conventions of the publication system? Or is it something else entirely?

In Gelman & Loken, 2014, you suggest as remedies pre-registration of the analysis plan and, possibly, multilevel modeling. It seems to me that if this were all it took to solve the problem, then some institution/journal/scientist would simply take the lead and the rest of the community would follow that example (^_~).

At any rate, it would be very helpful if you could point us to several examples of (1) an alternative analysis to contrast with a problematic analysis of the same data and (2) a clear example of how to communicate the results in a publication so as to avoid a misinterpretation of the results as being “statistical significance = truth of our claims”.

Thanks!

Michael:

I have a bunch of published papers here; many of them include applied analyses. You can also see my books. Short answer is that I don’t think you should be doing null hypothesis significance testing. Better to just model the problem directly, then when you find problems with your model it’s an occasion to improve it. Just forget about p-values, Bayes factors, and so on.

I recognize that in practice many strong claims are made on the basis of null hypothesis significance testing. For example, NPR and the New York Times endorsed power pose because of the “p less than .05” results. The path from p less than .05 to major media hype is complicated, but it seems pretty clear that, had Carney, Cuddy, and Yap *not* reported p less than .05, the publication of the article and all that followed would not have happened. Similarly with that claim, reported completely uncritically in the news media, of the effects of air pollution in China. So I think it’s helpful to understand the logic of how null hypothesis significance testing can go wrong.

To put it another way: I would not recommend calculating p-values. But if I never address the meaning of a p-value, I’ll get people saying that they believe their claims because of all the statistically significant p-values they obtained. The garden of forking paths is a mechanism (or, one might say, a story) for how it can happen that researchers can regularly get p less than .05 even in the presence of pure noise.
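That mechanism can be made concrete with a small simulation. Here is a sketch in Python/NumPy, with the menu of “reasonable” analyses (outlier trimming, an arbitrary subgroup split) invented for the example: each individual test is valid on its own, but reporting whichever analysis “works” pushes the rate of p < .05 well above 5% on pure noise.

```python
import numpy as np
from statistics import NormalDist

def p_diff(a, b):
    """Large-sample two-sided p-value for a difference in means."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

trim = lambda x: x[np.abs(x - x.mean()) < 2 * x.std()]  # "drop the outliers"

rng = np.random.default_rng(3)
n_sims, n = 2000, 100
single_hits, forking_hits = 0, 0
for _ in range(n_sims):
    g0, g1 = rng.normal(size=(2, n))        # two groups, no true difference
    covariate = rng.integers(0, 2, size=n)  # an arbitrary variable to slice on
    candidates = [
        p_diff(g0, g1),                                  # the straightforward test
        p_diff(trim(g0), trim(g1)),                      # "after excluding outliers"
        p_diff(g0[covariate == 0], g1[covariate == 0]),  # "effect in subgroup 0"
        p_diff(g0[covariate == 1], g1[covariate == 1]),  # "effect in subgroup 1"
    ]
    single_hits += candidates[0] < 0.05
    forking_hits += min(candidates) < 0.05

single_rate = single_hits / n_sims    # near the nominal 0.05
forking_rate = forking_hits / n_sims  # well above 0.05
```

And the forking-paths point is that this inflation occurs even if, on the real data, only one of the four analyses was ever actually run.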

But, again, to answer the question in your final paragraph, I point you toward my applied work.

Thank you for your elucidating article Andrew.

I am intrigued, but have searched (not terribly extensively, though I did do a cursory scan of your published articles) in vain for a citation for your claim that “researchers can regularly get p less than .05 even in the presence of pure noise” – which I would like to use in an article I’m writing.

Any suggestions?

Thank you again,

David

David:

The classic reference on this one is Simmons, Nelson, and Simonsohn (2011).

Another empirical option from another standpoint (this time about choice of standard error estimates in difference-in-difference estimates) from the applied microeconomics literature:

http://zmjones.com/static/causal-inference/bertrand-qje-2004.pdf

There are actually lots of ways for researchers to end up massively over-rejecting “true” null hypotheses. Andrew’s (well, Simmons & co.’s) paper points to behaviors when the probability calculations are at least theoretically close to right. The problem the paper I link to is focused on is the use of methods that seem like they should produce reasonable probability calculations (coverage rates of confidence intervals) but in application fail to produce anything close to reasonably sized confidence intervals.

And neither of these things touches issues like the Garden of Forking Paths, where the coefficients reported really might be “statistically significant” in terms of their magnitude relative to their precision, but are not actually meaningful or consistent features of the world. You can find improbable draws in any dataset, and given the flexibility of theory in many social sciences, you can always find a way to reasonably interpret those “effects” in terms of the underlying theory. I’d count that as a way for researchers to routinely find p<0.05 in the presence of pure noise, but I don’t have an empirical demonstration, simulation, or placebo test reference for that (it is really just the result of probability calculations, multiple hypothesis testing over many outcomes/subgroups, and the flexibility of theory to explain anything).

David:

This is not a citation, but can help you see why “researchers can regularly get p less than .05 even in the presence of pure noise”:

Go to

http://www.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html

and look at the items labeled:

Jerry Dallal’s Simulation of Multiple Testing

Jelly Beans (A Folly of Multiple Testing and Data Snooping)

More Jerry Dallal Simulations: More Jelly Beans; Cellphones and Cancer; Coffee and …

Michael,

I hope this doesn’t come across as snarky, but I think that one big problem is that people expect a “root cause” of the problem. As I see it, the problem (as with many real life problems) is one with many contributing factors.

One consequence is that it is not reasonable to expect a simple remedy. So, for example, I take the Gelman and Loken suggestions (pre-registration of the analysis plan, multilevel modeling) as suggestions that can help in some instances, but not as panaceas.

Still, I think the request in your last paragraph is a good one. Possibly commenters can help by providing some examples. But your request really is a request for a book of case studies.

“Valid p-values cannot be drawn without knowing, not just what was done with the existing data, but what the choices in data coding, exclusion, and analysis would have been, had the data been different. This ‘what would have been done under other possible datasets’ . . .”

Hey, this inspires a good question for Columbia’s stat qualifying exam:

Imagine you’re doing a preregistered study on whether meteorites are caused by a hidden asteroid belt. Initially you get a nominal p-value of .25 using H0: no hidden asteroid belt.

But if the data had been different, and there were a lot more meteorites hitting the earth due to a hidden asteroid belt, then there’s an X% chance the details of the preregistration would be destroyed and the original researchers killed. The research would then be completed by a new individual, chosen not at random from among the world’s surviving population, who would analyze the data differently than the original preregistered design.

What’s the real p-value?

What does “valid p-value” even mean? A p-value: “Frequency with which a particular named random number generator generates fake data whose sample test statistic is as extreme or more extreme than the experimentally observed test statistic”

The validity of that statement doesn’t change under the garden of forking paths… so, when there are many choices that the experimenter COULD make about which test to do, we agree that p < 0.05 is not very surprising (it happens much more often than in 5% of experiments that chase nothing but noise)…
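That frequency definition can be implemented directly. Here is a minimal Monte Carlo sketch in Python/NumPy, where the null generator, sample size, and observed statistic are all invented for the illustration:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)

# The "named random number generator": under H0 the data are 30 iid N(0, 1) draws.
# Test statistic: the absolute value of the sample mean.
t_obs = 0.5   # pretend this is the statistic computed from the real data

# p-value as a frequency: how often does fake data from the null generator
# produce a statistic at least as extreme as the one observed?
fake_stats = np.abs(rng.normal(size=(100_000, 30)).mean(axis=1))
p = (fake_stats >= t_obs).mean()

# sanity check against the exact normal tail probability
p_exact = 2 * (1 - NormalDist().cdf(t_obs * np.sqrt(30)))
```

The simulation is a perfectly valid frequency statement about that generator; the forking-paths problem enters one level up, in how t_obs and the generator got chosen.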

Since p is a frequency, NOT A PROBABILITY, let's model it (from now on, capital P will denote probability). The symbols H0 will mean "H0 was chosen as the null hypothesis of interest," "Choices" will mean "we made some specific choices about data coding etc.," and K is our knowledge of how the sausage is made.

P(p < 0.05 | Data, K) = sum over i of P(p < 0.05 | Data, H0_i, Choices_i) * P(H0_i, Choices_i | Data, K)

Now, we know how the sausage is made, so we know that people choose comparisons and data that make p < 0.05 happen, so

P(p < 0.05 | Data,H0,Choices) ~ 1

Now, suppose there are N possible choices of (H0, Choices) which lead to p < 0.05. Since we know how the sausage is made (K), we know that only one of these will be chosen, so we assign probability 1/N to each of those which "work for our career."

Now, we marginalize away H0 and Choices…

P(p < 0.05 | Data, K) = sum(1 * 1/N, i = 1..N) = 1

It's a little tongue-in-cheek, but only a little. Consider the following two rephrasings of the same thing:

"oh, that's kind of an interesting looking plot, let me see if the difference there is significant"

"oh, if I choose this comparison as a null hypothesis the data seems likely to contradict it… let me see if that's numerically true, yes it is, I will now obtain tenure provided that I have sufficient storytelling skills"

They're really the same thing. In other words, as a Bayesian, the knowledge we have of how the sausage is made ensures that we put nearly P = 1 on the hypothesis that the sausage contains a small p less than 0.05.

Even if you don't "feel like" you're p-hacking, the phrase "oh that's kind of interesting, is that difference there significant?" is basically just directed depth first search for p < 0.05, and when there are Q possible things to try and N of them will "work" you only have to try Q/N things, which means that the more flexibility (forking paths… bigger N) the less time it takes to find p < 0.05, and if it doesn't take a lot of trials, then it doesn't "feel" like p hacking…
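The Q/N arithmetic can be checked by simulation. A sketch (Python/NumPy) with Q and N picked arbitrarily: if N of Q available comparisons would come out "significant," a researcher trying them in some order needs on average about (Q + 1)/(N + 1) ≈ Q/N attempts before one works.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, N = 100, 20        # Q possible comparisons; N of them would yield p < .05

trials = []
for _ in range(5000):
    works = np.zeros(Q, dtype=bool)
    works[rng.choice(Q, size=N, replace=False)] = True  # which comparisons "work"
    order = rng.permutation(Q)                          # the order they get tried
    trials.append(1 + np.argmax(works[order]))          # stop at the first success

mean_trials = np.mean(trials)   # about (Q + 1) / (N + 1), far fewer than Q
```

With these numbers, around five tries on average: few enough that, as the comment says, it never "feels" like p-hacking.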

The solution to this problem is NOT to prevent people from doing "depth first search for interesting things" (ie. multiple testing corrections, preregistration etc) it's to force people to simultaneously provide evidence FOR their favored hypothesis (data that has high likelihood under their hypothesis), and AGAINST the myriad others which we consider credible prior to experiment (data should have low likelihood under all other credible hypotheses). In other words, try to falsify REAL THEORIES that we find plausible.

“try to falsify real theories that we find plausible” sounds like Bayesian sophistry and sorcery to me. I just want Mayo and Gelman to calculate the objective real p-value for the example I gave thank you very much.

Why didn’t you say so, obviously p = 0.0495

Laplace,

What is your definition of “objective, real p-value”?

I’m pretty sure you’ll find that’s tongue in cheek; he’s attempting to point out that there IS no such thing.

That would be awfully cheeky of him! ;>)

Re the exclusion of measurement as a major concern: I wonder if its absence reflects how measurement issues in some fields are relatively straightforward while in other fields they are endlessly difficult, if not insurmountable.

Consider differences across the social sciences. I often envy economists because they appear to have a set of measures and datasets that are widely accepted within their field. That doesn’t necessarily make these measures valid (and some ongoing debates over the utility of GDP, for example, indicate that even economists are increasingly concerned with their measures). But it does suggest that economists have a set of indicators that they readily use without controversy. That means they can focus on other statistical challenges.

By comparison, political scientists and sociologists seem to be engaged in an ongoing struggle over how to measure their discipline’s core concepts of interest. (Reliable and valid measures of power? Yeah, right.)

So while measurement is fundamentally important to all research, the challenges it poses may not be equally obvious across all disciplines.

“economists have a set of indicators that they readily use without controversy. That means they can focus on other statistical challenges.”

I disagree that having “a set of indicators that they readily use without controversy” means they can stop focusing on whether or not their “indicators” are good for the purposes intended — which might vary from question to question.

Andrew, in response to Ben you wrote: “What I’m criticizing is ‘null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B.’” Maybe that is what you want to criticize, but you keep criticizing P-values. I agree with your points about accept/reject use of P-values, but P-values are not tools for decision; they are descriptions of evidence.

A P-value answers the question of what the data say about the null hypothesised value of the parameter of interest within the statistical model. Neyman-Pearsonian hypothesis tests yield an ostensibly dichotomous decision regarding the null hypothesis, and they provide an answer to the question “what should I do or decide now that I have these data?”, as long as a loss function is provided by pre-data planning of analysis power and alpha. A hypothesis test does not require a P-value, despite conventional practice, and a P-value is not a hypothesis test.

You muddy the waters by failing to distinguish between significance tests and hypothesis tests.

P-values from significance tests are useful and appropriate tools for exploratory studies, and your garden paths insight is not relevant there. Neyman-Pearsonian hypothesis tests are appropriate tools for planned studies intended to be definitive, and your garden paths are a consideration for them. If you do not distinguish between the different types of study you will lead scientists into poor practices and encourage the continuation of their understandable confusion and misdescription of their actual procedures.

Michael:

I commented on p-values because the American Statistical Association asked me to. What are the poor practices that you think I am leading scientists into?

Andrew, by assuming that P-values are only used as part of a decision procedure, you ignore their appropriate role in consideration of the evidence available. Your repeated assertions like this: “to compute a valid p-value you need to know what analyses would have been done had the data been different” are false when the P-values are used to see what the data say about the null hypothesised value of the parameter of interest.

Scientists will continue to use a hybrid approach to testing as long as we continue to blur the distinction between evidence-assessing significance tests and decision-producing hypothesis tests (and a generation longer, I expect). Your garden of forking paths discussions are relevant to P-values used as a proxy for the critical regions of hypothesis test procedures, because they affect the relative values of alpha and beta. P-values from exploratory studies interpreted as evidence are not affected by paths or intentions.

Scientists do (and should do) more exploratory studies than planned studies. We want them to analyse all of their experiments sensibly and to make safe inferences. Not every stage of a scientific program should be analysed for the same information.

Michael presents a caricature of both significance tests and N-P tests, based on an account that’s foreign to both: Royall-style likelihoodism. Cox has talked specifically of p-value adjustments for selection since the 1950s, and I take him as the spokesperson for Fisherian tests. Significance tests and N-P tests both involve the use of p-values (as Lehmann is clear). Their main differences really just involve different types of problems, not different types of reasoning or goals; e.g., in one the alternatives are explicit. In the most familiar cases, one winds up in the same place starting from Fisher’s goals or starting from N-P optimality goals, as has been proved.

It’s also absurd to allege that a stringent test requires an unending list of what you would have done had your mother called, etc. Here’s my comment on the doc, by the way: http://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/

Mayo:

I think the problem comes because researchers are typically not using p-values or hypothesis testing to test a model they care about. That is, they’re not doing stringent testing or severe testing or Popperian reasoning or whatever. Rather, they’re rejecting straw-man null hypothesis A as support for their preferred alternative B.

Sure, we agree on all that, but it’s a separate issue. It’s altogether essential to be able to identify strong evidence or a warranted inference or whatever one likes to call it. If someone says an impossible condition has to be met to identify compelling evidence, then there’s something wrong with that account of how strong inferences are ever warranted in science. Nor do we need such a condition to criticize moving from a statistically significant result x–even assuming that it is legitimate–to a theory that entails it. I’m not denying it’s relevant to consider flexible choices when considering if a purported p-value is spurious. That’s what the constraints on design, preregistration etc. are supposed to achieve. The onus is on the researchers to show they’ve bent over backwards to probe every loophole. If a domain is unable to achieve or even improve with such assurances, then it’s a pseudoscience. Separately, we’d need to scrutinize if they were even picking up on what they purport to be studying. I entirely agree that there needs to be much more scrutiny of the measurements and proxy variables and toy experiments in certain fields. I don’t even think it would be too difficult to carry this out in many cases. (Faster than replication attempts.)

Mayo, you really need to stop saying that I have a Royall-style approach. Royall may be your personal bête noire, but I am not he, and AWF Edwards, Fisher and Basu are much more important to my attitude.

I did not imply that a stringent test requires an “unending list”. You brought that notion into the conversation all by yourself.

As far as I can tell, our difference over what constitutes a hypothesis test comes from our different interpretations of what Neyman and Pearson meant by this passage wherein they set out their difference from Fisher’s and Student’s significance tests:

“We are inclined to think that as far as a particular hypothesis is concerned, no test based on the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.”

“But we may look at the purpose of a test from a different viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.”

—Neyman & Pearson, 1933 pp. 290–291

I take those two sentences as a repudiation of a probabilistic interpretation of a P-value from a significance test, and as the underlying logic of their alternative, the hypothesis test. You, however, disagree with the straightforward interpretation of those sentences.

Michael: I was just trying to identify likelihoodism as a position, as opposed to talking of likelihood ratios, because there are very, very few pure likelihoodists around, and because you always, always rehearse the Royall questions. OK, I don’t know the fine difference (let me know some time), but I don’t like using your “likelihoodlum” expression either. But none of this matters.

I have no clue what you can mean by saying N-P repudiate “a probabilistic interpretation of a P-value from a significance test”. Sorry, what? Are you saying Fisher favors a probabilistic interpretation of a p-value and N-P do not? Truly this makes no sense; perhaps it’s a misprint. Are you talking fiducial (for Fisher)? As for N-P, N-P spokesman Lehmann calls the p-value the “significance probability”–pretty clearly still a probability. So I’m perplexed.

Oh, by the way, the unending list was a reference to Gelman; I was taking up both of your points at once, as I should have indicated.

Mayo, in the first edition of his book Lehmann makes no mention of P-values by name. (It is as if they are Voldemort.) The nearest he comes is this [bits in square brackets are mine]:

“In applications there is usually a nested family of rejection regions, corresponding to different significance levels. It is then good practice to determine not only whether the hypothesis [i.e. the null hypothesis] is accepted or rejected at the given significance level [i.e. at the predetermined alpha level], but also to determine the smallest significance level α̂ = α̂(x), the _critical level_, at which the hypothesis would be rejected for the given observation. This number gives an idea of how strongly the data contradict (or support) the hypothesis, and enables others to reach a verdict on the significance level of their choice.”

—Lehmann, Testing Statistical Hypotheses, p. 62.

(In the third edition, that paragraph is modified to include the phrase P-value.)

Where does it say that the P-value should be interpreted as a probability? How can that paragraph be read as anything other than an endorsement of an acceptance procedure?
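For what it’s worth, the “critical level” in that passage is numerically just the p-value: the smallest significance level in the nested family at which the observation falls in the rejection region. A minimal sketch for a one-sided z-test with a made-up observed statistic (whether the resulting number should then be interpreted as a probability is, of course, the very point in dispute):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def rejects(z_obs, alpha):
    """One-sided z-test: reject H0 when z_obs exceeds the upper-alpha critical value."""
    return z_obs > Z.inv_cdf(1 - alpha)

z_obs = 2.1  # hypothetical observed statistic

# Lehmann's "critical level": the smallest alpha on a fine grid of nested
# rejection regions at which this observation would be rejected.
grid = [a / 10000 for a in range(1, 10000)]
critical_level = min(a for a in grid if rejects(z_obs, a))

# It coincides, up to grid resolution, with the usual p-value P(Z >= z_obs).
p_value = 1 - Z.cdf(z_obs)
```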

No room under your comment. I’ve responded to you over at Elba: http://errorstatistics.com/2016/03/07/dont-throw-out-the-error-control-baby-with-the-bad-statistics-bathwater/#comment-139546

Michael,

Could you please elaborate more on the use of p-values in exploratory studies? Especially in terms of why I would use the p-value as the statistic of choice to find interesting associations in a messy dataset. Shouldn’t I focus on theoretically relevant variables, even if they are not statistically significant?

Since the computation of p-values comes with many additional assumptions (e.g., independence in the case of t-tests), they can’t be used as a reliable measure of evidence if my data don’t come from some well-defined experiment or sampling procedure.

Erikson, P-values provide a convenient way to grade the observed effects or associations on a scale of statistical significance, and thereby provide a way to rank them as prospects for further study. You can use that rank any way that you like, and so if there are some variables or comparisons that are of special interest because of theory, previous information, preferred colour or flavour, then you are free to add that special interest into the mix when designing the confirmatory studies. As one of the other participants at the ASA P-value discussion said (Naomi Altman, I think), a small P-value says “Look over here!”. The smaller it is, the more loudly it shouts.
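That “look over here” screening is easy to sketch. In the toy example below, everything is hypothetical (the data, the variable names, and the use of a Fisher-z approximation for the correlation p-values); the truly associated variable ends up shouting loudest:

```python
import math
import random
from statistics import NormalDist

random.seed(1)
Z = NormalDist()

# Hypothetical exploratory dataset: one outcome, four candidate predictors,
# with only "x3" actually related to the outcome.
n = 200
xs = {name: [random.gauss(0, 1) for _ in range(n)] for name in ["x1", "x2", "x3", "x4"]}
y = [0.4 * xs["x3"][i] + random.gauss(0, 1) for i in range(n)]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - ma) ** 2 for ai in a) * sum((bi - mb) ** 2 for bi in b))
    return num / den

def approx_p(r, m):
    """Two-sided p-value for a correlation via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(m - 3)
    return 2 * (1 - Z.cdf(abs(z)))

# Rank the candidate associations by p-value, smallest (loudest) first.
ranked = sorted((approx_p(corr(x, y), n), name) for name, x in xs.items())
```

The ranking is a triage device, nothing more: the variables at the top are prospects for a confirmatory study, not findings.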

Yes, P-values are affected when the assumption of independence of observations is false, but surely all alternatives are also affected unless the statistical model is rich enough to include the degree of dependence. Presumably a rich model can give a P-value that takes the dependence into account.

I do not advocate P-values as the _best_ way to see what the data say. Likelihood functions are a better way to see that because they show how the observations match up against all possible values of the parameter of interest, not just the value that is set by the null hypothesis. Most times the same statistical model that gives a P-value can give a likelihood function (maybe always?). However, this current fight is about the role of P-values, not the utility or validity of the likelihood principle.

Give salmon A 2 Gy of ionizing radiation and salmon B no radiation. Salmon A lays 500 eggs and salmon B lays 1000 eggs. What are nA and nB? It’s pretty clear what n is in this case, though what I call the two-fish fallacy is often committed in biology.

But now suppose we do an experiment to determine whether MRI causes double-strand breaks of human lymphocyte DNA, and the identification is based on seeing a fluorescent locus in a cell nucleus at a proportion of about 0.015 per cell. Neglecting whether the experiment is blinded or not and whether one observer or an automated system is used for the detection, how many cells must be observed to achieve a p-value of 0.01 in a study that expects to see an average DSB per cell of 0.030? But then tell me how many samples (or patients) must be studied, as each is an independent generator of the cells to be examined.
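A rough back-of-the-envelope answer to the cell-count part, treating foci counts as Poisson and using a normal approximation; the 80% power target is my added assumption, and this deliberately ignores the within-patient dependence that the last question is really about:

```python
from statistics import NormalDist

Z = NormalDist()

# Rates from the comment: background 0.015 foci per cell, expected exposed 0.030.
lam0, lam1 = 0.015, 0.030
alpha, power = 0.01, 0.80  # one-sided alpha; the power target is an assumption

z_alpha = Z.inv_cdf(1 - alpha)
z_beta = Z.inv_cdf(power)

# Cells per arm so that the expected two-sample Poisson z statistic clears
# z_alpha with the chosen power (normal approximation to the rate difference).
n_cells = ((z_alpha + z_beta) ** 2 * (lam0 + lam1)) / (lam1 - lam0) ** 2
```

That comes out to roughly 2000 cells per arm; the number of patients is a separate and harder question, precisely because cells from one patient are not independent replicates.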

The garden of forking paths is frequently discussed here, but I have never seen an actual example of what it would look like. Would someone be willing to give a quick example? Is it like a researcher seeing that a certain mean comparison is likely to be significant (based on descriptives) and then deciding to run that analysis (i.e., NHST) rather than another?

AJ:

Discussion of forking paths and several examples are here.

Given the role of the forking path in the experiment, is it really the case that the p-value should be the issue? e.g. in a well-designed RCT for a specific drug against an alternative, where everyone involved is blinded and confounder adjustment is unnecessary, in a large sample it surely doesn’t matter whether you use frequentist, Bayesian or eyeball methods? In this case I would expect that debates about Bayesian vs. frequentist, for example, would come down to whether there is pre-existing strong evidence against or for a particular hypothesis.

In contrast if you dig out a bunch of e.g. Demographic and Health Surveys and want to use them to test whether female genital mutilation is more prevalent in Muslim than non-Muslim communities (just to give a completely non-controversial example), then you’ll run into issues of missing data, covariate selection, sample design, reporting bias etc. By the time you’ve dealt with all them and gone through all the possible forking paths of model selection, surely whether you presented a p-value is the least of your concerns? In this case even if you could come up with a non-controversial prior distribution (ha!), your results will be so dependent on model selection and data handling methods that any final decision-making criterion is of limited relevance?

A well-designed RCT can still have problems. Examples: How to deal with dropouts and other missing data; getting enough subjects for a large sample. Design is indeed important, but implementation is equally important.

>”in a well-designed RCT for a specific drug against an alternative, where everyone involved is blinded and confounder adjustment is unnecessary, in a large sample”

Can you give a real example of this we can discuss?

I was thinking more of the putative ideal RCT, so I don’t have any real world examples. Even with minor dropout and missing data though the same point holds – a good RCT will reduce almost all of the forking paths, leaving you with clear and simple analytical decisions. At that point, does the choice of discrimination method (e.g. p value vs. posterior interval) matter that much?

Usually not, especially if the sample size is large enough.

I do consider random assignment almost necessary to determine the distribution of calculated p-values in conceptually repeated studies if no effect was in fact true.

When that distribution is ill determined (almost always in epidemiological studies) I don’t see any way of making sense of reported p-values (or error rates, Bayesian posteriors, credible intervals, likelihood ratios, etc.) no matter how carefully calculated. Sander Greenland argues that point repeatedly in his work and suggests multiple bias analysis based on informed priors as a way forward.

So I would agree: this is just about what to make and say about p-values when they result from well-designed RCTs (more generally, largely bias-free study designs) that did not incur serious problems (which is rare in many areas).

Maybe in another 10 years the ASA will tackle this aspect.

I think that first-in-human Phase 1 safety trials approach this. The subjects are generally young males in good health. They are living in a lab because they get so many tests and can all eat the same food. Also, the studies are often set up as crossover designs, so blood levels are compared within the same subject. This data is used to show the pharmacokinetics of the test drug and also to flag any unexpected lab results.

Once the drug is given to sick people, confounding and unexpected results do become a problem. But drug studies have extensive protocols and analysis plans that give direction on how to handle data.

I think a counterexample might be the PACE trial that Andrew wrote a lot about earlier this year (e.g. http://andrewgelman.com/2016/01/13/pro-pace/). You could have a lot of subjects, you can randomize them and blind everyone to conditions, etc. But if you have multiple potential measures of interest (and particularly if you might be willing to change your mind about which ones are “important”), then you have the opportunity for forking paths. On the other hand, if you take Andrew’s advice to model everything, then perhaps you would see that even the effects selected for p-value significance aren’t that impressive.

BTW, a couple of comments (both quotes from Ron Wasserstein) from the report’s press release (that Andrew gives a link to) that I particularly appreciate (because they are points that are often neglected):

“The p-value was never intended to be a substitute for scientific reasoning,”

and

“The issues involved in statistical inference are difficult because inference itself is challenging.”

Agree, but I think people need to read through all the submissions in the supplement – which I have started on but not completed ;-)

These two comments seemed to be pulling in concerns that I would like to try to summarize at some time, but for now –

Re: the first comment, very informed scientific reasoning is needed to arbitrate the assumptions required so that the p-value (or any other approach) is in any way sensible.

And for the second: inference (making sense of others’ empirical research, even if one was involved) is not just challenging but likely beyond most statisticians – given it is seldom if ever addressed in today’s MSc/PhD stats training (see the Don Berry quote above).

I think the challenge in making the report was to avoid being too clear about what is beyond statistics and what else is beyond most of today’s statisticians – in what it is expected/desired that p-values and other statistical techniques should/can provide (in the hands of most of today’s statisticians) to help make sense of empirical research.

P-values have fallen into disrepute partly because “statistics for X” courses have failed to teach a version of the subject which is both accessible and correct. This is unfortunate, because reserving statistics only for an elite body of experts will have consequences for society. An activist who has found evidence of a danger to public health should not be dismissed because they were out-argued by an elite statistician hired by the multinational responsible for the danger. The approval of a drug by the FDA should depend on the efficacy and safety of the drug, not the relative number of elite-statistician-hours bought by the FDA and by the company that developed the drug.

Is it possible to teach a version of “statistics for X” which conveys statistical correctness, possibly at the cost of some statistical power? I would class Conover’s “Practical Non-Parametric Statistics” and possibly Abelson’s “Statistics as Principled Argument” as attempts at this.

This might be somewhat along the lines of what you are looking for:

http://www.ma.utexas.edu/users/mks/CommonMistakes2014/commonmistakeshome2014.html

It is the class home page (including class notes) for a 12-hour “continuing education” type course I have taught that has as one aim helping people become critical readers of research using frequentist statistics. (It does assume a first course in basic statistics, but goes into more detail than such a course in explaining some of the subtleties of p-values, etc.)

A very nice statement by the ASA, useful for both statisticians and researchers. But it’s published in TAS, a journal that few scientists read! It would be great to get the word out more broadly, maybe the tabloids will pick it up. At least it’s open access…

Stan:

Somebody’s got to teach the news media that there’s nothing special about the tabloids!

> to move toward a greater acceptance of uncertainty and embracing of variation.

Is there something I can read that explains what this means in practice?

Jeff:

See my applied work, for example Red State Blue State. We present a lot of data and discuss some theories but without trying to wrap everything up in a single clean story. Or our 2012 update here, where we talk about what aspects of our earlier findings remained and where we were uncertain. Or this paper on historical elections where we try out a bunch of stories. Or our 1993 paper on the polls where we go through a bunch of possible explanations for variation.

Andrew, how much impact did Red State Blue State have? My feeling is that anyone who implements Andrew’s recommendation here (to embrace uncertainty and variation) will get a big fat rejection when they submit to a journal, unless someone enlightened about the issues is reviewing and editing it (almost never the case). Even when embracing variation, one has to somehow engineer a story, which inevitably takes us into speculation dressed up as certainty.

Hi Andrew,

Just to clarify, you said above:

“I think the problem comes because researchers are typically not using p-values or hypothesis testing to test a model they care about. That is, they’re not doing stringent testing or severe testing or Popperian reasoning or whatever. Rather, they’re rejecting straw-man null hypothesis A as support for their preferred alternative B.”

I haven’t read enough of your blog to know how you define a straw-man null hypothesis (feel free to point me to a relevant previous blog post), so I’m trying to understand why you think p-values and the like have to go. It seems to me that p-values are okay when used appropriately by scientists who actually care about testing a hypothesis/theory versus telling a splashy story that will advance their career/make them a celebrity. But you are suggesting that even if one follows the rules and takes into account analyses they would have done had the data been different, that’s still not enough to make p-values useful in science. Are you saying this because you think they are always likely to be abused or because you think they really can’t ever tell us anything informative?

Let’s say Amy Cuddy’s finding was actually a true effect and that the steps taken to test the hypothesis/analyze the data were completely appropriate, and the p-value was significant. And let’s say the finding was replicated multiple times. Would you still find this completely uninformative? Sure, the null hypothesis is no effect of power poses on psychological and behavioral states, but is that really that lame/deficient? What would be a better alternative? Am I naive to think this would count as evidence that brief power poses can help make me feel better about myself and that I should try them?

Sabine,

Sorry to say, but yes, I do think you would be “naive to think this would count as evidence that brief power poses can help make [you] feel better about [your]self and that [you] should try them”.

One reason: The type of “findings” you discuss are about averages (or some similar overall measure), not about individuals. So even if power poses on average help people feel better about themselves, we can make no conclusions about their effect on individuals; some individuals might even feel worse.
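A quick simulation makes the average-versus-individual point concrete. The effect distribution below is entirely hypothetical: the average benefit is clearly positive, yet a large fraction of individuals would be worse off, and nothing in the group mean reveals this:

```python
import random
from statistics import mean

random.seed(7)

# Hypothetical heterogeneous treatment effects across individuals:
# mean benefit 0.3, person-to-person standard deviation 1.0.
effects = [random.gauss(0.3, 1.0) for _ in range(100_000)]

avg = mean(effects)  # what a mean-comparison study estimates
frac_worse = sum(e < 0 for e in effects) / len(effects)  # invisible in the average
```

Here `avg` is about 0.3 while `frac_worse` is nearly 40%, so “helps on average” and “might make you feel worse” are both true at once.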

So you are saying we can’t infer anything probabilistic from averages? This seems extreme. Of course power posing could be terrible for me and my self-concept, but insofar as it seems to be generally beneficial for a sample that adequately represents me, I think it would be reasonable for me to give it a try (if I care to improve my feelings of power).

I’m not aware of (and doubt that there is) any “objective” way that we can infer anything probabilistic from averages. However, any individual is free to use results about averages to come up with a subjective probability or otherwise use a result about averages in their own personal decision making.

“Am I naive to think this would count as evidence that brief power poses can help make me feel better about myself and that I should try them?”

How about just getting and plotting estimates of the mean and the uncertainty intervals with very large sample sizes and a repeated measures design, from many replications? What does the p-value add beyond the information you would gain from those summaries?

The p-value is just a ritual incantation we use to justify our journal article’s existence. I have recently been reading some papers from an author (a winner of millions of euros in grants) who literally made up the p-values, as in the published p-values are not even remotely related to the published t-scores (I’m not talking about rounding errors, but things like t=0.1, p<0.001).
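Mismatches like t = 0.1 alongside p < 0.001 are mechanical to screen for. A crude sketch of such a check, using a normal approximation to the t distribution (adequate for moderate-to-large df); this is a toy, not any particular published tool:

```python
from statistics import NormalDist

Z = NormalDist()

def flag_inconsistent(t_reported, p_reported, tol=0.05):
    """Recompute a two-sided p from the reported t statistic (normal
    approximation) and flag reports where it is wildly off from the
    published p-value."""
    p_recomputed = 2 * (1 - Z.cdf(abs(t_reported)))
    return abs(p_recomputed - p_reported) > tol, p_recomputed

# The example from the comment: t = 0.1 reported alongside p < 0.001.
inconsistent, p_check = flag_inconsistent(0.1, 0.001)
# p_check comes out around 0.92, nowhere near 0.001, so this report is flagged.
```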

Also, I've recently been reading several papers published in top journals where *none* of the effects in the planned comparisons were statistically significant, but the authors p-hacked their way to significance using various tricks. Basically, all you have to do is ensure that there is *some* p-value somewhere that is below 0.05. I have done this myself, in an earlier phase of my life.

The p-value just lends scientific credibility to the statement, "hey, I'm right after all". If the p-value falls above 0.05, but only just, there are some 500 euphemisms available in the psychological sciences for arguing that yes, I was right (when I am in that situation, I write "did not reach the conventional level of significance", but I only write it as an insider joke now). There is no possible universe in which Gilbert or Cuddy or anyone else would publish a paper saying, guys, I can't find evidence for my theory. Try naming some people in your field who published evidence against their own theory. In a 30-40 year career it is statistically impossible that a researcher will never find evidence inconsistent with their theory (even if it's a mistake).

So, if there is only a unidirectional outcome possible with the p-value, why bother? Just publish your means and SEs, as a sort of poor man's Bayesian approach, and move on. Or just fit a Bayesian model and report the posterior and let your detractors deal with it.

> just fit a Bayesian model and report the posterior and let your detractors deal with it

Now, if we only had a clear understanding of what to make of posterior probabilities that could be widely communicated…

To the general scientific community!

I like the analogy for science Peirce offered, of standing in a bog where the ground seems secure, ready to move when we realize it’s giving way. Statistics is just the science of inductive inference, informed/enabled by math, and I believe the same analogy is apt. There are no sure methods/solutions available, rather aspects we think for now we sort of get not too wrong. Unfortunately, working in math continually seems to mislead many of us into thinking that somewhere/someday there will be sure methods/solutions (e.g. Dennis Lindley’s interview with Tony O’Hagan, where he talked about finally making statistics a rigorous axiom-based subject like all other areas of math).

More simply put – we don’t have solutions but rather just (hopefully sensible) ways to struggle through observations we somehow get.

Wouldn’t that make a great motto for a statistical society?

Okay, I understand the view that p-values are widely abused, but that’s not the p-values’ fault. They still seem to have utility when used correctly. They allow you to indicate how improbable your results are under the null, and take this as support for your theory. What I am really wanting to know is why Andrew thinks this is uninformative, and really my question is specifically asking about his view that looking at the improbability of one’s results under the null is always or even often so deficient as to be completely uninformative.

Sabine,

Nobody’s blaming the p-values.

But saying, “They allow you to indicate how improbable your results are under the null, and take this as support for your theory” misses many important points. To list just two:

1. The p-value depends on the model. So to rationally justify using a p-value as support for a theory, you need to provide a good reason why the model adequately fits the theory and the question being asked about it. This is rarely done.

2. The p-value depends on the sample size. So at the very least, you need to consider what sample size gives you a good chance of detecting a difference of practical importance. This is also rarely done.

In particular, I don’t think either of these points was addressed in the power pose case that you are using as an example.

PS

Daniel Lakeland just gave a good one-line summary of the problems with p-values on another thread (http://andrewgelman.com/2016/03/10/good-advice-can-do-you-bad/#comment-265719):

“Garden of Forking Paths” means little more than “p < 0.05 is easy to find”
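That one-liner is easy to demonstrate. The simulation below generates pure-noise data and lets a hypothetical analyst choose among a few defensible analyses (full sample, outliers excluded, one subgroup or the other), reporting whichever p-value is smallest; the specific forks are illustrative assumptions, and the test itself is a crude normal approximation:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(3)
Z = NormalDist()

def p_two_sample(a, b):
    """Two-sided p for a difference in means, crude normal approximation."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return 2 * (1 - Z.cdf(abs(mean(a) - mean(b)) / se))

def one_forked_study(n=40):
    """Pure-noise treatment and control groups, but four defensible analyses
    are available, and the smallest p-value gets reported."""
    treat = [random.gauss(0, 1) for _ in range(n)]
    ctrl = [random.gauss(0, 1) for _ in range(n)]
    grp = [random.random() < 0.5 for _ in range(n)]  # hypothetical covariate split

    analyses = [
        p_two_sample(treat, ctrl),                                # full sample
        p_two_sample([x for x in treat if abs(x) < 2],
                     [x for x in ctrl if abs(x) < 2]),            # "outliers" excluded
        p_two_sample([x for x, g in zip(treat, grp) if g],
                     [x for x, g in zip(ctrl, grp) if g]),        # subgroup 1 only
        p_two_sample([x for x, g in zip(treat, grp) if not g],
                     [x for x, g in zip(ctrl, grp) if not g]),    # subgroup 2 only
    ]
    return min(analyses)

# Nominal rate is 0.05; picking the best of four "valid" analyses inflates it.
false_positive_rate = mean(one_forked_study() < 0.05 for _ in range(2000))
```

The point of the garden of forking paths is that this inflation happens even if the analyst only ever runs one of the four analyses per dataset, as long as which one gets run depends on the data.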

I understand your points that these things are rarely done, but what I’m asking is whether it is the case that when these things ARE done, does someone taking Andrew’s position still find p-values worthless. I use power posing as a hypothetical example (i.e., assuming a large sample, a prespecified hypothesis and appropriate model to test it, no changing the data analysis upon seeing the data, etc.). I’m just seeking to understand why someone would not find value in p-values when these conditions are met. Specifically, if it comes down to the null being a straw man, I don’t see why that is so in a scenario like this. Is contrasting a hypothesis against the null really so weak? I know it’s argued that this is weak because the null is never really true, and if that’s what he’s getting at, I’d be interested in knowing. I’m not sure I find that a compelling argument to abandon p-values but I’m willing to entertain the thought.

Well, once you’ve rejected the null that mu=0, there are an infinity of possible values implied by mu!=0. If you did a power pose study and rejected the null that mu=0, would you be willing to write that we have evidence that the effect of power posing is to increase or decrease mu? No, of course not. You look at the sign of your sample mean and if it is positive, you conclude that power posing increases mu. However, the p-value didn’t give you that. The p-value only helped you reject the hypothesis that mu=0. The next step, to argue for your favored alternative hypothesis, that mu is positive, is not based on the null hypothesis test itself. Also, when you do a single statistical test and get p=0.0001, you don’t know which of three possible worlds you are in: (a) the null is actually true and you just got “lucky”; (b) the null is false but you are in a low power situation, in which case you are in danger of suffering from Type S and M errors (wrong sign, exaggerated effect); or (c) you are in a high power situation (in which case you are golden). If all our experiments involved situation (c), life would be good. But they don’t. With Bem, Cuddy, and the red-color-and-sex studies, it’s more likely that mu is very close to zero (just because it’s implausible to think that people have ESP, that life is just a matter of waving your arms in the air, or that there are simple outward indicators of biological events), so we usually end up in situation (a) or (b), which means that most published studies are just publishing noise.
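Situation (b) is easy to simulate: make the true effect small relative to the standard error, keep only the “significant” results, and look at what survives. All the numbers here are hypothetical:

```python
import random
from statistics import NormalDist, mean

random.seed(11)
Z = NormalDist()

# Hypothetical low-power setting: true effect 0.1, standard error 0.5.
true_mu, se = 0.1, 0.5
z_crit = Z.inv_cdf(0.975)  # two-sided test at the 0.05 level

# Many conceptual replications of the same noisy study.
estimates = [random.gauss(true_mu, se) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) / se > z_crit]

exaggeration = mean(abs(e) for e in significant) / true_mu  # Type M (magnitude) error
wrong_sign = mean(e < 0 for e in significant)               # Type S (sign) error
```

In this setup the statistically significant estimates overstate the true effect by roughly a factor of ten, and more than a quarter of them have the wrong sign: exactly the danger of publishing whatever clears p < 0.05 in a low-power study.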

Also, if you run many replications, and if you repeatedly get p<0.05, then you can start to believe that your effect might be real. But this information is not coming from the p-value; it's coming from the replications. So, in the interesting case where you can replicate your effect, the p-value is giving you nothing; it's the consistent sign and magnitude of the sample mean that is giving you something. (This is what Andrew calls the "secret weapon".)

Okay, but in my hypothetical example I’m using the p-value to test whether a difference in a predicted direction would be highly surprising/improbable under the null. I’m assuming I have good power to detect the effect. True, this could still be a false positive, but that’s besides the point. I’m asking why, even under these ideal circumstances, would someone argue that null hypothesis testing is useless. My interpretation of Andrew’s comments is that even if all experiments involved your situation c, he would still reject null hypothesis testing, and I want to know why and whether there is sound justification for that or whether it is going too far.

Also, if you have high power this doesn’t mean you can assume the null is false even before you do the study, right? (Otherwise why do the null hypothesis test?) So once you do the study, you are still surrounded by fog: are you in the world where the true mu=0, or are you in the world where the true mu has the sign and magnitude you expect it to have? The p-value might tell you mu!=0, but what you really want to know is whether the true mu has the sign and magnitude you expect it to have. The p-value answers the wrong question, one you didn’t really want an answer to.

Even if you think you have high power, you would still want to replicate your result to confirm that you can get a robust result. And there the informative thing is not the p-value but the replicability of the result.

Sabine replied to Shravan,

“My interpretation of Andrew’s comments is that even if all experiments involved your situation c, he would still reject null hypothesis testing”

As Shravan said, we don’t know whether a given experiment fits situation a, b, or c, so your question is not one about the real world.

But would Andrew still reject null hypothesis testing even if the world were a fantasy one where all experiments were in situation c? I don’t really know – I can’t read his mind. He might, because one of his objections to null hypothesis testing in the real world is that it posits a “yes, no” situation. But he might not, if indeed the real world were a simpler one. But the question is really moot, since we live in the real world, not in a fantasy one.

Sabine: You are not completely alone – for instance see the 9_Greenland_Senn_Rothman_Carlin_Poole_Goodman_Altman.pdf in the ASA supplement http://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108

Now, their explanation for not abandoning them seems to be the lack of a currently agreed-upon better alternative (i.e. growing old is not great, but better than the alternative).

For instance, this comment “Bayesian statistics offers methods that attempt to incorporate the needed information directly into the statistical model; they have not however achieved the popularity of P-values and confidence intervals, in part because of philosophical objections and in part because no conventions have become established for their use.”

There really needs to be conventions [understanding] established for their use.

I do believe that if no alternative is well sorted out and explained widely to statisticians and others, then in X years we will just have another guide to misinterpretations, as long as this one or longer. (The issue is getting at what C.S. Peirce called the pragmatic grade of understanding of the concept, beyond just the proper definition and identification of it.)

Thanks! I’ve learned a bit about Bayesian statistics (hoping to learn more) and I can see how this approach may be of more value than null hypothesis testing, but I believe Andrew suggests that using Bayes factors is no better than using p-values.

I understand that much of the dissatisfaction with using p-values arises from how they are misused, but my point is that when used properly they do seem to be informative, and I think the logic behind null hypothesis testing is elegant (if often misconstrued). Again, what I’m really trying to understand is why someone would argue that they have no good use even under ideal circumstances (the straw man null hypothesis comment).

Hi Sabine, you write

” in my hypothetical example I’m using the p-value to test whether a difference in a predicted direction would be highly surprising/improbable under the null. I’m assuming I have good power to detect the effect.”

Presumably you’d have to have looked at previous work to estimate power. Due to the existence of Type S and M error, your estimate could be a wild overestimate (people often don’t publish a result if it comes out significant but in the wrong direction—this is how people manage to have 30-40 year long careers with consistent results that nobody else can replicate). So your assumption that you have high power is not a certainty but just a hope. That’s why you’re still left face to face with a p-value and the three situations (a), (b), (c), and you don’t know which possible world you are in. The single p-value and the associated null hyp test will tell you nothing about the actual hypothesis for this reason. Replications will (but they don’t need p-values, just consistent outcomes).
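Shravan’s point about inflated power estimates is easy to see in a quick simulation. The sketch below uses entirely made-up numbers (true effect 2, standard error 2.74, i.e., a genuinely underpowered study) and shows that the estimates surviving the p < .05 filter overstate the effect’s magnitude (Type M error) and occasionally get the sign wrong (Type S error):

```python
import numpy as np

rng = np.random.default_rng(1)

true_mu = 2.0     # true effect (made-up units)
se = 2.74         # standard error of the estimate (e.g., sd = 15, n = 30)
n_sims = 100_000  # many hypothetical replications of the same study

est = rng.normal(true_mu, se, n_sims)   # estimate from each replication
sig = np.abs(est) > 1.96 * se           # the replications that reach p < .05

print("actual power:", sig.mean())
print("Type S rate (wrong sign | significant):", (est[sig] < 0).mean())
print("Type M exaggeration (|estimate| / true):", np.abs(est[sig]).mean() / true_mu)
```

With these numbers the real power is only around 10%, yet any published (significant) estimate overstates the true effect by more than a factor of two — so a power analysis based on the published literature would itself be optimistic.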

In response to Sabine’s 10:44 am comment, where she said:

“I’ve learned a bit about Bayesian statistics (hoping to learn more) and I can see how this approach may be of more value than null hypothesis testing, but I believe Andrew suggests that using Bayes factors is no better than using p-values.”

Yes, Andrew does not support using Bayes factors. I believe this is at least in part because they still have the dichotomous nature of p-values, but I believe there are other problems with them as well (but don’t recall at the moment what they are). My understanding is that his preferred approach is to model each problem individually, and use the posterior to help understand what is going on and make decisions.

Also, your comment “what I’m really trying to understand is why someone would argue that they have no good use even under ideal circumstances” seems to neglect the reality that ideal circumstances just do not seem to exist in real life problems.

Sabine, plug in a number like 2.49841 into the inverse equation solver here: http://mrob.com/pub/ries/

After a few seconds, you should see about two dozen relatively simple algebraic equations that approximate that number to within 10%. If that value were an experimental result, there could be multiple theories that allow deduction of each of those algebraic equations. Now imagine if we allowed all algebraic functions that were consistent with a value in the same direction (i.e., positive or negative). How many different theories would be consistent with the result then?

Here is my point: To distinguish between different explanations for an observed pattern/result, we need both theories making precise predictions and precise measurements. The “alternative hypothesis” that usually maps to the research hypothesis is too vague to be of any use in distinguishing between different explanations for the observations. Default-nil-NHST is of no use to the scientist who wants to distinguish between different plausible explanations. This is innate to that procedure, making it fatally flawed.

While it is not an innate property of NHST, the worse problem is that in practice this procedure allows proliferation of vague pseudosciency theories and discourages collection of precise observations. Researchers never collect precise observations, because they only feel the need to rule out the “null” hypothesis. Theorists can’t distinguish between their theories, because the observations are too imprecise (and biased because only results where a large difference between two conditions was observed are published).

Sabine,

Anoneuoid’s last two paragraphs make some good points. To see some instances of them, you might try reading Measuring and Reasoning, by Fred Bookstein.

Martha,

Thanks for alerting me to Bookstein. There is an interesting teaser here:

“In place of p-values there is an unusual concentration on crucial details of measurement—where suites of variables come from, how calibration of machines can maximize their reproducibility—that are almost always overlooked in textbooks of statistical method. The exceptions to this generalization, such as the 1989 book by Harald Martens and Tormod Næs on multivariate analysis of mass spectrograms or (I note modestly) my 1991 book on the biomathematical foundations of morphometrics, are rare but, when successful, prove to be citation classics partly by reason of that rarity.

But they all share one central rhetorical concern: consilience (Dogma 6), the convergence of evidence from multiple sources. Now consilience requires a relatively deep understanding of the way that such multiple sources relate to a common hypothesis. To have a reasonable chance of making sense in these domains we must take real (physical, biophysical) models of system behavior (the organism on its own, and the organism in interaction with our instruments) as seriously as we typically take abstract (statistical) models of noise or empirical covariance structure. Serious frustrations and paradoxes can easily arise in this connection. Over in the psychological sciences, Paul Meehl once wisecracked that most pairs of variables are correlated at the so-called “crud factor” level of ±0.25 or so. It is this correlation, not a correlation of zero, that represents an appropriate null hypothesis in these sciences. Closer to home, in my own application domain of morphometrics, landmark shape distributions are never spherical in shape space. The broken symmetries are properly taken not as algebraic defects in our formulas, but as biological aspects of the real world; they are signal rather than lack of fit. A few years ago Kanti Mardia, John Kent and I published a quite different model for a-priori ignorance in morphometrics, an intrinsic random field model in which noise is self-similar at every scale. I am still awaiting news that somebody has tried to fit their data to that.”

https://www1.maths.leeds.ac.uk/statistics/workshop/lasr2010/proceedings/L2010-05.pdf

Anoneuoid,

Nice quote and link. The Meehl idea for an appropriate null hypothesis is a gem.

Thanks for your elaboration of this point. I think the problem I am having is that the discussion is divorced from specific examples of what I currently think are defensible uses of null hypothesis testing that yield information worth having, and thus it’s difficult for me to evaluate if the argument strongly implies that p-values are worthless.

So here is a less controversy-laden hypothetical. I have a hypothesis that selective sustained attention is mediated by verbal representations. So I set up an experiment in which individuals complete several trials on which they are instructed to track a novel moving target shape in the midst of distractors and indicate where it was before it disappeared from the screen. An experimental group is taught names for the target shapes, while the control group is familiarized with the shapes but receives no label training. I predict the experimental group will be better on average at tracking the target shapes than the control group. Assuming a difference in the predicted direction, I test whether the difference is surprisingly large under the null hypothesis. Assume my sample is a healthy size, capable of detecting the effect if there is one. And also assume there were attempts to measure the dv with precision (e.g., using multiple observations, etc.). And lo and behold, p is less than .05.

Can you tell me why this is not informative? Sure, one could say that there may be other reasons than the labels that could explain why participants do better in the labeling condition, but careful design/matching of conditions on all aspects other than the manipulation could minimize this possibility. One could also say this isn’t a test of a well-developed theory, or a comparison of theories, and therefore not an interesting contribution, but I think this is debatable. I may be really interested in whether language plays a role in such cognitive processes, and this seems to be a suitable framework that allows me to test the question I’m interested in.

Sabine,

(I got this in the wrong place in the thread, so here is another better-placed try.)

In your March 11, 10:30 pm comment, you say, “I test whether the difference is surprisingly large under the null hypothesis.”

How do you propose to test for this?

Sabine, you wrote:

“I predict the experimental group will be better on average at tracking the target shapes than the control group. Assuming a difference in the predicted direction, I test whether the difference is surprisingly large under the null hypothesis. Assume my sample is a healthy size, capable of detecting the effect if there is one. And also assume there were attempts to measure the dv with precision (e.g., using multiple observations, etc.). And lo and behold, p is less than .05.

Can you tell me why this is not informative?”

Let’s say you expected mu to be positive.

Two possible scenarios with p<0.05:

1. Your sample mean is positive in sign and you get p<0.05. You can rule out (with alpha probability of being wrong under hypothetical repeated runs of the experiment) that there is no effect (i.e., mu=0).

This is now a publishable result.

2. Your sample mean is negative in sign and you get p<0.05. You can rule out (with alpha probability of being wrong under hypothetical repeated runs of the experiment) that there is no effect (i.e., mu=0).

This is no longer publishable as your hypothesis was not supported by the sample means.

The p-value in both cases gave you the same information (mu!=0, possibly), but the decision as to whether you have support for your particular hypothesis doesn't come from the p-value at all. In (1) we would be happy, and in (2) we would be sad. For the same p-value.

Maybe read Gelman et al. on Type S and M errors and run some simulations to understand what this really means for your studies. This stuff is not just theoretical: as an exercise, for your kind of research question, try running the same experiment five times with real subjects (literally the same setup, different subjects from the same subject pool) and watch the means flip-flop. That is what is happening to me. I take a published result in a major journal, replicate it as exactly as I can, and get the opposite pattern or a mean close to 0. Even my own experiments’ results flip-flop all over the place. Harvard professors might suspect that I just don’t know how to do experiments (I don’t work at a prestigious university, i.e., not in the US). It’s possible. So try it out yourself.
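To make the “run some simulations” suggestion concrete, here is a minimal sketch with invented numbers: a small true difference of 5 units against between-subject noise of 100, with 20 subjects per run, “replicated” five times.

```python
import numpy as np

rng = np.random.default_rng(7)

true_mu = 5.0   # small but real difference between conditions
sd = 100.0      # between-subject variability
n = 20          # subjects per replication

# Five exact replications of the same experiment, same subject pool
for i in range(5):
    diff = rng.normal(true_mu, sd, n).mean()
    print(f"replication {i + 1}: observed mean difference = {diff:+.1f}")
```

The sampling standard deviation of the mean here is about 22, more than four times the true effect, so the observed sign flips from run to run even though nothing about the experiment has changed.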

Here is a defence of p-values:

http://www.r-bloggers.com/its-not-the-p-values-fault-reflections-on-the-recent-asa-statement-relevant-r-resources/

Shravan,

Thanks for the link to the R-Blogger piece. I would not call it a defense of p-values so much as a caution not to interpret the ASA report as saying don’t use p-values — in particular, pointing out “responsible” ways of using p-values, and that the cautions about p-values also apply to many other statistical techniques.

The problem is, the recommendations of statisticians are really hard to put into practice

molecular biology consists largely of moving little drops of water (or 99.9% water with buffer or protein or DNA or whatever) from tube to tube, often in sets of ten or twelve, tube one is condition one, tube two is replicate, tube three is variation on one, etc

To keep sane, and make sure you do not make a mistake, people do tube one, tube 2….

A stats person would say to add water to the tubes randomly, but then you would never do the experiment right, since these are all done totally by hand.

any help here ???

E:

That reminds me of a story that I’ll have to tell here sometime . . . Anyway, the short answer to your question is that it’s not necessary to add water to the tubes randomly; you should just include tube number as a regression predictor in your analysis. We discuss this sort of thing in chapter 8 of BDA3 and chapter 9 of ARM.
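A sketch of what “include tube number as a regression predictor” might look like, with entirely invented numbers: twelve tubes pipetted in order, a hypothetical linear drift over the run, and conditions alternating across tubes. Ordinary least squares with tube number in the model recovers the condition effect without randomizing the pipetting order:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tubes = 12
tube = np.arange(n_tubes)   # pipetting order: 0, 1, ..., 11
condition = tube % 2        # alternate control (0) and treatment (1)
# true condition effect 2.0, hypothetical drift 0.3 per tube, noise sd 0.5
y = 2.0 * condition + 0.3 * tube + rng.normal(0, 0.5, n_tubes)

# OLS fit: y ~ intercept + condition + tube number
X = np.column_stack([np.ones(n_tubes), condition, tube])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated condition effect:", beta[1])
print("estimated drift per tube:  ", beta[2])
```

Because condition alternates with tube order, the drift and the condition effect are separately estimable; the model soaks up the order effect instead of requiring random pipetting.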

Sabine wrote:

“I have a hypothesis that selective sustained attention is mediated by verbal representations. So I set up an experiment in which individuals complete several trials on which they instructed to track a moving target novel shape in the midst of distractors and indicate where it was before it disappeared from the screen. An experimental group is taught names for the target shapes, while the control group is familiarized with the shapes but receives no label training. I predict the experimental group will be better on average at tracking the target shapes than the control group.”

http://andrewgelman.com/2016/03/07/29212/#comment-265856

First, let’s get rid of the statistical aspect of this problem. Forget p-values, Bayes factors, and all of that. Assume we know for a fact, with 100% confidence, that the experimental group always performs better than the control group under these conditions. The problem is that there are other explanations you need to address. Further, this list is going to be, for all practical purposes, endless, because your prediction is too vague. Here are a few:

1) Having names for the target shapes makes them more memorable and thus easier to track.

2) The “instructor” verbally assigning the labels spends more time on those that are on the screen most often (or otherwise leaks information somehow).

3) The process of familiarizing the control group actually confuses them, or tires them out, etc

4) The names given by the instructor are shorter or more memorable than the labels each subject would “self-assign” on average

5) The experimental group gets more “training” with the shapes because it takes extra time to verbally assign the labels

Some of these alternatives may be interesting in their own right, but others would just be boring experimental artifacts. So p under .05 does not mean anything interesting is going on. On the other hand, if the difference between the two groups was exactly zero on average, that would also be an interesting result. How is it that people perform so consistently on this task?

P-values are simply representative of a wide range of issues that pertain to real-world statistical analysis, since alternative statistical approaches will never be completely robust to misinterpretation, even if they are better. These approaches therefore need to be paired with a real and concerted effort to improve education on issues such as study design and inference. We also need to create better incentive structures for researchers to ensure that the same mistakes are not repeated.

Participants in this discussion may be interested in what the eminent statistician and probabilist David Freedman had to say about this decades ago:

http://www.math.rochester.edu/people/faculty/cmlr/Advice-Files/Freedman-Shoe-Leather.pdf

(see also some of his references)

I will not comment further.

Cheers,

Bert

Bert:

Freedman was a good writer, even if much of what he wrote made no sense.

Many thanks for your excellent forum.

An observation on the problem of teaching P-values. I come to this from oceanography/environmental science. Scientists in my community have been so indoctrinated into the P-value concept that they often stop thinking. A colleague just took a college stats course, taught by an ecologist, and P-values and null hypothesis testing were presented as essential.

P-values are too common, even when irrelevant. But the worst part is that many of our observational data sets are not samples; they are actually the population. So a P-test is applied to “How much is rainfall increasing in NY over the last 30 years?” or “What is the relationship between satellite biomass and temperature over Long Island Sound in the last 15 years?” This use of P-values to include or exclude complete data sets is common and accepted. (Good commentary by Nicholls, 2001, Bulletin of the American Meteorological Society.)

I see papers accepting ecologically insignificant trends because p < 0.05, and accepting trends that are spurious because least squares regression should not have been used (e.g., an El Niño producing an outlier at one end of the data set) but p < 0.05 anyway. And then there are the data mining papers.

I’ve tried to work on this one review at a time, asking people to report effect sizes, misfit, or uncertainties, to deemphasize or delete p-values, and to use appropriate statistics.

(Unfortunately, I’m also self-taught, and pretty much a hack.) The teaching needs to change, and the editorial practices at the journals I know of also need to change. If the ASA makes a bigger deal about the elementary stuff, that might help.

Yes, it’s not just that teaching is often poor; textbooks often teach poor practices too. Feel free to use any of the stuff I’ve got posted at http://www.ma.utexas.edu/users/mks/ that might be helpful; in particular:

May 2015 SSI Course (or 2016 when I get it posted in late May): Common Mistake in Using Statistics: Spotting Them and Avoiding Them.

Biostatistics Guest Lecture

M358K Instructor Materials

Blog: Musings on Using and Misusing Statistics

Martha,

I saw the course has been posted. Common Mistake in Using Statistics: Spotting Them and Avoiding Them.

Many thanks for the link, a lot of good information.