Demis Glasford does research in social psychology and asks:

I was wondering if you had ever considered publishing a top ten ‘do’s/don’ts’ for those of us that are committed to doing better science, but don’t necessarily have the time to devote to all of these issues [of statistics and research methods].

Obviously, there is a lot of nuance in both methods and stats for any particular project. So, I’m not asking you for a ‘one size fits all’, but more of a 5 or 10 factor checklist as a framework for those of us committed to doing better work, but worried we may not have the expertise or time to follow-through on these commitments. Sort of a—whatever you do, at least do x, y, and z.

I looked up Glasford on the internet and found this description of his research:

The focus of much of my work is on three interrelated streams of research questions that are concerned with understanding: how people make decisions about what to do when faced with injustice; what compels people to join and stay involved in political protest that can benefit their own group, as well as groups they do not belong to; and when, why, and what helps individuals from groups of differing power improve relations with one another.

Wow—this sounds important. I should talk with this guy.

In the meantime, do I have a checklist of 10 items? I’ve given advice to psychology researchers from time to time but I don’t have a convenient list of 10 things.

But we should have such a list! Can you make some suggestions in comments? Also, if anyone out there is in contact with any leading social psychologists, maybe we could get their thoughts too? There’s a lot I disagree with in the writings of, say, Susan “terrorists” Fiske or Daniel “shameless little bullies” Gilbert or Mark “Evilicious” Hauser or all the other people you’re sick of hearing about on this blog—but, say what you want about these people, they’ve thought a lot about psychology research and I’d be interested in what *their* top 10 tips would be. Not tips on how to get published in PNAS or wherever, but tips on doing better science.

Unfortunately I don’t expect we’ll hear from the above people (I’d be happy to be surprised on that end, though!), so in the meantime I’d love to hear your thoughts.

OK, I’ll start with items #1, 2, and 3 on the top-10 list, in decreasing order of importance:

1. Learning from data involves three stages of extrapolation: from sample to population, from treatment group to control group, and from measurement to the underlying construct of interest. Worry about all three of these, but especially about measurement, as this tends to be taken for granted in statistics discussions.

2. Variance is as important as bias. To put it another way, take a look at your (estimated) bias and your standard error. Whichever of these is higher, that’s what you should be concerned about.

3. Measurement error and variation are concerns even if your estimate is more than 2 standard errors from zero. Indeed, if variation or measurement error are high, then you learn almost nothing from an estimate even if it happens to be “statistically significant.”
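To make item 3 concrete, here is a minimal simulation sketch. The numbers (a true effect of 2 measured with a standard error of 8) and the function name `significance_filter` are assumptions chosen for illustration, not taken from any study. It draws many noisy estimates, keeps only those more than 1.96 standard errors from zero, and reports how much the surviving estimates overstate the true effect and how often they have the wrong sign.

```python
import random
import statistics


def significance_filter(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate estimates ~ Normal(true_effect, std_error) and look only
    at the ones that would be called statistically significant."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)

    # Share of simulations that reach significance (the power).
    power = len(significant) / n_sims
    # Average overstatement of the effect size among significant results.
    exaggeration = statistics.mean([abs(e) for e in significant]) / abs(true_effect)
    # Share of significant results whose sign is opposite to the truth.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, wrong_sign


# Hypothetical noisy study: small true effect, large standard error.
power, exaggeration, wrong_sign = significance_filter(true_effect=2.0, std_error=8.0)
print(f"power={power:.3f}  exaggeration={exaggeration:.1f}x  wrong sign={wrong_sign:.2f}")
```

Under these assumed numbers, any estimate that clears the significance threshold has to be several times larger than the true effect, which is the sense in which a "significant" result from a noisy study tells you very little.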

OK, none of the above are so pithy, and I’m open to the idea of other items bumping these down the list.

It’s your turn.

How about publish your code (and if possible, your data)? It’s not directly “statistical” per se, but even before you get to anything about methods, you’ve already lost the game if you’ve made an undiagnosed error in your coding or cleaning. Publishing the code makes it easier for others to check your work, and knowing that it will be public means you put in more time making sure it is readable and correct.

+1

Make the full data/estimation trail publicly available; while this looks like downside risk, the advantage is that you don’t have to respond to anyone who simply speculates about your results.

“I was wondering if you had ever considered publishing a top ten ‘do’s/don’ts’ for those of us that are committed to doing better science, but don’t necessarily have the time to devote to all of these issues [of statistics and research methods].”

1. Design/ create/ be part of a system in which you would have the time to devote to all these issues (related to doing better science), because they are seen as being of primary importance?

This is how you end up with people thinking things like “p less than 0.05 therefore the result is real” and then blindly pumping out reports of such discoveries. They are so busy reading about pointless studies, running pointless studies, and applying for grants to run pointless studies that they never get a chance to think about what they are doing. This is institutional (and world-wide afaict), so:

1) Devote time to understanding the logic behind drawing conclusions from data and designing useful studies.

Right. It’s fine to say that you aren’t going to go deep into the specifics of how to fit certain complicated models or whatever, but at the basic level of understanding, you need to know what does and what does not logically follow from the information you have available to you. Abandoning the broken logic of straw-man NHST is the first step, the next step is understanding what to replace it with.

Yes, it really is not complicated. It is almost at the point of common sense.

I do understand the difficulty people have accepting the problem with NHST* though. It was only after multiple years searching for a reasonable argument in its favor that I finally accepted there is nothing of value there.

The difficulty is in accepting so many highly educated people (often including themselves) have been wasting their careers like this, not with understanding the actual issues. Those I (at least loosely) grasped immediately upon being taught NHST, leading me to think I may have been going insane… until I discovered Paul Meehl, who so clearly explained the problem.

*Usual Disclaimer: This refers to the most common use case where the “null” hypothesis is not predicted by any theory. When the “null” hypothesis is predicted by theory, that is a completely different procedure which needs to be considered separately.

4. Learn from the data mining people. They know they are going to go down the garden of forking paths, and so split the data into a Training set and a Test set (sure, 10-fold cross-validation is cooler, but let’s not get ahead of ourselves here).
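A minimal sketch of the training/test idea, using pure-noise data so that any "finding" is known to be spurious. Everything here (the sizes, the seed, the helper names `pearson` and `best_predictor_train_vs_test`) is an assumption made for illustration. The code searches 50 noise predictors for the one most correlated with the outcome on the training half, then checks that same predictor on the held-out half.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_predictor_train_vs_test(n_rows=200, n_predictors=50, seed=0):
    rng = random.Random(seed)
    # Outcome and predictors are independent noise: no real effects exist.
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    predictors = [[rng.gauss(0, 1) for _ in range(n_rows)]
                  for _ in range(n_predictors)]

    # Random split into a training half and a test half.
    rows = list(range(n_rows))
    rng.shuffle(rows)
    train, test = rows[: n_rows // 2], rows[n_rows // 2:]

    def corr_on(index, predictor):
        return pearson([predictor[i] for i in index],
                       [outcome[i] for i in index])

    # The forking-paths step: pick whichever predictor looks best on train.
    best = max(range(n_predictors),
               key=lambda j: abs(corr_on(train, predictors[j])))
    return best, corr_on(train, predictors[best]), corr_on(test, predictors[best])


best, train_r, test_r = best_predictor_train_vs_test()
print(f"predictor {best}: train r = {train_r:.2f}, test r = {test_r:.2f}")
```

The training correlation is the maximum over many tries, so it tends to look impressive even though nothing is there. The test correlation is a single untouched check on the same predictor and has no such selection built in.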

Don’t have high school students collect data for your multimillion-dollar governmental program.

I can’t help thinking this is a reference to some specific study that went wrong.

But I disagree with the principle. Properly vetted, trained and supervised, there is no reason that high school students can’t do an excellent job at collecting appropriate types of data. In fact, there are some types of data collection I can imagine where they would be better suited to doing it than any other group.

Yes, honestly I think whatever Jordan was thinking of was probably, at its root, better summarized by “don’t have highly trained PhDs in Consumer Behavior with professorships at Cornell supervise students”

I was specifically referring to this paper: http://www.sciencedirect.com/science/article/pii/S0091743512003222, where it is stated in the abstract:

“The scalability of this is underscored by the success of Study 2, which was implemented and executed for negligible cost by a high school student volunteer.”

For the record, they haven’t released the data so I don’t know if there are problems with the data collected by the high school student. However, there are serious issues with every data set the lab has released, and those data sets appear to have been collected by undergraduates (who are not authors on the papers).

My intention was not to disparage undergraduates, or high school students. Even elementary school science fairs likely conduct more useful experiments than Cornell.

Right, in fact, we might improve science a lot by firing wide swaths of PhDs and replacing them with really motivated high school students who haven’t yet had the logic systematically beaten out of them.

That was my point, you’re likely pointing the finger at the wrong entity. It was the PI not the student that broke those studies.

I think this is a good point, and it may lead to a better rule of thumb (though I did enjoy OP’s comments). Namely, be involved in the data collection process, which doesn’t have to mean direct involvement, but you should at least be cognizant of who is doing what and what resources/guidance are available to people (maybe this is just advice for running any sort of lab).

Do not underestimate the importance of accurate reporting, which is a whole spectrum of dos and don’ts on its own: Clarify in the text why you are doing the science, and clarify why you are using the methods you use. Use tables and/or figures to complement your text. Write in an engaging way yet avoid mixing up findings with personal interpretations. Check your report many times because typos and out of date tables/figures/statements can undermine readability.

Do write with the aim of exposing your ideas to the criticism of your peers.

Don’t resent and/or ignore your critics.

Connected to this: Be curious and skeptical! Always keep an open mind for the idea that what you think you know and what you take for granted might be wrong; be very curious about ways your own ideas and results might be wrong or misleading. (This may lead you to either change your views at some point; or to numerous failed attempts to make your results fall, in which case they’ll come out much stronger.)

“Don’t resent and/or ignore your critics.” Invite them and embrace them!

> “Don’t resent and/or ignore your critics.” Invite them and embrace them!

If you make that rule 0 and folks actually embrace it then almost any list should do.

+1

+2

Don’t tell lies. If you know of measures/studies that do not support your hypothesis, do not ignore them. If you generated your theory after looking at the data, do not write that your results were predicted by your theory.

Perhaps modify this to start, “Don’t tell lies. This includes lies of omission….”?

I really don’t think actual lies, the kind that researchers themselves think of as lies, are the main problem. This is the distinction between p-hacking and forking paths. There are definitely some liars like Hausman or Weggy but most people who produce bad research are honestly looking for real stuff, they just pursue it in an ineffective way.

Possibly a better way to put Greg’s suggestion would be, “Be intellectually honest.” This (to me) includes maintaining an active effort not to inadvertently tell lies — in particular, viewing all techniques with skepticism, trying to understand when they do or do not apply, asking how sound the evidence for using them in a particular situation is, making a serious effort to understand criticisms rather than dismissing them, and acknowledging when you have made a mistake.

It also includes looking for sources of uncertainty, and acknowledging them and their possible effect on conclusions.

+1 On a related note, if you do not know what you are doing, then do not pretend otherwise. Meaning, if you do not know which measure is important for your hypothesis or do not know how to distinguish relevant from irrelevant studies, then you cannot be properly testing your hypothesis. It’s part of being intellectually honest. What many psychological scientists, perhaps, do not realize is that we are all in this situation some of the time.

Well, yes, Greg, but there’s this problem:

https://www.youtube.com/watch?v=wvVPdyYeaQU

Could almost be our very own Nick Brown ….

I often use a more generous version of Cleese’s message, saying that someone is clueless that they’re clueless.

Yes, that is a challenge. We can only try to correct them (and try to recognize that sometimes we need correcting ourselves). Even so, I think there is tendency in psychology for people to (knowingly) present their results as more solid than is appropriate.

To counter this tendency, we may need to allow people to describe the limitations of a study. Other times, it means we should ask people to keep working on a topic for a few more years before publishing a really good study. The latter approach is hard to follow in the current academic environment, though. Search committees pay attention to the number of publications, and they have a difficult time judging the quality of the work.

Greg said “I think there is tendency in psychology for people to (knowingly) present their results as more solid than is appropriate.”

Sad to hear, but I believe you, since you are in the field and I am not.

A corollary:

Don’t expect to be believed. Present your results as if skepticism, not credulity, is the natural posture of your reader.

The curious thing about liars is that they expect you to believe them.

+1

Theories should guide research, not *just* raw data.

The projects that I consulted for that I was most impressed with were the ones in which the researcher was deeply entrenched in figuring out the mechanisms they were investigating yet had very limited statistical capabilities and needed a statistician to present their findings in a way reviewers would accept. This also often involved discussing which measures should be most (a) related to the mechanism of interest and (b) stable.

The projects that seemed most troubling to me were the ones in which the researcher came to me with a huge pile of data and said “we’re going to go where the data takes us”.

But this cuts both ways. Starting with a theory is often a good way to use forked paths to “prove” what you already thought was true. Sometimes following the data is a healthy protection against many abuses. I suspect neither strategy is sufficient by itself. Those that start with theories need other ways to protect against “torturing the data until it confesses,” while those that start with the data need to protect against being “fooled by randomness.”

+1

+ 1

There’s a saying/joke I heard in a graduate theory course in the context of warning against HARKing: “Nobody but the author believes the theory, everybody but the author believes the analysis.”

And the theory has to be specific enough to actually be testable. The problem with papers like fat arms is that the theory is so vague that you can’t really test it, not that there is no theory at all (in that case the theory seems to be “people are idiots”).

+1/2 “Testable” needs to be defined clearly so that people don’t think it means something like, “Oh, I can just gather data with my own choice of measure and do a hypothesis test.”

My professor used to hang an article on his wall that had four or five suggestions about doing good science. The one that really struck me was the emphasis that one should inspect the raw data whenever possible. That helps build an understanding of the limitations of the measurements and possible sources of error.

1. Be willing to change your mind about an effect or mechanism on the basis of high quality data.

2. Map the effect of data analytic decisions (i.e., report a multiverse analysis).

3. When doing any kind of covariance modeling/SEM report the fit and residuals of all plausible alternative models.

4. Abandon discrete rules-of-thumb that represent arbitrary lines in the sand that supposedly distinguish “good” from “not good” (e.g., p-values, reliability estimates, fit statistics, sample size ratios).

5. If using a confirmatory approach preregister not just your hypotheses but also a list of all variables collected and the planned sample size.

6. If you primarily do observational/correlational research try to integrate some experimental methods into your research (and vice versa).

7. Consider how situations and individual differences interact to influence behavior. Too many psychologists still primarily study only one of these.

8. Be open to inductive methods

9. Recognize that 99% of all research ideas are not particularly good and that you need to be willing to let go of most of the “insights” that were inspired by that really good bottle of Malbec.

10. Don’t blindly trust the work done by your collaborators on a joint project. Double-check everything.
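For item 2 in the list above, a multiverse analysis can be sketched as a loop over every combination of defensible analytic choices. The data here are simulated, and the particular choices (three outlier rules, raw versus log scale, mean versus median difference) are assumptions for illustration only. A real multiverse would use the choices that are actually plausible for the study at hand.

```python
import itertools
import math
import random
import statistics


def make_data(n=100, seed=2):
    """Simulated positive-valued outcomes for two hypothetical groups."""
    rng = random.Random(seed)
    control = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
    treated = [rng.lognormvariate(0.1, 0.5) for _ in range(n)]
    return control, treated


def trim(values, n_sd):
    """Drop points more than n_sd standard deviations from the mean."""
    if n_sd is None:
        return values
    center, spread = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if abs(v - center) <= n_sd * spread]


def multiverse(control, treated):
    outlier_rules = [None, 2.0, 3.0]
    transforms = {"raw": lambda v: v, "log": math.log}
    summaries = {"mean": statistics.mean, "median": statistics.median}

    results = []
    for rule, (t_name, transform), (s_name, summary) in itertools.product(
            outlier_rules, transforms.items(), summaries.items()):
        c = [transform(v) for v in trim(control, rule)]
        t = [transform(v) for v in trim(treated, rule)]
        # Raw-scale and log-scale differences are in different units, so
        # compare signs and relative sizes within a scale, not across.
        results.append(((rule, t_name, s_name), summary(t) - summary(c)))
    return results


control, treated = make_data()
for spec, estimate in multiverse(control, treated):
    print(spec, round(estimate, 3))
```

Reporting the whole table, rather than the single specification that looks best, lets readers see how much the conclusion depends on decisions that could reasonably have gone another way.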

Pre-specify everything that can be pre-specified. And if that is not possible, do not confuse an exploratory dive into data with a pre-specified experiment that tested hypotheses that were clear up-front. If you can, even publicly register your experiment and your planned analyses up-front for added credibility.

1) If you’re a first generation scientist or other under-represented minority in science take these (and all) good science suggestions with a grain of salt and feel free to play the game. Mining noise is probably a waste of time and it’s nice to be all cutting-edge and reproducible but lots of your colleagues had family support to get them through volunteer CV-building lab gigs or un/under-paid data-science internships and these things will count as markers of accomplishment or commitment to science. You might need something else to get the same.

2) Don’t forget what counts as playing the game and what counts as science.

You do realize, don’t you, that if this appalling advice is followed, the result will be that no one will trust the work of first generation scientists or other under-represented minorities?

That’s completely untrue. People trust work produced to the standards of the field. In my experience people who don’t already have an ‘in’ of some sort are the ones most likely to have excessively lofty ideas of the standards of the field.

But if “playing the game” means publishing flashy results which are not in fact supported by data, or tinkering with data, or the like, doesn’t this strategy backfire at some point of “player’s” career?

I have thought about this and I do not see any real negative consequences for those who are “playing the game” even if they “get caught”. I am speaking of psychology/behavioural science, as I am most familiar with that. It looks to me like:

– “playing the game” is not even considered to be “bad” by a lot of people, although it seems like the view about this is changing.

– many institutions back their scientists

– many colleagues back their “friends”

– those that are being “caught playing the game” (i.e. the Wansink or Cuddy case?) are mostly senior researchers with tenure and/or influence who will experience no real negative consequences. They may state that they “have lost grants” or something like that, but this doesn’t really matter, I reason. They still have tenure or book-deals, and a nice paycheck to go with that.

– when all else fails, they could start talking about how everything is “context-dependent”, and that “more research is needed”, and stuff like that.

So in my reasoning, a) there are no clear rules of what is bad about “playing the game”, and “dubious” previous actions can all be accounted for by providing some (pseudo?) scientific reason, and b) those who “get caught playing the game” are being protected by the current system (i.e. tenure, influence of institutions and senior colleagues, people are given credit/taken seriously/listened to not because of what they say or do, but just because they are a professor) and suffer no real negative consequences.

Anon:

I think we should proceed on two tracks:

1. Reduce incentives to do bad science. Methods of achieving goal 1 include setting up norms of preregistration, estimation of type M and type S errors, and Bayesian or hierarchical models to pool noisy estimates toward 0. P-value thresholds have traditionally been considered as a method to reduce incentives to bad science! They haven’t worked so well (forking paths and all that), but we should remember that as one of the goals.

2. Help scientists do better. Methods of achieving goal 2 include better measurement and better connections of theory to measurement, larger sample sizes and integration of data from multiple sources, within-person designs, and multilevel models.

We need both.
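As a rough illustration of what "pooling noisy estimates toward 0" means in the simplest case: with a normal prior centered at zero and a normally distributed estimate, the posterior mean is the raw estimate multiplied by a weight between 0 and 1. The function name `shrink_toward_zero` and the prior standard deviation are assumptions for this sketch. In a real hierarchical model the prior scale would be estimated from the data rather than supplied by hand.

```python
def shrink_toward_zero(estimate, std_error, prior_sd):
    """Posterior mean and sd for a normal estimate with a Normal(0, prior_sd) prior."""
    # Weight on the data: near 1 when the estimate is precise relative to
    # the prior, near 0 when the estimate is very noisy.
    weight = prior_sd ** 2 / (prior_sd ** 2 + std_error ** 2)
    posterior_mean = weight * estimate
    posterior_sd = (weight ** 0.5) * std_error
    return posterior_mean, posterior_sd


# A precise estimate barely moves; a noisy one is pulled strongly toward 0.
for raw, se in [(10.0, 1.0), (10.0, 5.0), (10.0, 20.0)]:
    mean, sd = shrink_toward_zero(raw, se, prior_sd=5.0)
    print(f"raw={raw} se={se} -> pooled mean={mean:.2f}, sd={sd:.2f}")
```

The same raw estimate of 10 is treated very differently depending on its standard error, which is the point: a large but noisy estimate gets discounted rather than celebrated.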

Without addressing power structures in science the incentives to do bad science will never go away. When a grad student tells you to drop an inconvenient control (or your favorite field-specific variation, e.g., forking paths) so you can publish already, you can just laugh at them. When a senior colleague holding the purse strings does so and starts labeling you as “unproductive” it’s a different story (I mean, I think laughing is a fine policy but I can’t give that out as career advice).

Personally I think one of the ways to move in that direction is to create real early-career non-faculty positions rather than pretending “post-docs” are training positions.

Sure. But it seemed to me that Krzysztof above was addressing young researchers in the early stage of their careers. They are in a very different position than those with established careers. And like you said, it seems that the views are changing (probably slowly, but I can’t see us going back to the way things were before the replication crisis). Thus to me honesty & integrity seems like a better strategy.

(I’m also talking about psychology)

I’m saying these discussions about how to do better science can be confusing for people who don’t have the cultural training that comes with a privileged background (academic family, the right kind of wealth, established scientists for mentors, whatever). They should know that there are 1) standards of the field; 2) standards of the field 5 years from now; and 3) Andrew Gelman’s current standards. They will have to make compromises of some sort to succeed and all I want to communicate is that it’s fine for there to be a spread of #1-#3 in their work… __just like in everybody else’s work (including Andrew’s)__

Krzysztof:

Indeed, life would be pretty boring if all my own work lived up to my standards. If this were ever the case, I’d take it as a signal to raise my standards!

Andrew:

One of the things I enjoy about your work and blog is how straightforward you are about this stuff. OTOH your standards are pretty high so you might’ve reached the point of diminishing returns on raising them.

Hi Krzysztof,

One thing I am wondering is why people who recognize all the BS going on, and know that they are partaking in the BS, still want to be part of these research communities that demand it? It isn’t a job that necessarily pays well or anything, it has to be a constant stress to scientifically minded people to be knowingly producing BS… What is the motivation to stay in the community?

I think most communities have some BS trade-off and you have to find one you’re willing to work with. Uber sure pays technical people well (or so I hear) and they have some fun problems to tackle but the cultural BS and the HR department… oh boy. I wouldn’t take that trade-off. Also if you figure out the BS mid-stream when you’re part of a project, well, academia is a feudal system so you still need your advisor’s blessing.

Kinda depressing.

This doesn’t answer it, at least not for me. You seem to be saying there are office politics anywhere you go.

Ok, but dealing with office politics is totally different from a community telling you to literally produce BS or leave (my paraphrase of how you described research earlier). I.e., the BS *is* the job, not tangential to it. Also, consider that most people who go into research start out actually liking science and being against destroying it. There must be some more motivation I am missing.

Good points.

Avoid:

1. significosis, an inordinate focus on statistically significant results;

2. neophilia, an excessive appreciation for novelty;

3. theorrhea, a mania for new theory;

4. arigorium, a deficiency of rigor in theoretical and empirical work;

5. and finally, disjunctivitis, a proclivity to produce large quantities of redundant, trivial, and incoherent works.

See: On doing better science: From thrill of discovery to policy implications

https://doi.org/10.1016/j.leaqua.2017.01.006

Thanks for the list and the link. A quote worth quoting:

“Science is not necessarily self-correcting because the process of publication may not be properly managed by the publication gatekeepers. For whatever reasons, including failure to robustly replicate research, there are simply too many findings that become “received doctrines” (Barrett, 1972) or “unchallenged fallacies” (Ioannidis, 2012). We have to collectively make an effort to address this problem and to ensure that what is published is robustly done, tentatively accepted, and then actively challenged. Only then can we claim to approach the truth.”

Science is only asymptotically self-correcting.

Yes, science is asymptotically self correcting…but only if the estimator is consistent. And much of the time it is not. Problems include:

Using HLM (random-effects) incorrectly (Halaby, 2004; McNeish, Stapleton, & Silverman, 2016) or failing to partial out fixed effects by including the cluster means of the within variables (Antonakis et al., 2010; Rabe-Hesketh & Skrondal, 2008);

Comparing groups when selection to group is endogenous (Bascle, 2008; Certo, Busenbark, Woo, & Semadeni, 2016; Li, 2013) or based on a known cutoff but not modeling this assignment via a regression discontinuity design (Cook, 2008);

Estimating mediation models via Baron-Kenny or the Preacher-Hayes type methods, but not comparing estimates to an instrumental-variable estimator (Antonakis et al., 2010; Shaver, 2005)—bootstrapping an inconsistent estimate does not make it consistent;

Sampling on the dependent variable and not accounting for survival bias or other forms of selection effects (Denrell, 2003, 2005);

Using self-selected samples or so-called “snowball samples” (Marcus, Weigelt, Hergert, Gurt, & Gelléri, 2016) and not reflecting on how samples can produce bias (Fiedler, 2000);

Using partial least squares (PLS): editors are desk rejecting articles that use PLS (Guide & Ketokivi, 2015) and I will do likewise because PLS should never be used in applied work (Rönkkö, McIntosh, Antonakis, & Edwards, 2016). Authors should use 2SLS if they need a limited-information estimator;

Ignoring overidentification tests when significant (i.e., the χ2 test of fit) in simultaneous or structural equation models (Ropovik, 2015)—such models have biased estimates (Bollen, Kirby, Curran, Paxton, & Chen, 2007; Hayduk, 2014; McIntosh, 2007); indexes of fit (e.g., CFI, RMSEA) should not be trusted because they depend on idiosyncrasies of the model (Chen, Curran, Bollen, Kirby, & Paxton, 2008; Marsh, Hau, & Wen, 2004; Savalei, 2012).

The list goes on (for some checklists see Antonakis et al., 2010):

On making causal claims: A review and recommendations

https://doi.org/10.1016/j.leaqua.2010.10.010

I’d say, “Science is at best only asymptotically self correcting.”

I write as a journal editor. The following represent common reasons for rejection of papers. (The most common is sending them to the wrong journal, but that is a matter of presentation, not research itself.)

1. Before you do a study, imagine that you get the result that is consistent with your story about what is going on. Now imagine that someone else told you about the result, and the story about why it happens. And you don’t believe it. You say, “That story couldn’t be right! Instead, the result must be due to …” The “Instead” is called an alternative explanation. Thinking of these before the study is done is much better than letting reviewers tell you about them.

2. File drawers are usually considered to be bad places for results. But sometimes that is exactly where the results belong. You don’t have to publish everything you do.

3. Some results are more like manipulation checks than findings. Before you look for an experimental result, ask yourself, “If I don’t get this result, will I reject my hypothesis? or will I reject the experiment?” The latter happens when, in some sense, the hypothesis must be true. It may help you to see this if you imagine extreme cases of the variable you manipulate. That said, sometimes it is worthwhile to find a new, more sensitive, experimental method for demonstrating the obvious.

4. Before you look for a result, check the literature in fields other than your own. Psychologists and economists tend to ignore each other’s results, and both fields ignore the work of philosophers, which is sometimes relevant. And everyone tends to ignore research done more than 10 years earlier. Failure to do these things can lead to wheel re-invention, or square wheels.

Nice

I have respect for Philip Tetlock & his wife Barbara Mellers at U of Pennsylvania. Expert Political Judgment is a masterpiece, in my view.

I do have one suggestion. I would love to see a list of all the definitions of p-values & perhaps the context used. Maybe there is one undertaken already.

You mean, all the wrong definitions?

Writing clearly would help.

However, sometimes what seems clear is actually missing important details or qualifications.

Martha, by ‘clear’ I did not suggest that we should therefore exclude important details or qualifications.

I did not intend to imply that you suggested excluding important details or qualifications. My point is more that “clear” is a vague term that can be interpreted very differently by different people.

Do all the exploratory data analysis you want, but don’t publish the results without checking them first on a *different* data set.

Carolina:

This advice won’t work in fields such as economics and political science when working with historical data.

I’d be curious to hear more about what you think about this problem, since my sense is the social psych crowd (the subset committed to better science) tends to be pretty dogmatic about discounting results of non-pre-registered data analyses, which makes use of past datasets of high quality difficult to justify. To me, I think it comes down to a sound justification of the model specification and a smart interpretation of the effect size, not to say that isn’t always the case already.

Jacob:

In social psychology or cell biology where you can simply replicate the experiment, then, sure, I can see that it could be reasonable to require a preregistered replication.

Perhaps “don’t trust” rather than “don’t publish”?

Smell the data, and get down and dirty to see what’s really going on. In the late 60s I did an “experiment” in police patrol in two police districts in Boston. In the control district nothing changed. In the experimental district we doubled the number of patrol cars. As a result, the arrest rate shot up. What a finding — increased patrol dramatically increases arrests! That is, until I explored a little further.

I then talked to one of the arresting officers. He said, “We knew there were these car strippers in the district, so we took them out. It was like shooting fish in a barrel. We couldn’t wait to get back to policing the really rough areas.” You see, cops aren’t fungible. And to increase the patrol force in the experimental district, the tactical patrol team (the Marines of the department) were used.

Sounds like The Law of Unintended Consequences.

There are a lot of great general statements of principle here, but I wonder how well they’d work with those who we currently suspect of doing “bad” science, or at least “not good enough” science. I can see plenty of people agreeing to these principles “in principle” and then keeping on doing what they’re doing.

My favorite sobering “principle” I learned from reading this blog, and I now make a point of sharing it with anyone who wants to discuss their p-values:

When reporting a p-value, you are making the claim that you would have performed the *identical* analysis with *any* different set of data from the population to which you are drawing inference, or that any difference in analysis across samples would be determined by a consistently applied general principle accounted for in the sampling distribution from which your p-value is calculated. This means all data exclusion / coding / recoding / combining / transformation decisions were guided by a well specified general principle that would be applied consistently across new data sets. It means that the sample size was either set in stone with no chance of being changed upon inspection of the data, or it was guided by a well specified principle that would be applied consistently across new data sets. It means that the statistical model used for conducting this test would also be the one used for any other set of data, and the statistical test(s) conducted would be the same ones – and only ones – reported for any other set of data.

Otherwise the whole thing falls apart. I had never thought of the p-value as a claim about what would have been done if the data were different before I found this blog. Now that’s how I teach it to students and describe it to collaborators, and my impression is that it tends to sober people up.
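To make the "what would have been done with different data" point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: two groups with no true difference, a z-test with known unit variance, and an analyst who looks at the full sample and at each half of it and reports whichever p-value is smallest.

```python
import math
import random


def two_sided_p(a, b):
    """Two-sided z-test p-value for a difference in means.

    Assumes equal group sizes and known unit variance (an illustrative
    simplification, so the p-value is exactly uniform under the null).
    """
    n = len(a)
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate(n_sims=5000, n=40, seed=1):
    """Return false-positive rates for a fixed analysis and a forked one."""
    rng = random.Random(seed)
    half = n // 2
    fixed_hits = 0
    forked_hits = 0
    for _ in range(n_sims):
        # Null data: both groups come from the same distribution.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        p_full = two_sided_p(a, b)
        p_first = two_sided_p(a[:half], b[:half])
        p_second = two_sided_p(a[half:], b[half:])
        # Fixed analysis: always the full sample, decided in advance.
        fixed_hits += p_full < 0.05
        # Forked analysis: report whichever of the three looks best.
        forked_hits += min(p_full, p_first, p_second) < 0.05
    return fixed_hits / n_sims, forked_hits / n_sims


fixed_rate, forked_rate = simulate()
print(f"fixed analysis false-positive rate:  {fixed_rate:.3f}")
print(f"forked analysis false-positive rate: {forked_rate:.3f}")
```

The fixed analysis should land near the nominal 5%, while the forked one lands well above it, even though each individual test is perfectly valid on its own.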

It’s worth noting that all these complaints about p-values *also* apply to Bayesian inference. And not just Bayesian hypothesis testing.

A:

Selection bias is always a concern, but the issue raised by Ben—the issue of what would’ve been done had the data been different—is particularly relevant for p-values because they are *specifically* about what would’ve been done had the data been different. I don’t think other statistical procedures such as Bayesian inference are in general so sensitive to this particular issue. To put it another way: regressions, Bayesian inferences, classifiers, etc., do all sorts of things. A p-value is *only* a statement about what would’ve been seen had the data been different, so I think it’s particularly sensitive to forking paths.

Andrew:

True, but in the least, it’s a more transparent problem than a hidden problem.

To either illustrate my point or inform me how it’s not such a valid concern, how would you suggest a researcher perform an analysis in the following case: they collect their data and the relation they expected to see doesn’t materialize at first. However, 12 transformations later, they find a relation that does seem strong, on which they now run a Bayesian analysis.

I’m guessing you are less convinced of this result than if the relation they hypothesized in advance appeared. How, exactly, does this uncertainty creep into the Bayesian framework? The closest way that I could see is that given that it took 12 tries to find this transformation, it must a priori be a very unlikely transformation required for correct inference! So if the researcher is honest in their analysis, they should have a prior that puts very little weight on this transformation.

But I’m curious how you introduce a prior that really captures this in any sort of mathematically meaningful way, especially when the different transformations are all inspired by looking at the data itself, and they had no idea how many transformations they would run through until they found something they thought looked good? We all understand that p-values acquired with forking paths should not be taken too seriously…but do we have any way of quantifying the forking path effect in Bayesian analyses that’s any more rigorous in practice?

I would say that this problem is not purely academic: we have lots of commentators on this blog who state that multiple comparisons are just not an issue with Bayesian analyses…but I would challenge them to come up with a meaningful prior for a forking paths analysis!

The fact that the researchers tried twelve transformations before they found one that “worked” indicates that they consider lots of different possible transformations plausible *a priori*. An honest Bayesian would therefore construct a dispersed prior in the space of transformations, perhaps via a stochastic process of the sort typically used in Bayesian nonparametrics.

My life is un-closed html tags and sadness.

“An honest Bayesian would therefore construct a dispersed prior in the space of transformations, perhaps via a stochastic process of the sort typically used in Bayesian nonparametrics.”

But, in practice, does this happen? Is it done in any sort of meaningful way? I am asking from genuine curiosity.

Similarly, multiple comparison methods can be used in a very similar manner with something like an alpha spending function (“I’ll spend alpha = 0.025 on the linear relation, 0.0125 on the quadratic, etc.”)…but do we really think that the same researchers who did not use alpha spending functions for their multiple comparisons will come up with a realistic dispersed prior that honestly takes into account their uncertainty across transformations in any sort of meaningful manner? If you can cite a realistic (or even unrealistic) example of this in published literature, I would be interested to see it.
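For readers who have not seen the "0.025, then 0.0125, etc." idea written out, a rough sketch follows. It treats the schedule as a fixed split of alpha declared in advance (a weighted Bonferroni-style allocation, which keeps the total error rate below the overall alpha because the pieces sum to less than it). The function names and the example p-values are hypothetical.

```python
def halving_alpha_schedule(total_alpha=0.05, n_tests=5):
    """Spend half of total_alpha on the first test, a quarter on the next, and so on.

    The thresholds sum to less than total_alpha for any n_tests.
    """
    return [total_alpha / 2 ** (i + 1) for i in range(n_tests)]


def first_rejection(p_values, schedule):
    """Index of the first test whose p-value beats its pre-declared threshold, else None."""
    for i, (p, alpha_i) in enumerate(zip(p_values, schedule)):
        if p < alpha_i:
            return i
    return None


schedule = halving_alpha_schedule()
# Hypothetical p-values for the linear, quadratic, and cubic specifications.
p_values = [0.03, 0.01, 0.20]
print("thresholds:", schedule)
print("first specification to pass:", first_rejection(p_values, schedule))
```

The point of the schedule is that the order and the thresholds are written down before the data are seen; applied after the fact to whichever specification happened to look good, it protects nothing.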

Furthermore, there is exactly the “forking paths” issue that Gelman raises about p-values. Sure, you looked at two transformations (or maybe even just one!) before you decided you figured out how the relation was supposed to work…but if you only do something like a mixture model prior over those two transformations, it’s not taking into account what would have happened if you had not been satisfied with those two transformations and tried a third.

I don’t know if it happens in practice; 90% of everything is crap, so I would assume not. I was just addressing your desire to know how uncertainty about an appropriate transformation (demonstrated by the researchers’ willingness to explore different possibilities) can be captured by a Bayesian prior in a mathematically meaningful way.

In a Bayesian analysis with a function, it’s possible to introduce a large number of parameters and still retain control over the “shape” so that it doesn’t just fit through each data point. In this sense, you offer a lot of flexibility, but constrain it to some reasonable subset. You can do this with stochastic processes, but you can also do it with basis expansion and priors built on various dependency ideas eg. http://models.street-artists.org/2016/08/23/on-incorporating-assertions-in-bayesian-models/

Some models are easier to fit than others obviously…

I think there are plenty of examples where people fit nonlinear functions with a bunch of shape parameters using Bayesian methods, and in some sense this is the same thing (ie. you could transform something and fit a line, or you could fit a transformation to the raw data…)

On the other hand, I don’t have any kind of published examples I could quote off the top of my head. But if you search for “gaussian process” that’ll give you a gazillion hits, and it’s an example of the sort of thing we’re talking about.

Daniel:

This is a discussion about forking paths, not about how to fit non-linear functions. I used transformations of the data as an easy example, but it could have just as easily been using different exclusion criteria, comparing different subgroups, etc.

Corey:

My point is that Frequentist techniques *do* provide valid methods of inference from the issues of forking paths (i.e. you could use an alpha spending function where you spend 0.025 on the first hypothesis, 0.0125 on the second, etc.)…but no one uses them (a) because they don’t realize they have to or (b) because they don’t want to due to weakening their inference. Unless anyone has any evidence to the contrary, switching to Bayesian inference doesn’t answer either (a) or (b).

Daniel:

Oops, with the comparison of subgroups example, that WAS an example where coming up with the Bayesian prior was actually fairly straightforward…but at the same time, that’s also an example where having an extremely low p-value means the posterior probability that the sign of your estimate is right is very high (unless you really believe there is a high probability of the true null), so it falls into Gelman’s rule of “Why we (usually) don’t have to worry about multiple comparisons”.

Yes, I know having high certainty that you have the sign right is not particularly important…but I think *that* this is the correct argument about why letting p-values decide publication is really stupid.

err, “Yes, I know having high certainty that you have the sign right is not particularly important…” should read “Yes, I know having high certainty that you have the sign right is not particularly impressive…”

“This is a discussion about forking paths, not about how to fit non-linear functions”

In a very real sense at a fundamental level these are equivalent problems.

Let TR(data) be a computational function implemented in a formal language, with length less than 10^6 bytes for example. A “forking paths” analysis is one in which instead of a single well-theoretically founded transformation, we have a nontrivial set of these transformations from which to choose, and we choose one after considering several.

The Bayesian version of this is to provide weights over all the transformations you want to consider… and then come up with posterior weights. So long as the posterior weight is near 1 on the one you finally choose then the “choose the one you like” version is equivalent to the full Bayesian analysis. But when the data doesn’t strongly pick-out exactly one of the transforms, truncating the others out of the fit is a bad approximation to the posterior… it’s gaming the system in precisely the way that p-hacking games the system.
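As a toy version of "weights over all the transformations", here is a sketch that puts equal prior weight on four candidate transformations of a predictor and computes approximate posterior weights. Note the assumptions: the four transformations, the simulated data, and the equal prior are made up for illustration, and the marginal likelihood is replaced by a BIC approximation rather than a full Bayesian fit.

```python
import math
import random

# Hypothetical set of candidate transformations of the predictor x.
TRANSFORMS = {
    "identity": lambda x: x,
    "log": math.log,
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
}


def rss_of_line_fit(t, y):
    """Residual sum of squares from a least-squares line of y on t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return sum((yi - intercept - slope * ti) ** 2 for ti, yi in zip(t, y))


def posterior_weights(x, y, prior=None):
    """Approximate posterior weight on each transformation (BIC approximation)."""
    names = list(TRANSFORMS)
    n = len(x)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}
    log_w = {}
    for name in names:
        t = [TRANSFORMS[name](xi) for xi in x]
        # Every candidate has the same number of parameters, so the BIC
        # penalty cancels and -BIC/2 reduces to -(n/2) * log(RSS/n).
        rss = rss_of_line_fit(t, y)
        log_w[name] = math.log(prior[name]) - 0.5 * n * math.log(rss / n)
    top = max(log_w.values())
    unnorm = {name: math.exp(v - top) for name, v in log_w.items()}
    total = sum(unnorm.values())
    return {name: v / total for name, v in unnorm.items()}


# Simulated data in which the log transformation is the true one.
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(200)]
y = [2.0 * math.log(xi) + rng.gauss(0, 0.2) for xi in x]

weights = posterior_weights(x, y)
best = max(weights, key=weights.get)
for name, w in weights.items():
    print(f"{name:>8}: {w:.4f}")
```

When one weight is near 1, "pick the one you like" and the full average agree. With noisier data or fewer points the weights spread out, and that is exactly the case where reporting only the winner throws away real uncertainty. The sketch also says nothing about transformations that were never put on the list.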

Now, although in theory we can make a correspondence between TR(data) in the space of computational formal language programs, actually doing this kind of thing in practice involves coding some Stan program for example with a flexible representation of the transformation parameterized by several parameters of interest.

a reader: that would appear to be the point of your second comment in the thread (tagged October 10, 2017 at 10:50 pm) but not the comment I replied to originally (October 10, 2017 at 4:03 pm). I don’t disagree with your second comment.

a reader, replying to Corey, said:

“My point is that Frequentist techniques *do* provide valid methods of inference from the issues of forking paths (i.e. you could use an alpha spending function where you spend 0.025 on the first hypothesis, 0.0125 on the second, etc.)…but no one uses them (a) because they don’t realize they have to or (b) because they don’t want to due to weakening their inference. Unless anyone has any evidence to the contrary, switching to Bayesian inference doesn’t answer either (a) or (b).”

Yes, to the extent that there are “valid” frequentist techniques for dealing with multiple inference. However, 1) Bayesian or other methods using shrinkage estimates are also valid ways of dealing with multiple inference, and have the advantage that they generally give tighter estimates than frequentist methods.

2) The problem of forking paths can also occur without intentional multiple inference, by letting the data influence the choice of analysis.

I do agree that there is a lot of ignorance of the problem (and of methods for dealing with it), and that reluctance to use methods that might give weaker “inference” than less valid methods is also a big problem.

Daniel:

The issue with forking paths is not that there’s no correct way to account for this in a frequentist setting (there are), but rather that in practice, researchers don’t account for this, and very typically don’t realize they are making a mistake.

My point isn’t that Bayesian analysis *can’t* address this issue, but rather that I don’t think the typical researcher using Bayesian methods is any more likely to properly address it, especially as we push Bayesian methods to more users who have a weaker background. To illustrate:

“The Bayesian version of this is to provide weights over all the transformations you want to consider… and then come up with posterior weights.”

But do you know of anyone who does this? I would love to see published literature that discusses such a prior, but truth be told, I doubt I will. Moreover, exactly like the p-value argument, it should be over all possible transformations you would ever consider…even if you really stopped at the first model you hypothesized!

So Bayesian methods provide a solution…that virtually no one will ever use. I would argue this is even worse, because (a) people think that they don’t have to worry about multiple comparisons since they have used Bayesian methods (but don’t realize that to really properly control for this, they would need to write a prior that accounted for any possible way they would ever look at the data, every possible way they would decide that a data point is corrupted and should be dropped, etc.) and (b) they will have even higher overconfidence in their findings because their inappropriate prior probably led to a tighter credible interval than the corresponding confidence interval.

Martha:

I agree with what you’ve said about many Frequentist methods being inefficient when it comes to multiple comparisons and that standard Bayesian methods often provide tighter inference in the face of multiple comparison, especially when one can borrow information about related effect sizes. In fact, standard MLE estimates look positively worthless if one has a model with a large number of nuisance parameters for which we already have some insight into their effect size.

But I’ll also note that regularization methods, such as LASSO + ridge regression, are completely justified in the Frequentist framework, so regularization is not purely owned by the Bayesian camp. One of the nicest things about looking at it from the Bayesian perspective is that it gives us real insight about how much to penalize; the LASSO prior looks a little silly if you have good understanding of all your variables.

I will also note that Ben refers to the issue of dropping data, not just forking paths.

Yes, the key (as I understand it) is that in order to calculate a p-value you need a sampling distribution, and the sampling distribution is the distribution of values your test statistic would take on via the identical analysis if the null were true. So this isn’t a general statement about the abuse of flexibility, it’s actually part of the definition of a p-value. If you would have done anything differently with different data, then the sampling distribution used to calculate the p-value is incorrect.

Ben:

To continue along those lines: Bayesian data analysis also requires a sampling distribution: it comes into the likelihood and the predictive distributions. Perhaps more to the point, a Bayesian posterior distribution can be summarized in a million ways, hence forking paths in what to report. Still I think these concerns are much worse for p-values, which are *entirely* about forking paths.

The more general principle I’d include is that in science you should be open about absolutely everything you do. You have no right to expect anyone else to take your word for anything.

+1

I tried to come up with a list similar to what you describe here:

http://sometimesimwrong.typepad.com/wrong/2017/06/whatisrigor.html

Basic logic course helpful. However after reading extensively about statistics controversies, I believe that use of statistics in biomedical research remains an open question. I lean to Rex Kline’s view that use of statistics will recede.

Correct & incorrect definitions

Drop the attitude that the data/universe owes you significant results because you worked hard to get the data. Work towards creating a system where not getting results does not hit you so hard as it now does.

Sofia:

Yes. In the immortal words of John Tukey, “The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

Andrew,

But “not getting results” is unhelpful here, right? This is where the p-value thinking hurts in a hidden way – we lose all ability to think about “precision” as “meaningful and useful uncertainty” when we think of uncertainty instead as “an effect exists” or “something worked”. I’m actually not totally sure how similar our thinking is on this. But doesn’t this seem more like a case for putting our efforts as much into bounding the possible size of effects as it does about whether or not some regression does or does not produce “significant” results? That whole bit just needs to disappear.

I think we agree up to that point, but how about this as a generic statement of findings for a hypothetical empirical exercise: “Our models estimate an effect of a unit change in X on outcome Y of between A and B”… then who cares whether there is a zero in between A and B? Just get the best estimates of A and B possible. If you are working in a world where the units don’t matter at all or mean anything and all you can do is test “group 1 different than group 2″… I mean, why would you be there? – especially if you were the one who designed the experiment and data collection? Why wouldn’t you have measured something meaningful?

I guess I just worry that once you are in the world of “it worked” or “it didn’t work” you might already have lost (or at least lost your way). You as in the second-person tense there. But back to you as in Andrew, I’m not sure if we think about this problem in the same way when it comes to the thinking about exactly how to “quantify uncertainty” and/or the actual implementation of that concept in research practice. I thought this was a reasonable place to press on it, since your response here seemed to (uncharacteristically to me) embrace the “it worked/didn’t work” framework. And I still struggle sometimes with translating the conceptual idea (“embrace uncertainty”) with the practice of making statistical arguments in scientific research.

Or maybe I’m just misunderstanding what you meant by “reasonable answer” in your quote above, and now I look like I’m straw-manning you here…which, ok, but I still wanna ask.

dy/dx is probably between A and B *is* a result ;-)

The problem is we fetishize results like “handing out toothpaste before Halloween at public schools reduces cavities by age 12 p < 0.01”

whereas handing out toothpaste before Halloween at public schools alters cavities per person per year by between -0.002 and +0.004 is considered “failure to discover anything significant”

umm…. ಠ_ಠ
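The "between A and B" style of reporting discussed above can be sketched in a few lines. This is only an illustration: the data are simulated with a made-up true slope of 0.001 and made-up noise, and the interval uses a normal approximation (1.96 standard errors) rather than a t quantile or a posterior interval.

```python
import math
import random


def slope_interval(x, y, z=1.96):
    """Least-squares slope of y on x with an approximate 95% interval.

    Returns (low, estimate, high). Uses a normal approximation, which is
    an assumption of this sketch and is reasonable only for larger samples.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope - z * se, slope, slope + z * se


# Simulated data: a tiny true effect measured with modest noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [0.001 * xi + rng.gauss(0, 0.05) for xi in x]

low, estimate, high = slope_interval(x, y)
print(f"a unit change in X alters Y by between {low:+.4f} and {high:+.4f}")
```

Whether or not the printed interval happens to include zero, its width is the finding: it bounds how large the effect could plausibly be.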

Think long and hard about the relationships between your research questions, study design, and measurement. You may be asking interesting questions, but with the wrong design or measurement, you won’t be able to answer them.

+1

+2

Try to observe how your data are being collected.

Don’t do what Donny Don’t does, even though Donny Don’t does publish in PPNAS, Science, Nature, does get lots of press and funding and…

Extremely cynical recommendation: When you read an article start with the assumption that the authors and reviewers did not know what they were doing. A really good paper will convince you that this assumption was wrong.

A little cynical, but more importantly: very pragmatic.

Don’t let anyone tempt you to glitzify, TEDdify, or Gladwellify your findings.

Andrew if you are at ASA, could you post your presentation here on your blog? I’m sure your audiences would be eager to read it.

Thanks in advance. Regards, Sameera

From PLoS Computational Biology, 2016:

“Ten Simple Rules for Effective Statistical Practice”

Robert E. Kass, Brian S. Caffo, Marie Davidian, Xiao-Li Meng, Bin Yu, Nancy Reid

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961

Their proposed rules (elaborated for 1-2 paragraphs each in the article) are:

“Rule 1: Statistical Methods Should Enable Data to Answer Scientific Questions

Rule 2: Signals Always Come with Noise

Rule 3: Plan Ahead, Really Ahead

Rule 4: Worry about Data Quality

Rule 5: Statistical Analysis Is More Than a Set of Computations

Rule 6: Keep it Simple

Rule 7: Provide Assessments of Variability

Rule 8: Check Your Assumptions

Rule 9: When Possible, Replicate!

Rule 10: Make Your Analysis Reproducible”

Building off Andrew’s #3, no more sum scores. Structural equation modeling or multilevel modeling allow us to explicitly account for measurement error within the analysis.

SEM comes with its own long laundry list of problems though.

no doubt