Discussion on preregistration of research studies

Chris Chambers and I had an enlightening discussion the other day at the blog of Rolf Zwaan, regarding the Garden of Forking Paths (go here and scroll down through the comments).

Chris sent me the following note:

I’m writing a book at the moment about reforming practices in psychological research (focusing on various bad practices such as p-hacking, HARKing, low statistical power, publication bias, lack of data sharing etc. – and posing solutions such as pre-registration, Bayesian hypothesis testing, mandatory data archiving etc.) and I am arriving at a rather unsettling conclusion: that null hypothesis significance testing (NHST) simply isn’t valid for observational research. If this is true then most of the psychological literature is statistically flawed.

I was wondering what your thoughts were on this, both from a statistical point of view and from your experience working in an observational field.

We all know about the dangers of researcher degrees of freedom. We also know how easy it is to obtain significant p values in exploratory analyses that are meaningless and misleading. Dorothy Bishop has a great example on her blog of a four-way ANOVA conducted on a null data set in which a significant p value for at least one main effect or interaction will be found at least 50% of the time: http://deevybee.blogspot.co.uk/2013/07/why-we-need-pre-registration.html

Given the threat of researcher degrees of freedom, do you feel that NHST is ever an appropriate approach to exploratory (unregistered) inferential statistical analysis? And, given these concerns, why should anyone believe the outcome of an NHST procedure that isn’t pre-registered?
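
(As an aside, Bishop's point is easy to check by simulation. Here is a minimal sketch in Python, an illustration rather than her code, that generates pure-noise data for a 2x2x2x2 between-subjects design and counts how often at least one of the 15 main effects or interactions reaches p < .05.)

```python
# Minimal sketch of Bishop's example (not her code): a 2x2x2x2 between-subjects
# ANOVA run on pure noise, counting how often at least one of the 15 main
# effects or interactions comes out with p < .05.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n_per_cell, n_sims = 5, 1000
levels = [0, 1]
cells = [(a, b, c, d) for a in levels for b in levels for c in levels for d in levels]

false_alarms = 0
for _ in range(n_sims):
    rows = [(a, b, c, d, rng.normal())
            for (a, b, c, d) in cells for _ in range(n_per_cell)]
    df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "y"])
    fit = smf.ols("y ~ C(a) * C(b) * C(c) * C(d)", data=df).fit()
    pvals = anova_lm(fit)["PR(>F)"].drop("Residual")  # 15 effect terms
    false_alarms += (pvals < 0.05).any()

print(false_alarms / n_sims)  # roughly 1 - 0.95**15, i.e. about 0.54
```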

I replied: my brief answer is that different methods, derived from different philosophies, can be mathematically equivalent to each other. So, null hypothesis significance testing is equivalent to a classical confidence interval, which is equivalent to Bayes with a flat prior, which can make sense if the effect sizes are large and the measurement error is small.

Chris responded:

If the best and only defence to researcher degrees of freedom is pre-registration, then how can scientists securely interpret p values in observational research? How can they even interpret them in their own research, given our own unconscious bias? That is, doesn’t interpreting a p value carry the concrete requirement that no researcher dfs have been exploited?

I would also be very interested to hear your critique of pre-registration (as it applies to your research) in more detail. What is it specifically about pre-registration that would have prevented your most important discoveries? All pre-registration does is enable readers to distinguish confirmatory analysis from exploratory analysis – it doesn’t block exploratory analysis or hinder it in any way (that I can see). That being so, and assuming your major discoveries stemmed from exploratory analysis, why would having those same exploratory analyses form part of a pre-registered study make any difference to their interpretation or impact? (or would it have changed your mindset in some way, e.g. by making you more conservative in your approach?). I find this discussion intriguing because I’ve never seen pre-registration as an enemy of exploration, only as an aid to distinguish hypothesis testing from hypothesis generation.

I replied:

I don’t think the existence of preregistration would have killed my results, and I support proposals in psychology and political science to allow preregistration to be done in an open way. I just wouldn’t want preregistration to be required, indeed the concept of preregistration would seem to me to be just about impossible to apply in the analysis of public datasets such as we use in political science. And even in our analysis of non-public datasets, we learned most of what to look at after looking at the data.

And here’s Chris again:

I don’t think pre-registration should be mandatory either. Though I think it should be strongly encouraged in fields where undisclosed flexibility is identified as a major cause of false discoveries (which is certainly the case in psychology and cognitive neuroscience). As you say, it’s more challenging for areas that rely on analysis of existing datasets. In psychology I think the solution in that case is to consider all analyses of existing datasets as (by definition) exploratory and thus most valuable in terms of hypothesis generation and modeling.

Having said that, I don’t know if you were aware of this but the revised Declaration of Helsinki (to which major psychology and neuro journals adhere) now requires mandatory pre-registration. See clause 35 especially here: http://jama.jamanetwork.com/article.aspx?articleid=1760318#ResearchRegistrationandPublicationandDisseminationofResults

I took a look at the relevant section: “Every research study involving human subjects must be registered in a publicly accessible database before recruitment of the first subject.”

I’m not clear what it means for a study to be “registered.” I don’t know that this would require the analysis to be specified ahead of time.

Chris responded:

Indeed, quite possibly not. But it raises the bar substantially and normalises the idea of saying something about what the researcher is going to do before doing it. From there it becomes not a question of “did you pre-register?” but of “what did you pre-register?”

I guess, as a start, that people will preregister the designs in their NIH proposals. I’m hoping they will be registering their data as well.

And here are my previously published thoughts on preregistration in political science.

Comments

  1. I am, unapologetically, a frequentist. That said, several years ago I reached the same conclusion as Chris, namely that significance testing and p-values simply aren’t valid for most observational research. Harkening back to the discussion a few blog posts ago on Pearl’s blog, I personally subscribe to his first two options (for they are not mutually exclusive) for causal inference: 1) confine (most) causal inference to randomized trials (I added “most” because occasionally ceteris paribus can be invoked in non-randomized settings); 2) otherwise, stick with data description. Although, perhaps I’ll soon change my mind, because I’m currently reading his book. Anyway, I think even Fisher (at least, later Fisher) might agree with Chris and me since he himself wrote that “the physical act of randomization . . . is necessary for the validity of any test of significance” (Design of Experiments, 8th ed, p. 45).

  2. Pre-registration should be a necessary prerequisite for experimental studies that later want to wear the label of being confirmatory. Possibly not for anything else.

    I think all other forms of research, everything that is not confirmatory experimental work, will benefit from this practice being adopted by experimentalists, to the degree to which they are connected to confirmatory experiments. Investigations of pre-existing data sets aren’t experiments – pre-registration makes little sense here. However, a friend of mine proposed that some of the benefits of pre-registration might become available for such analyses if, for example, the analysis scripts (e.g. your R scripts) are put on an online, public version tracking system. That way, at least some of the (necessary) parameter tuning is public.
    But if experimentalists are no longer incentivised to dress up their explorations as confirmations, real exploratory research should rise in the standings, because it doesn’t get as much unfair competition anymore. In a way, it’d be an incentive for admitting uncertainty, whereas now, experimentalists are incentivised to pretend certainty.

    I’m sure this is mostly in alignment with Chris’ point.

  3. One point that studies might make clearer is whether they were exploratory. Often an exploratory study tries to disguise itself as confirmatory.

    • Yes, and the study can disguise itself even to the researchers themselves! That is, someone like Bem or Tracy/Beall can think they are doing a confirmatory study, even if they’re not. A key difficulty, I think, is that the studies of Bem, Tracy/Beall, etc., are intended to be confirmatory of a scientific hypothesis (e.g., that there is ESP that manifests itself in successful guesses in a particular experimental setup, or that ovulation influences social behavior in a way that manifests itself in colors of clothing) but they are exploratory in a statistical sense in that the specific analyses and statistical comparisons being performed are highly contingent on data (the “garden of forking paths” issue).

  4. Statisticians’ _surprise_ on realising how tenuous, at best, the attempt is to get that probability (given the null hypothesis is true) of results like this or larger (and hence that nice Uniform(0,1) distribution of p-values) seems to recur fairly often. Not that long ago, someone even published a paper claiming that by definition the distribution of p-values would be Uniform(0,1) in observational studies. The residual confounding that will almost always be present in observational studies blocks that determination. Technically, randomisation is sufficient but not necessary for it to be determinable.

    Sander Greenland has written extensively about this in epidemiology and suggested multiple bias analysis (MBA) as an alternative about 10 years ago. But people do keep rediscovering the problem.

    As for registering observational studies, a co-operative group in Canada is now doing this regularly, where each of about 8 of them in different provinces (with different databases) agree on an analysis protocol, carry out the analyses quite separately and share the results concurrently. Differences in results are then investigated. Now if they all get the same result, all it might mean is the confounding is simply the same and they are all getting the wrong answer (but they can try to rule that out as well).
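
    (On the Uniform(0,1) point above: a minimal simulation sketch, with made-up numbers, of why randomisation gives the p-value its advertised null distribution while an unmeasured confounder breaks it.)

```python
# Sketch: with a true null treatment effect, p-values are roughly Uniform(0,1)
# under randomization (about 5% fall below .05), but not when an unmeasured
# confounder U drives both exposure and outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims = 200, 2000

def one_pvalue(confounded):
    u = rng.normal(size=n)                      # unmeasured confounder
    if confounded:
        exposed = (u + rng.normal(size=n)) > 0  # exposure tracks U (observational)
    else:
        exposed = rng.permutation(n) < n // 2   # randomized assignment
    y = 0.5 * u + rng.normal(size=n)            # outcome; true exposure effect is zero
    return stats.ttest_ind(y[exposed], y[~exposed]).pvalue

for label, confounded in [("randomized", False), ("confounded", True)]:
    p = np.array([one_pvalue(confounded) for _ in range(n_sims)])
    print(label, "share of p < .05:", (p < 0.05).mean())
# randomized: close to 0.05 (uniform p-values); confounded: far above 0.05
```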

        • OK. I don’t know much about their process, but hopefully they are using some good group facilitation to get an agreed-upon design.

          I used to be involved (~1990) in this with convening expert panels to develop judgemental predictive indexes. Our research team’s social science member would solicit separate starting views from each before they came to the meeting as a way to start facilitating. I always thought his role was the most important, especially when I had repeatedly seen how quickly 7 to 9 individuals who previously had little contact would converge into 2 or 3 polarizing groups and how he could often slow that down or reverse it.

          Recently had a short interesting discussion with this guy about how that’s done these days – http://www.ted.com/talks/eric_berlow_and_sean_gourley_mapping_ideas_worth_spreading.html or http://vibrantdatalabs.org/

  5. How does randomization guarantee validity of NHST results? It seems to me all the same problems are faced, only less so since at least some effort has been put towards minimizing baseline differences. NHST seems inappropriate in all cases where there is not a plausible prediction to use as a null hypothesis.

    • Adding to this. If you know so little about a subject matter that you cannot make a point or interval prediction, isn’t exploratory analysis the appropriate thing to do? “Confirming” that two groups are different or the correlation is not exactly zero doesn’t seem very helpful in figuring out what is going on when you prod the black box.

    • It has nothing to do with baseline balance or baseline differences on a population level, randomization allows one to make individual-level causal inferences based on the finite counterfactual populations… Even if one cannot really pinpoint to which individuals the inferences apply (e.g., can’t really tell who benefited from receiving a drug, only that some individuals did). Here are two of my favorite papers: http://www.ncbi.nlm.nih.gov/pubmed/12413233
      http://www.ncbi.nlm.nih.gov/pubmed/7997705

      Of course, this all assumes that a randomized study is “well-conducted” (i.e., really randomized), with all the bells and whistles like allocation concealment, very high retention, etc.

        • Mark, interesting papers. I have a question regarding Stewart 2002. Let’s first use his example of drug vs placebo where the drug group had lowered blood pressure on average.

        1) There are limits to the amount blood pressure can be reduced (floor effect; the relationship between deltaBP and BP is not linear). Say the placebo group happened to have lower blood pressure at baseline and simply taking part in the study (seeing the doctor) lowers BP. Would we not expect to see smaller deltaBP for the placebo group, thus our conclusion that the drug caused reduced BP is faulty?

        2) What about studies where there is no baseline? What is the baseline “cancer growth rate” of each individual when studying carcinogenesis? What is the baseline “capacity for recovery” in studies of brain injury? Looking at control results for these types of studies reveals huge variability, indicating factors at play just as strong as any effect of a treatment could be. Until these processes are understood well enough so that we can experimentally adjust them at will, it is only safe to assume we have no idea. If you find out later these baselines happened to differ between the groups, would you still maintain a difference observed originally was caused by the treatment?

        I would claim that causality can only be asserted when we are able to predict results at the individual level and use that as the null hypothesis, because any difference observed can always be plausibly explained by unknown baseline differences. The presence of apparent treatment effects at the population level is only an indication that something is worth looking at further.

        If your null hypothesis is “no difference” and we always expect baseline differences to exist, the best you can say is that results are consistent with a treatment effect (Affirming the Consequent). There is no opportunity for Modus Tollens arguments regarding the research hypothesis. The results are useful for making preliminary claims and future hypothesis formation (exploratory research), but not suitable for drawing strong conclusions (confirmatory research).

        If I am confused or just wrong somewhere please point it out. This is something that has been bugging me for awhile.

  6. On the very next post on Rolf’s blog I wrote a comment that contained an argument against pre-registration. It got just one response, but I would really like to hear the thoughts of others about this argument. I should say upfront that I am not really opposed to pre-registration, but I think this argument suggests it is rather silly for many situations in experimental psychology.

    My concern is about what should be inferred when a researcher sticks to the plan. Does success for a pre-registered strategy lend some extra confidence in the results or in the theoretical conclusion? Does it increase belief in the process that produced the registered hypotheses? A consideration of two extremes suggests that it does not.

    Extreme case 1. Suppose a researcher generates a hypothesis by flipping a coin. It comes up “heads”, so the researcher pre-registers the hypothesis that there will be a significant difference of means. The experiment is subsequently run and finds the predicted difference. Whether the populations truly differ or not, surely such an experimental outcome does not actually validate the process by which the hypothesis was generated. For the experiment to validate the prediction of the hypothesis (not just the hypothesis itself), there needs to be some justification for the prediction.

    Extreme case 2. Suppose a researcher generates a hypothesis by deriving an effect size from a quantitative theory that has previously been published in the literature. The researcher pre-registers this hypothesis and the subsequent experiment finds the predicted difference. Such an experimental finding may be strong validation of the hypothesis and of the quantitative theory, but it does not seem that pre-registration has anything to do with such validation. Since the theory has previously been published, other researchers could follow the steps of the original researcher and derive the very same predicted effect size. In a situation such as this it seems unnecessary to pre-register the hypothesis because it follows from existing ideas.

    Most research problems are neither of these extremes, but I still cannot see a situation where pre-registration helps. If the predicted hypotheses (and methods and measures) are clearly derived from existing theory, then pre-registration does not add much to the investigation. On the other hand, if the hypotheses (and methods and measures) are not clearly defined by existing theory, then pre-registration cannot change that situation.

    To put it another way, if a researcher is doing fully confirmatory work, then pre-registration is not necessary. If a researcher is doing fully exploratory work, then pre-registration should not be done at all. A problem we have in psychology is that many people think only confirmatory work is proper and that exploratory work is non-scientific. To the contrary, both processes are essential to science.

    Moreover, it is not true that only confirmatory work can reject or validate theoretical predictions. The difference between confirmatory and exploratory work is mostly about the efficiency of the experimental design. Confirmatory work is focused on specific questions, so the design emphasizes getting answers to those questions and is likely to give definitive answers. Exploratory work is less focused on specific questions, so the design is less likely to produce definitive answers to any questions (but it might, just by happenstance).

    For some of the specific cases where people have argued for pre-registration, the true problem was that the reported data did not provide a convincing argument for or against presented theoretical ideas. If researchers will just pay attention to the uncertainty in the measurements relative to the considered theoretical ideas, then it does not really matter whether the design is confirmatory or exploratory or whether the experiment was pre-registered or not.

    • Hi Greg, that’s an interesting argument but in my opinion it doesn’t capture the benefits of pre-registration.

      If we take your Extreme case 1 – it’s basically a badly motivated experiment. Pre-reg can’t help provide a rationale for a badly reasoned (or random) hypothesis. But neither is it meant to, so that’s a red herring. What pre-registration does prevent is publication of HARKed hypotheses, that is, hypotheses derived from the data which are then presented as a priori. This practice (which is by no means rare in psychology – see John et al 2012 and Kerr 1998) is not possible if the hypothesis is pre-registered. This means that your random or poorly reasoned hypothesis is more likely to appear in a non pre-registered article via HARKing than to be supported by the results of a pre-registered study.

      Moving to Extreme case 2. This doesn’t actually seem all that extreme. The Registered Reports submissions we’ve had so far at Cortex* followed the approach of deriving an effect size from previous work. You argue that if the researchers find an effect predicted by their hypothesis then “it seems unnecessary to pre-register the hypothesis because it follows from existing ideas.” But this argument overlooks the fact that pre-registration is much more than specifying hypotheses based on previous ideas. It also involves pre-specifying the analysis pipeline for the primary outcome measures, or at a minimum pre-specifying the range of contingencies for any data-led decisions. So while pre-registration isn’t necessary to perform this kind of confirmatory study (nobody has argued that it is), a pre-registered study assures the reader that researcher degrees of freedom were not exploited to produce a desirable outcome. As Andrew points out in his ‘forking paths’ article, these researcher dfs can even be exploited unconsciously when researchers THINK they are engaging in a confirmatory study.

      The upshot is simply this. If we accept that:
      1) a p value that is acquired through exploitation of researcher degrees of freedom has less evidential value than one that isn’t; and
      2) that pre-registration prevents (or at least greatly reduces) researcher degrees of freedom

      then the logical conclusion is that pre-registered studies that rely on NHST will, on average, have higher evidential value.

      You say “it is not true that only confirmatory work can reject or validate theoretical predictions.” I would say this can be true depending on how you define “validate” and “predictions”. Purely exploratory analysis with no research question and no hypotheses can certainly produce valuable findings that are theoretically useful. But it arguably isn’t hypothesis testing; it is hypothesis generation. In psychology, this kind of approach would be useful for specifying hypotheses to be pre-registered and studied in follow up experiments.

      I agree completely with you about the value of exploratory research. It’s a great travesty that psychology and cognitive neuroscience have fallen into the trap of only valuing work that appears to be hypothesis-driven. One of the benefits of pre-registration is that it draws a clear dividing line between the kinds of studies that are suitable vs. not suitable for hypothesis-testing (and hence pre-reg). Drawing this line could increase the value of exploration. One can imagine new formats of article springing up which are exclusively exploratory. No hypothesis testing. No p values. There is great value in this and it’s something we’re currently considering at Cortex. What there isn’t great value in doing, in my opinion, is casting everything as confirmatory and thereby conflating confirmation with exploration.

      *There is an additional benefit of journal-based pre-registration, which is that the journal is forced to make a publishing decision that isn’t based on the results. Therefore this also prevents publication bias.

      • I think that pre-registration just makes obvious what should already be known. It seems to me that what you hope to do with pre-registration is force researchers to commit to making a real prediction and then creating an experiment that properly tests that prediction. This is a laudable goal.

        But such a goal does not make sense if researchers do not have any hope of achieving it. Most researchers in psychology design experiments based on vague ideas, intuition, or curiosity. This is exploratory work, which we agree is important. To ask such researchers (or even invite them) to make predictions is rather silly. They may generate some predictions, but those predictions will not really correspond to much. We don’t run confirmatory studies to identify which researchers can generate good guesses. We run confirmatory studies to test aspects of theoretical ideas.

        Perhaps the goal is more reasonable if there is a theory that can generate a prediction that motivates a good experimental design/analysis. But if the theory exists, then researchers in the field will agree about how the prediction is connected to the theory, and thus pre-registration is unnecessary. If there is disagreement among researchers, then pre-registration hardly helps the situation.

        My expectation is that many researchers who are motivated to pre-register their hypotheses will quickly realise that they cannot do it because their theories are not sufficiently precise. That’s a good discovery for those researchers, but I don’t think the best outcome of an emphasis on pre-registration should be that it is not used. So the hypotheses that will be pre-registered (those based on previous findings or appropriate theories) are those that do not need to be pre-registered.

        Pre-registration does deal with some types of researcher degrees of freedom, such as using a fixed sample size (rather than practicing optional stopping), dropping conditions, and HARKing. But these are exactly the issues that are handled by a good theoretical motivation for experimental design. Without a good theoretical motivation for experimental design, researchers are engaged in exploratory work, where p values should not matter so much and practices such as HARKing do make sense; along with an appropriate cautionary interpretation.

        I think we mostly agree with regard to your last two paragraphs. In particular, I was intrigued by your suggestion that one benefit of pre-registration is to force journals to make a decision that isn’t based on the results and thereby reduce publication bias. This benefit could be true, but it seems like a way to “trick” journals into doing what they should have been doing all along.

        • I’m not sure that I see things as black and white as this.

          Unless I’m mistaken (and please correct me if I am), you seem to be arguing that the only experiment valid for pre-registration would be one in which everything about it (hypotheses, experimental procedures, analysis pipeline) was already determined and presumably identical to a previous study – i.e. a direct replication. And because everything about that previous study was pre-determined and known in detail, why bother pre-registering?

          And on the flip side (again, unless I’ve misunderstood), you’re arguing that any deviation from said previous hypothesis/procedure/analysis renders the experiment, by definition, purely exploratory and hence unnecessary to pre-register.

          I’m not sure whether I should go on at this point, in case I’ve got you completely wrong. But what the hell…

          Assuming that’s what you meant, I think it oversimplifies the distinction between confirmatory and exploratory testing. A theory will rarely (if ever) specify such precise conditions, and researchers always have a decision space of legal choices to make when it comes to design and analysis. It’s precisely the ability to *exploit* that decision space after the fact in the pursuit of p<.05 that makes pre-registration useful.

          You also point out that a strong theoretical motivation should provide sufficient basis for avoiding researcher degrees of freedom in the first place – and I would agree for the most constrained situation where the researcher is doing a direct replication (and has full knowledge of the experimental procedures and analysis of the previous study). But, I would argue, in the real world of psychological research, where direct replication rarely happens, this theoretical motivation is never strong enough to be of any use. It is far outweighed by the need to produce results where p<.05.

          I don't think that journal-based pre-registration "tricks" journals into doing what they should be doing, although I agree that they definitely shouldn't be assessing papers based on results. But they do, and they have done so for over 50 years. Journal-based pre-registration is useful because it overcomes this bias, not through trickery, but by blinding the publication process to the results. Put differently, it is nothing more than an application of one part of the scientific method (controlled blinding) to the process by which science itself is produced.

          (I agree btw that p values should not matter for exploratory work – in fact they do not seem valid at all because there is no a priori hypothesis to test). HARKing is never justified, in my opinion, because it is an act of deception to pretend that a "hypothesis" derived from the data was thought of in advance. HARKed hypotheses, by definition, cannot be falsified by the studies that propose them – they therefore distort the scientific record.

        • Just in case I was giving the wrong impression, I want to state that I appreciate your contributions to the discussions about publishing and statistical analysis, and I think that your ability to convince Cortex to set up a pre-registration system is a good thing. I am sure that it took a lot of effort to convince people that there was a problem and that something needed to change. I hope you take my comments as an exploration about the philosophical basis of scientific practice rather than as a condemnation of your efforts.

          It seems like most of your arguments are pragmatic (perhaps the extremes of purely confirmatory or purely exploratory investigations do not exist); but I would hope that the arguments for pre-registration would also apply to an ideal world. Do you agree that if I have a theory that makes a well-reasoned prediction, there is no need to pre-register? Do you agree that a researcher who generates predictions by guessing also has no reason to pre-register?

          In the more pragmatic cases, I think you are concerned about various kinds of deceptions that are available for researchers to (perhaps unintentionally) exploit. But from my perspective these things only matter if the researcher is claiming evidence for a theory, and it is pretty easy to identify when that is not really true. If a paper says, “as predicted by the theory…… p=.02” it is usually pretty easy to tell whether or not this was a real prediction. The reader need simply follow the logic that led to the prediction. How was the predicted effect size (or distribution of the effect size) generated? How was the sample size for the experiment selected? How were the measurement variables identified relative to the theoretical predictions? Typically the reader does not go past the first step because in psychological research there is rarely a predicted effect size. Right then the reader knows that what the article discusses is actually exploratory work. It may be well-motivated and interesting exploratory work. I agree with you that in a case like this researchers would be better off to skip the p-values and instead focus on the measurement precision. Note too that then most of the researcher degrees of freedom disappear because there is no p-threshold to cross and no big claim in the conclusion. Researchers can still choose to mislead, but that is fraud and pre-registration is not going to stop it.

          My use of the term “trick” was perhaps unfair. Still, I would like to be in a situation where journals publish good experimental results because the experimental design is good, the measurements are meaningful, and the information is valuable. That can happen with or without pre-registration, and I would be disappointed if the only way to publish null findings is via pre-registration.

          If what you mean by HARKing is to hypothesise after the results are known and then to claim that the hypotheses were generated first, then we are in agreement that this is not proper. It is fraud. Such behaviour is also pretty easy to catch (because there is insufficient justification of the theoretical prediction). On the other hand, I don’t see a problem with hypothesising after the results are known and presenting those hypotheses as interesting possibilities. I hope scientists do learn something from their data.

          Consider an alternative situation. Suppose scientists and journals were more accepting of exploratory work and that when doing such work scientists reported the uncertainty in their measurements (confidence intervals or highest density intervals) and did not bother with hypothesis testing. Moreover, suppose scientists started to develop quantitative theories that predicted effects and promoted the design of confirmatory experiments that appropriately test those predictions. That’s the kind of situation I want for psychological science, and I don’t see that pre-registration needs to play a role.

        • Greg:

          1. Thanks much for the detailed discussion. Just as I blog for free and (I hope) provide a public service, you and others comment for free and greatly enhance what this blog has to offer.

          2. You write:

          If what you mean by HARKing is to hypothesise after the results are known and then to claim that the hypotheses were generated first, then we are in agreement that this is not proper. It is fraud.

          I disagree. As Eric and I discuss in the Garden of Forking Paths paper, I think that harking occurs by accident all the time.

          To consider three notorious recent examples, Kanazawa, Bem, and Tracy/Beall obviously, obviously harked. But I don’t know if they realized they were harking! They state that they really did choose their hypotheses ahead of time, and I have no reason to doubt their sincerity in these statements. The problem is, yes, they chose their general scientific hypotheses ahead of time, but, no, they did not choose their specific statistical comparisons (or data inclusion rules) ahead of time.

          I think that so much of the confusion around these issues arises from conflation of the general scientific hypothesis and the specific statistical comparison. And it doesn’t help that so many statistics textbooks (including mine!) are somewhat sloppy about this point, that a single scientific hypothesis can be studied by so many different statistical comparisons.

        • Thanks Greg. Not to worry, you haven’t given any bad impression at all. I appreciate the discussion – your argument is actually one of the clearest and most well reasoned I’ve seen against pre-registration. This is the kind of robust discussion I enjoy greatly!

          With respect to the questions you raised:

          Do you agree that if I have a theory that makes a well-reasoned prediction, there is no need to pre-register?

          * I don’t think pre-registration should ever be mandatory so I wouldn’t say there is ever a “need” to pre-register. However I would say that it is in both your interests and mine to do so. By pre-registering you assure me that you haven’t exploited researcher degrees of freedom to support your hypothesis, tweaking arbitrary aspects of the analysis (any of which could be “legally” justified by previous work) to, e.g., push a p=.06 to p=.04. Such decisions could be as simple as altering criteria for excluding outliers or deciding whether or not to include a covariate. And it is in your interests to pre-register because it eliminates doubt of such df exploitation, and it also helps protect you from your own confirmation bias and desire to find a particular result. It is easy for us to forget the various exploratory analyses we conduct in routine treatment of data, and then fall into the trap of focusing our attention on the outcomes that we already believe. We are all human, after all, and all fallible. This is of course one reason why pre-registration of study protocols is standard in clinical medicine. When society decides that bad practice can cost lives, we pre-register. When it can’t, we settle for more lax standards. That’s not a rational societal position.

          Do you agree that a researcher who generates predictions by guessing also has no reason to pre-register?

          * Well, first I don’t think a researcher should ever generate a prediction by guessing, and I don’t think it happens anyway. The closest scenario, I suppose, would be a HARKed hypothesis that is shoehorned into a study to “predict” an unexpected finding – in this case, since the unexpected finding is probably a false positive, the hypothesis is effectively a post hoc guess. With enough trawling of the literature it is easy enough to find a result to support almost any hypothesis. However, to answer your question more directly: if a researcher truly has no rationale for a particular hypothesis, then they shouldn’t propose one. And if there is no hypothesis then there is no need to pre-register: if all of the analyses are explicitly exploratory then the purpose of pre-registration – to distinguish confirmation from exploration – is moot.

          I agree with Andrew that HARKing can indeed be unconscious and I think a lot of the time it is. In the end, I don’t think it really matters whether researchers “intend” to engage in questionable practices or not. All that matters is whether they do, and whether we tolerate a system which allows the effects to damage science. At the moment we do, but solutions such as pre-registration are within our grasp.

          I agree that in a perfect world, journals would indeed be more accepting of exploratory work. We could then settle on alternative statistics in those cases (e.g. Geoff Cumming’s “New Statistics”) and save NHST and other alternative techniques for hypothesis testing for confirmatory research. I don’t necessarily think that hypothesis testing deserves any privileged place in psychology. It is valuable and it reflects our philosophical origins in the hypothetico-deductive model of the scientific method, but it isn’t the only approach that is valuable. As to whether we could survive *only* on exploration, I’m not sure. Won’t there always be a role for hypothesis testing? And where there are hypotheses, pre-registration is the natural friend to ensure that researchers aren’t tempted, either consciously or unconsciously, to exploit the publication system in the pursuit of distorted incentives.

        • While a few people who argue for pre-registration see it as a tool (a way to get closer to an ideal) in an interconnected web of mis-aligned incentives, able to improve science by re-aligning some incentives, your alternative propositions start at the desired end state (the ideal world) and I thus don’t see how you can frame them as alternatives:
          – “if scientists and journals were more accepting of exploratory work” – they aren’t and they won’t be if incentives don’t change (for example by making it harder for exploratory work to pose as confirmatory).
          – “we should improve peer review overall” – peer review has systemic problems, especially pre-publication peer review. For example, reviewers may not have the time to read all the literature necessary to evaluate whether a theoretical prediction was actually justified. It may not be possible to improve it much, but at least some articles which garner attention can be closely examined after publication (as you and the blogosphere do).

          You also say that “[HARKing] is also pretty easy to catch (because there is insufficient justification of the theoretical prediction).”
          That, I just don’t get. Yeah, you might be able to spot an embodiment researcher who obviously rationalised their result post-hoc using a proverb as “theory”. And there are still people who simply do not see value in theory, so they proudly profess their fishing prowess.
          But no, someone who HARKs has hindsight to work with, so on average you would expect them to be able to make more precise predictions and give better justifications. Yes, that might mean they had to selectively search the literature to generate these post-hoc hypotheses (and again, not necessarily intentionally, if I search for “correlation education reproductive timing” I may just happen to find more positive associations than if I search “relationship education age at first birth”).
          But, realistically, in how many fields are you able to spot a selective literature review? And yes, sometimes editors don’t find peer reviewers who straddle the right fields to do so either.
          I think pre-registration, when done right, would make this tremendously easier. I can essentially put the pre-registered plan and the actual analyses side-by-side, no deep knowledge of the field needed.
          If the precisely predictive theory were always published before an empirical test came out, that would be the same as pre-registration, but that is not the case – one disincentive is the fear of being scooped.

        • Ruben,

          You are correct that I am an idealist. I believe that most scientists want to do good work, and that they want their contributions to be valuable and useful. I also believe that most scientists are sincere about the conclusions they generate in their papers. I have recently become convinced, however, that such sincerity is often misplaced. The work that is published and the conclusions that are drawn are often not compatible. That is, even if the conclusions happen to be correct, the argument connecting data to theory is often not valid. Papers in psychology often start with a vague prediction (perhaps motivated by a vague theory). The vague prediction motivates the design of an experiment, but the experiment cannot really test the theory because the vagueness prohibits the design of a test that could adequately reject the theory or the prediction. The published result may be claimed to validate the prediction, but the whole argument was invalid. The effect may be real, and maybe even the theory is correct, but the experiment was not really an adequate test.

          I do not believe that journals have any problem with publishing exploratory work. Indeed, they hardly publish any other type of experimental work. The problem is that authors believe they are publishing confirmatory work and journals encourage them to report their findings that way.

          I may have overstated the ease of detecting HARKing. What I really meant was that it is easy to identify when the connection between data and theory is invalid. An experiment designed to test a theory needs to be explicitly motivated by the theory. Not just, “My theory predicts better recall in condition A than in condition B”. We cannot design an experiment to test such a prediction with a hypothesis test. Since there is no estimate of the effect size, we cannot know what sample size is required. If I run a test with 50 subjects in each group and I do not reject the null, I really have no basis to claim the prediction is false (maybe the prediction is valid but the effect is small). For that matter, even if I do reject the null (say, p=0.03), that still does not mean it was a good experiment for testing the theory. It is possible that just by happenstance the data do reject the prediction (e.g., recall is much better in condition B than in condition A), but this was not due to the design of the experiment.

          When someone HARKs, they usually generate a theory that is something of the form “there will be better recall in condition A than in condition B”. Since there is no way this theoretical statement could have designed a good experiment, I do not believe the data provide evidence for the theory (regardless of whether there really was HARKing).

          One potential benefit of pre-registration, which Chris alluded to, is that it forces people to sit down and think carefully about their theories, predictions, and experimental design. That should be common practice, and if consideration of pre-registration encourages such thinking then I am all for it. My skepticism about pre-registration is that there are other motivations for careful thinking, so pre-registration is not necessary.

    • I tried to leave a response to your comment on Rolf’s blog, unfortunately the authentication was broken for me and I couldn’t submit it, so it was lost.

      The gist of it was: An important and somewhat neglected point, that you didn’t really raise in your first comment but allude to here, is that pre-registration can heighten our confidence in a theory, not just in the finding. I have to read quite broadly across fields and need to assess the validity of findings and theories in other fields. In my own very narrow sub-field, I can tell a theoretically motivated analysis with the standard outcome measure apart from one that used wishful thinking, researcher df and post-hoc reasoning to derive analysis strategy and outcome choice. There, I wouldn’t _need_ pre-registration (though it would save me some time in sifting the good from the bad).

      But in other fields, I do. A pre-registered analysis plan is a proclamation of confidence in a good, testable theory. That’s why people talk about “badges”. There has to be something to back up the feat behind the badge, but people can understand and respect the badge, even if they lack all the expertise to judge the feat on their own.

      As an example, I’m interested in the genetics of personality:
      Candidate gene studies in this area have come to be viewed as mostly non-replicable nonsense, because sometimes researchers don’t find anything in multiple-comparisons-corrected GWAS and then make up some post-hoc “theory” to justify lower significance thresholds for their best candidates.
      If some researchers worked differently, for example analysing biological pathways and genetic expression, building on previous work etc. to derive their real predictions about candidate genes’ effects, this would change everything – these candidate gene studies should be very replicable.
      Even though this is one of my areas of interest, I hardly have the expertise to judge the validity of the purported theory and biological pathway behind every candidate gene. So, I treat most candidate gene studies as bollocks, they don’t change my beliefs. That’s conservative and I’d prefer for things to be different, but I can’t read it all. If there were a badge of pre-registration, I would know what to pay attention to.

      • Ruben,

        The idea that pre-registration engenders confidence in there being a good and testable theory is largely dependent on the peer review and editorial process. I suspect some reviewers and editors will do it well and some will not. It seems to me that there is not necessarily more confidence in the evaluation of a pre-registered article than in the evaluation of a non pre-registered article. Maybe there turns out to be something in the process that does improve the evaluation, but I think that just means we should improve peer review overall.

        Moreover, I think the onus is on authors. If they have a theory that drives the predictions, then that should be made explicit in the paper. My suspicion is that in many cases whatever theory is being considered is not in a form to actually generate testable predictions.

    • @Greg:

      What about fishing? Sans pre-registration I can try to correlate, say, female fertility with color of skirt, pants, scarf, cap, shoes, car, pen, house till I find a variable (say scarf color) that works & then go ahead and publish it with a significant p-value and using some form of appealing theoretical explanation to bolster why indeed scarf color works.

      Won’t fishing studies of this sort be hampered by pre-registration?

      • I don’t see how you could generate a sufficiently appealing theoretical explanation to bolster why scarf colour works. It’s not enough for a scientist to say, “I predict effect X”; there has to be justification for the prediction. The scientist has to explain how effect X is the result of various mechanisms. Moreover, for the experiment to be a good test of the described theoretical explanation, that theory has to describe a predicted correlation value (or a distribution of correlation values) and then demonstrate why the proposed experiment is a good test for that prediction. A good (frequentist) test must have a decent chance of showing whether the prediction is true or false. So if the predicted correlation value is 0.3 and the sample size is 45, then this is not a very good test (power=0.52). If such an experiment produces a non-significant result, the researcher does not know what to conclude, because the probability of success was basically a coin flip.

        Pre-registration would hamper these kinds of fishing expeditions, but sometimes researchers want to fish. These variables are easy to measure in a survey, and it is possible to get lots of samples very quickly. If there is some reason to do it, why not check if something correlates? In such a case, I would not bother with a hypothesis test, but reporting what was found seems like the proper thing to do. I guess a researcher could pre-register a plan to fish through the large data set to see if something jumps out, but why bother doing that? Of course, this supposes that there really is some reason to make these measurements.
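
        (The power ≈ 0.52 figure above can be reproduced with the usual Fisher-z approximation for a correlation test; a minimal sketch:)

```python
# Reproducing the power ≈ 0.52 figure for detecting a predicted correlation of
# r = 0.3 with n = 45 (two-sided alpha = .05), via the Fisher-z approximation.
import numpy as np
from scipy.stats import norm

r, n, alpha = 0.3, 45, 0.05
z_r = np.arctanh(r)               # Fisher z of the predicted correlation
se = 1.0 / np.sqrt(n - 3)         # approximate standard error of Fisher z
z_crit = norm.ppf(1 - alpha / 2)
power = norm.cdf(z_r / se - z_crit) + norm.cdf(-z_r / se - z_crit)
print(round(power, 2))            # prints 0.52
```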

        • There was this recent much discussed paper that claimed to show that “women who were at peak fertility were three times more likely to wear red or pink shirts”.

          It’s only a little leap from shirts to scarves, right? I think there’s tremendous leeway if the bar is low: i.e. to merely generate a plausible explanation.

        • Yes, but that study could be judged for what it was. Several of the reported effects just barely satisfied significance (one did not even meet the traditional definition, if I remember correctly) and there was no meaningful theory presented to justify the design of the experiment. Although I think those authors erred in presenting their findings as if they had been predicted by some theory, readers could fairly easily discern that a theory did not actually predict the results. The authors also erred by making claims that sounded much more definitive than were warranted by the data. Again, this is something that readers could readily identify in the data. If the authors had instead presented their findings as tentatively suggesting a relationship between fertility and wearing pink or red shirts, then it would not have been published in one of the top journals in psychology; and I don’t think we would have had much to complain about.


  8. In December 2004, I announced that the demographic factor that best predicted Presidential voting by state in both 2000 and 2004 was a statistic I had just made up that estimated the average years married between 18 and 44 for white women.

    I’ll admit that sounds pretty contrived, and I certainly wouldn’t have come up with that if I’d had to pre-register my research program.

    But then my Years Married variable worked well in 2008: John McCain carried 19 of the top 20 states on this same metric, while Obama captured 25 of the bottom 26.

    And, “Years Married” had its best won-loss record yet in 2012. Mitt Romney carried 23 of the 24 highest-ranked states. Barack Obama won 25 of the 26 lowest-ranked states. The correlation coefficient of Years Married with Romney’s share of the vote by state was 0.84 (and 0.88 if you include the District of Columbia).

    I worry that too much emphasis on good hygiene will discourage that kind of creativity.

    For graphs, see:

    http://www.vdare.com/articles/happy-white-married-people-vote-republican-so-why-doesnt-the-gop-work-on-making-white-peopl

    • Steve:

      In your case it would seem like the ideal time for registration would have been 2010.

      A culture of registration would have made it harder for you to make strong claims in 2008 – but that seems appropriate?

      There is always going to be a tension between a “hypothesis being well motivated” versus selectively reported and or picked just to explain the data. Steve Goodman pointed out once that this “previously laid out before” is just a convenient predictor of the first being true – if it’s publicly available.

      I used to use a variation on a story like yours to argue (Peirce’s view) that evidence has to be based on the background knowledge and intentionality of the interpreter. So I would add a researcher who had worked on your measure for years (published in journals unknown to you) and they become aware of your 2008 study in 2009 – they have much stronger evidence than you in 2010. In 2010, if you become aware of their work and assess it as credible – you should have stronger evidence too.

      • I made strong claims in late 2004, 2005, 2006, and in this article on February 11, 2008, more than two elections ago:

        http://www.theamericanconservative.com/articles/value-voters/

        This is the article of mine that Dr. Gelman read first, and then kindly invited me to participate in an online discussion at TPM.

        Since the r coefficient for the correlation of GOP share of the two party Presidential vote in the 50 states plus DC with average years married among white women 18-44 in 2000 has been

        2000: 0.87
        2004: 0.91
        2008: 0.88
        2012: 0.88

        I’ll make the strong claim that (assuming no Perot-like third party candidate) it will be at least 0.75 in 2016.

        • George Hawley of the U. of Houston political science department has validated that my general theory of “affordable family formation” correlating strongly with the red-blue balance per state was also true at the county level in 2000. See his recent article:

          Home affordability, female marriage rates and vote choice in the 2000 US presidential election: Evidence from US counties

          http://ppq.sagepub.com/content/18/5/771

    • Indeed, nobody (I think) is suggesting that preregistration should be required, only that it be an option that should be used in cases where people really feel they have a specific hypothesis they want to test. Which would describe almost none of my own research.

  9. Back to the original question: “Given the threat of researcher degrees of freedom, do you feel that NHST is ever an appropriate approach to exploratory (unregistered) inferential statistical analysis?”

    Let me float an idea: p-values can be meaningless or non-meaningless depending on the researcher’s state of mind, even if the analysis conducted is exactly the same.

    Imagine a drug trial that gives folks a cholesterol drug and then follows them for 5 years. Imagine two researchers who run one identical model on one identical dataset, measuring the effect on cholesterol levels at six months (not some other time period, not some other outcome such as mortality, etc.).

    But the first researcher was specifically looking for that particular outcome, whereas the second researcher chose it by coincidence while rooting around for the first positive result that turned up. The first researcher’s p-value is the answer to the question: “what is the probability of observing this data if this drug, in fact, has no effect on cholesterol at six months?”

    But the second researcher is really asking, “what is the probability of selecting an outcome on the first try as to which the data would seem significant when analyzed as if I had a predetermined outcome in mind?” And if the second researcher had in fact fished through the data and models, then the p-value is really the answer to: “Under a null hypothesis that the drug shows no effect on any possible subgroup at any time under any model specification, what is the probability of observing this very narrowly defined positive outcome?” Does that question have an answer?
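
    (A quick simulation, with hypothetical numbers, makes the fishing scenario concrete: under a global null, screening several outcomes and reporting the first nominally significant one inflates the error rate well beyond 5%.)

```python
# Sketch of the fishing scenario (hypothetical numbers): under a global null,
# the researcher checks k independent outcomes and writes up the first one
# with p < .05 as if it had been specified in advance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_group, k_outcomes, n_sims = 50, 10, 2000

hits = 0
for _ in range(n_sims):
    for _outcome in range(k_outcomes):
        drug = rng.normal(size=n_per_group)      # no true effect on any outcome
        placebo = rng.normal(size=n_per_group)
        if stats.ttest_ind(drug, placebo).pvalue < 0.05:
            hits += 1                            # "significant finding" reported
            break

print(hits / n_sims)  # roughly 1 - 0.95**10, about 0.40, not the nominal 0.05
```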

    • “what is the probability of observing this data if this drug, in fact, has no effect on cholesterol at six months?”

      Stuart,

      This is not the definition of a p-value. In practice, the p-value is used to give the probability of getting the results given that two groups (or baseline vs final, etc) are exactly the same and there is only sampling error present. This is never true, so the p-value provides little information in the context of NHST. At least in the case of the t-test, the p-value (along with sample size) also indexes an effect size, but this has little to do with the NHST procedure:
      http://arxiv.org/abs/1311.0081
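
      (A minimal sketch of that p-value-to-effect-size mapping for an equal-n two-sample t-test, with hypothetical numbers; not from the linked paper.)

```python
# Sketch: the Cohen's d implied by a given two-sided p-value and per-group n
# in an equal-n two-sample t-test.
import numpy as np
from scipy.stats import t as t_dist

def implied_cohens_d(p_two_sided, n_per_group):
    df = 2 * n_per_group - 2
    t_val = t_dist.ppf(1 - p_two_sided / 2, df)  # |t| implied by the two-sided p
    return t_val * np.sqrt(2 / n_per_group)      # since t = d * sqrt(n/2) here

print(implied_cohens_d(0.05, 20))   # about 0.64: p = .05 with 20 per group implies a sizeable d
print(implied_cohens_d(0.05, 200))  # about 0.20: the same p with 200 per group implies a small d
```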

      • One of the problems with p-values, I feel, is that the way they are defined they answer a question that’s very rarely of interest to the people using p-values.

        When such rampant confusion exists over what a metric actually even means, I attribute a big share of blame to the guys coming up with the metric itself. With p-values, the practitioner almost always needs to indulge in a mental contortion to map them to the metric he had in mind as a goal.

        • Rahul, I agree. My current opinion is that the most effective solution would be to put more focus on modelling the process that resulted in the data observed. These types of guesses are useful exercises even if wrong, while false positive p-values have negative utility if wrong.

