How is preregistration like random sampling and controlled experimentation


In the discussion following my talk yesterday, someone asked about preregistration and I gave an answer that I really liked, something I’d never thought of before.

I started with my usual story that preregistration is great in two settings: (a) replicating your own exploratory work (as in the 50 shades of gray paper), and (b) replicating the work of others (as was famously done by Ranehill et al.) but that replication isn’t so easy in econ or poli sci because we can’t just send the economy into recession or start a new war just to get more data points. So in my own career I’ve actually only once done a preregistered replication.

The one place that preregistration is really needed, I said, is if you want clean p-values. A p-value is very explicitly a statement about how you would’ve analyzed the data, had they come out differently. Sometimes when I’ve criticized published p-values on the grounds of forking paths, the original authors have fought back angrily, saying how unfair it is for me to first make an assumption about what they would’ve done under different conditions, and then make conclusions based on these assumptions. But they’re getting things backward: By stating a p-value at all, they’re the ones who are making a very strong assumption about their hypothetical behavior—an assumption that, in general, I have no reason to believe.

Preregistration is in fact the only way to ensure that p-values can be taken at their nominal values. In that way, preregistration is like random sampling which, strictly speaking, is the only way that sampling probabilities, estimates, standard errors, etc., can be taken at their nominal values; and like randomized treatment assignment which, strictly speaking, is the only way that the usual causal estimates are valid.

Yes, you can do surveys and get estimates and standard errors without ever taking a random sample—I do this all the time, despite not having the permission of Buggy Whip President Michael W. Link—but to do this we need to make assumptions.

And, yes, you can do causal inference from observational studies—indeed, in many settings this is absolutely necessary—but, again, assumptions are needed.

And, yes, you can compute p-values without preregistering your rules for data collection, data cleaning, data exclusion, data coding, and data analysis—people do it all the time—but such p-values are necessarily based on assumptions.
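Here’s a minimal simulation sketch of what those assumptions are doing (the sample sizes and the particular post-hoc “forks” are purely illustrative assumptions, not taken from any real study): with pure noise, a single prespecified test rejects at the nominal 5% rate, while taking whichever of several reasonable-looking analyses gives the smallest p-value rejects far more often.

```python
# Illustrative only: the data are pure noise and the analysis "forks" are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 5000, 50
prereg_rejections = 0
forked_rejections = 0

for _ in range(n_sims):
    treat = rng.normal(size=n)      # no true difference between groups
    control = rng.normal(size=n)

    # Preregistered analysis: one t-test on the full data.
    p_prereg = stats.ttest_ind(treat, control).pvalue
    prereg_rejections += p_prereg < 0.05

    # Unregistered analysis: try a few "reasonable" variants after seeing the data
    # and keep whichever p-value is smallest.
    p_forks = [
        p_prereg,
        stats.ttest_ind(treat[np.abs(treat) < 2], control[np.abs(control) < 2]).pvalue,  # drop "outliers"
        stats.mannwhitneyu(treat, control).pvalue,                                       # switch to a rank test
        stats.ttest_ind(treat[: n // 2], control[: n // 2]).pvalue,                      # analyze a subgroup
    ]
    forked_rejections += min(p_forks) < 0.05

print("preregistered false-positive rate:", prereg_rejections / n_sims)      # close to 0.05
print("pick-the-smallest false-positive rate:", forked_rejections / n_sims)  # well above 0.05
```

Literally running all the forks and picking the best is a crude stand-in for the subtler point that the coding and analysis could have gone differently had the data been different; either way, the nominal 5% no longer describes the procedure that was actually followed.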

Just as a serious social science journal—or even Psychological Science or PPNAS—would never accept a paper on sampling without some discussion of the representativeness of the sample, and just as they would never accept a causal inference based on a simple regression with no identification strategy and no discussion of imbalance between treatment and control groups, so should they not take seriously a p-value without a careful assessment of the assumptions underlying it.

Or you can have random sampling, or you can have a randomized experiment, or you can have preregistration. These are methods of design of a study that make analysis less sensitive to assumptions.

Random sampling, random assignment, and preregistration can be a pain in the ass; they can be expensive; even when you do them there are often breaks in the plan such as nonresponse, non-compliance, or unexpected features in the data that require new, unanticipated decisions; and sometimes you can’t do them at all.

Random sampling, random assignment, and preregistration are not universal solutions, nor are they, in general, comprehensive solutions. What they are is a way to make certain inferences more robust and less reliant on untestable assumptions. Whether it makes sense to go through with these design steps depends on context.

So, yes, when someone tells me that preregistration is silly because it ties a researcher’s hands behind his or her back, I agree. Preregistration is costly. So is random sampling, so is randomized treatment assignment. These are costly and often not worth the trouble. And if you don’t want to do them, fine. But then show your assumptions. You make the call.

To my mind, the analogy between random sampling, random assignment, and preregistration is excellent. These are three parallel ideas, and to me it seems like just an accident of history that the first two are in every statistics textbook and are considered the default approach, whereas the third idea is only recently gaining popularity. Perhaps this has to do with analyses becoming more open-ended. Perhaps fifty years ago there were fewer choices in data collection, processing, and analysis—fewer “researcher degrees of freedom”—so the implicit assumption underlying naively computed p-values was closer to actual practice.

51 thoughts on “How is preregistration like random sampling and controlled experimentation”

  1. Here’s an anecdote where pre-registration didn’t work quite as planned. I was recently involved in a randomized controlled trial that had been pre-registered. Unfortunately, the language we used to describe certain aspects of the study in our pre-registration had an ambiguity that we did not recognize at the time. When we submitted our manuscript to a journal, a reviewer cross-checked our report with the pre-registration. He/she interpreted the ambiguous language differently from what we had intended and wrote a caustic review about how we were moving the goalposts. (In fairness, he/she also found real problems with our interpretation of our results, which we are now revising.)

    In the end, we were able to point up the ambiguous language and even demonstrate that we had stuck to our original intentions by showing what we had said in our grant proposal (written long before the pre-registration, and, luckily, clear on the disputed point). I’m not aware of other people having had this experience yet, but I imagine it will occur with appreciable frequency, as it’s sometimes hard to describe things completely unambiguously, and sometimes you don’t even realize that you have failed at that task. I can also imagine that some researchers will want to preserve as many “researcher degrees of freedom” as possible and deliberately write their pre-registrations ambiguously.

    • Maybe a solution is to actually write the code for your analysis before you get the data. Check that it works on fake data, then when you get the data from your study you just press ‘go’.

        • Unless you’re running a very complex design without much forethought, it’s not tougher. If you’re doing it right, you’ve piloted your experiment on at least a few people as part of debugging, so you have a few logs of the data lying around that are ‘fake data’. You can use those to generate genuinely fake data (by filling the results in with random numbers), or just take them as they are, and write your analysis script around them. If you do that with your final pilots (e.g., after you’ve worked out your design and data recording procedure), these look just like your real dataset will. I don’t often preregister publicly yet, but this procedure (write the analysis script based on pilot data) is my usual standard except in very exploratory situations.

          If you are running a very complex design without collecting a pilot subject or two first, then you’re probably not doing it right anyway.
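          For what it’s worth, a bare-bones version of this workflow might look like the sketch below (the column names, model, and file name are hypothetical placeholders, not from any actual study): generate a fake dataset with the structure the real study is expected to produce, and make sure the full analysis script runs end to end on it before any real data arrive.

```python
# Sketch of "write the analysis before the data arrive".
# Everything here (column names, model, file name) is a hypothetical placeholder.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_fake_data(n=100, seed=1):
    """Fake data with the same structure the real study is expected to produce."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "condition": rng.integers(0, 2, size=n),   # 0 = control, 1 = treatment
        "age": rng.integers(18, 65, size=n),
        "outcome": rng.normal(size=n),             # pure noise for the dry run
    })

def preregistered_analysis(df):
    """The analysis exactly as planned; run unchanged on the real data later."""
    return smf.ols("outcome ~ condition + age", data=df).fit()

if __name__ == "__main__":
    df = make_fake_data()                         # later: df = pd.read_csv("real_study.csv")
    print(preregistered_analysis(df).summary())   # the 'press go' step
```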

    • I’m definitely in favor of pre-registered studies, mostly due to the garden of forking paths / publication bias argumentation that Andrew and others have made.

      That said, now that I’m just starting on writing up my second one, I do have a nagging worry. What if my pre-registration plan was too sloppy, or I actually made a legitimate mistake in my analysis plan? For example, what if I said “I’m not going to get rid of any outliers” but then, someone in my study said they had 9000 alcoholic drinks last Monday (I study alcohol use)? By my pre-registered plan, I would now have to keep that obviously incorrect value in the data, increasing error in the analysis.

      I worry sometimes that pre-registered studies might actually have estimates that are more error-prone on average in some ways. The mechanism is simple: Carelessness in a rushed pre-registration plan might lock in bad decisions that cannot be fixed due to pre-registration, increasing error. Most scientists are rushed, time-starved, and eager to start collecting data before grant funding runs out, so they are likely to rush through the pre-registration stage. I am starting to think that the only reasonable solution is to have the pre-registered plan reviewed first before the study commences. But I’m not sure that’s always feasible either.

      • Having pre-registered several projects now (started about a year ago), I can see that these types of issues that Sean and Clyde describe can happen. I find the following simple solution very useful: deviate from the pre-registration and mention that you did.
        If you wrote you won’t delete outliers, and a person with 9000 drinks on Monday shows up in your datasets, show this to the reader, explain how your pre-registration now needs to be changed, and present the analysis with and without. I would suspect that most reviewers and readers would be comfortable with such behavior.

        I also object to saying that pre-registration “ties your hands” – you are still allowed to do all sorts of different “garden of forking paths” types of things, even if you are pre-registered. As long as you describe that you are deviating from the pre-registration, and the reader knows this, this is perfectly permissible. It draws a clear demarcation line for the reader between what was hypothesis-confirming and what was hypothesis-generating.

        • Yes, this is the standard that the COMPARE project used when reviewing clinical trials published in top medical journals (http://compare-trials.org). They flagged authors for not reporting what they said they’d report in the pre-registrations, and they flagged authors for reporting outcomes that weren’t pre-registered, but they didn’t flag any discrepancies from the pre-registration that the authors acknowledged in their published papers. This doesn’t seem at all restrictive to me, and I think authors would get used to it quickly if a “just acknowledge all discrepancies” policy were enforced. It should be easy and painless to comply with.

        • That’s a really good point, thanks! That keeps things transparent, without locking you into doing something dumb if you made a mistake during pre-registration.

        • 1. “As long as you describe that you are deviating from the pre-registration, and the reader knows this, this is perfectly permissible.”

          I’d add: “and you have a good reason for the deviation, and explain what it is and why it is a good reason.”

          2. Part of doing good work in any field is learning from your mistakes — and, where possible, learning from others’ mistakes. I would like to think that “I will not get rid of any outliers” would be something that would not appear in a preregistration — rather, “I will not delete outliers unless their value is unreasonable for the situation — e.g., …” Indeed, the advice I give on outliers (regardless of preregistration) is “Do not delete outliers just because they are outliers. Only delete them if you have a good reason to believe that they are recording errors. For example, if your data include height 7 ft for a man, that is an outlier, but within the realm of feasible heights for a man. But if your data include height 11 ft for a man — that is most likely a recording error, so it’s reasonable to delete it.”
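          As a concrete (and entirely hypothetical) version of such a rule, combined with the “present the analysis with and without” suggestion from earlier in the thread, a pre-registration could name an explicit plausibility threshold up front and commit to reporting both analyses; the threshold, variable names, and data below are invented for illustration.

```python
# Hypothetical sketch of a preregistered outlier rule: the cutoff and variable
# names are assumptions for illustration, not from any real study.
import pandas as pd
from scipy import stats

MAX_PLAUSIBLE_DRINKS = 100   # preregistered cutoff: larger values treated as recording errors

def report_with_and_without(df):
    """Run the preregistered comparison with and without implausible values."""
    flagged = df["drinks_reported"] > MAX_PLAUSIBLE_DRINKS
    for label, data in [("all data", df), ("implausible values excluded", df[~flagged])]:
        t = stats.ttest_ind(data.loc[data.group == "treatment", "drinks_reported"],
                            data.loc[data.group == "control", "drinks_reported"])
        print(f"{label}: n={len(data)}, t={t.statistic:.2f}, p={t.pvalue:.3f}")

# Made-up data including one obviously absurd entry (9000 drinks):
df = pd.DataFrame({
    "group": ["treatment"] * 5 + ["control"] * 5,
    "drinks_reported": [4, 7, 2, 9000, 5, 6, 3, 8, 4, 5],
})
report_with_and_without(df)
```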

        • Or even better – when it is possible, as in double-blind randomized clinical trials – go back to the person reporting the outlier and ask them to correct any seemingly absurd values, before you know which patient is in which treatment group. In a lot of ways confirmatory clinical trials for gaining approval of a drug are quite exemplary with respect to pre-specification of the analysis strategy prior to unblinding of the trial. Of course, that requires a huge operational overhead in terms of data management and statisticians who write extensively detailed analysis plans without having the data, and not everyone can realistically do this for small academic projects.

        • Yes, agreed with both points, Martha. I personally would still include something more iron-clad in the pre-reg (regarding outliers and everything else), precisely for the reason that it removes any possible insinuation of excluding / not excluding a datapoint that would make your results look better.
          In the end, pre-registrations are great, I highly recommend them, and if you need to deviate, you just need to be transparent about it.

        • I would tend to agree with Felix regarding pre-registered decision rules slanting towards more rigid rather than flexible.

          You write: “I will not delete outliers unless their value is unreasonable for the situation”, which is eminently reasonable. However, it also leaves the door open (even for well-intentioned, non-p-hacking researchers) to define “reasonable” after the fact in a way that is much more favourable to their current line of inquiry.

          If we use your example of heights, the decision seems obvious. But what about 7′-6″ or 8′-1″ or 8′-2″, etc.? A continuous variable can be cut off at many points, and all may rightly be presented as reasonable. Also, some cutoffs may be reasonable under one transformation but not another.

          An issue I see here is that the degrees of freedom are still quite high, but a study could be given greater weight because it is in complete conformance with the pre-registration.

          I think Felix is right when he says to make slightly bolder claims about data analysis (which will be based on prior knowledge about the field) and explain why you deviated.

        • To clarify: I wrote “I will not delete outliers unless their value is unreasonable for the situation — e.g., …” .

          This was intended as recommending something much more specific than just “I will not delete outliers unless their value is unreasonable for the situation” — but, of course, I could not give a one-size-fits-all description of what “unreasonable for the situation” means, since that depends on context. The “e.g. …” was intended to designate something carefully thought out for the particular context.

        • Felix: I totally agree, but when I was in clinical research that seemed to be fairly widely understood, with authors like David Cox and others writing on it.

          So this is just a note about differences in statistical practice in different areas of application.

        • That’s interesting Keith – I can only speak about my little academic corner (which is mostly non-clinical psych), and what we talked about here was neither widely understood nor practiced. But I am glad it was different in different fields.

    • Is this that much different from data problems? For example, drug test results, though they may be presented in tables, require much jiggering to fit them into the places in those tables, and that involves interpretations of what the study means and how it works, down at the level of arguing over whether this is what killed the patient. Similar issue at different levels of design.

  2. In game theory it is often noted that constraints can improve outcomes. The way they do this is by curtailing opportunistic behavior through precommitment. The value of the constraint is directly related to the level of social harm of opportunistic behavior.

      • I was teaching a game theory segment this week, and none of my students had seen Strangelove. They did, though, come up with the Chicken game, as represented in Footloose.

        To tie it all back together – if you pre-register your chicken game strategy, you better know that they know that you know that they don’t want to die. And they better know that you know that they know that you know. You know?

  3. I’ve been working as an editor of a conference dedicated to preregistered studies. We’ve just gotten the first draft of final reports, so I can’t talk about them yet, but the final set of approved proposals showed a number of benefits beyond p-values. One of the biggest is that authors were willing to invest a lot more time, effort and resources in gathering data. This led to larger samples, better measures, more power, all in the service of answering tougher questions. Add to this the fact that reviews and editorial oversight can improve a study a lot more before the data has been gathered!

    Our call for papers is here, with details on our process here.

  4. Andrew,
    Maybe I am too enamored with your recent paper on multiverse analysis but I assume that you would be happy (happiest?) with a preregistration of a multiverse analysis? That is, discuss plausible alternative specifications of how to analyze data and then present effect size estimates, inferences etc. for them all?
    It seems important to show the level of robustness of a particular inference to alternative data analytic approaches.
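    For what it’s worth, a preregistered multiverse could be sketched as a loop over the prespecified analysis choices, reporting the estimate from every combination rather than a single chosen one; the choices, column names, and model below are invented placeholders.

```python
# Hypothetical multiverse sketch: expects a data frame with columns
# "outcome", "treatment" (coded 0/1), and "age"; all specification choices are invented.
from itertools import product
import pandas as pd
import statsmodels.formula.api as smf

def multiverse(df):
    outlier_rules = {
        "keep all": lambda d: d,
        "drop top 1%": lambda d: d[d.outcome < d.outcome.quantile(0.99)],
    }
    covariate_sets = {
        "unadjusted": "outcome ~ treatment",
        "adjusted": "outcome ~ treatment + age",
    }
    rows = []
    for (rule_name, rule), (cov_name, formula) in product(outlier_rules.items(),
                                                          covariate_sets.items()):
        fit = smf.ols(formula, data=rule(df)).fit()
        rows.append({"outliers": rule_name, "covariates": cov_name,
                     "estimate": fit.params["treatment"],
                     "std_err": fit.bse["treatment"]})
    return pd.DataFrame(rows)   # one row per specification: the whole multiverse
```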

  5. – People will learn/use text blocks more. I.e. ‘Outlier removal is by method x, tests repeated with and without outliers.’
    – If the preregistration was sloppy, most likely the design was sloppy. ‘You mean I can’t actually answer question X with this design/data?’ is a common problem (with graduate students). At this point we either write off the hypothesis test part of the study, or via the garden of forking paths risk misleading the scientific community, potentially wasting even more money.
    – Either way, the study can still be used in an exploratory fashion, and changes to the analysis can be made. It’s just that researchers can’t lie, including to themselves, about what they planned to do and what they decided on after seeing the data.
    – For some grant proposals you actually have to write down all the (groundbreaking and yet totally predictable) studies you are going to do. If you can write that a year before you get money, you can write the preregistration the day before the study starts.

    • ‘Outlier removal is by method x, tests repeated with and without outliers.’

      This isn’t enough — the preregistration also needs to include a good reason why this choice was made. (Frankly, I’m skeptical that the method given here has a good justification.)

  6. I’ve always thought of preregistration as similar to robustness tests. Is that wrong? If I can show in a methodological appendix that my analysis is robust to a number of other specifications or measurement strategies doesn’t that also indicate that I have clean p-values?

    • Austin:

      I think the best solution is to get rid of p-values.

      But if you want p-values, you have to accept that the p-value is defined based on how the data coding, data exclusion, and data analysis would’ve been performed, had the data been different. That distribution, based on what the analyst would have done, had the data been different, is where the probability comes into the p-value.

      A robustness check is several different analyses done on the same data set. This is not the same as a statement of what would’ve been done, had that data been different.

    • Ben:

      In our preregistered replication we preregistered the whole thing: data, data model, prior distribution, computation.

      Again, I think preregistration has value, but it’s not the whole story. It generally makes sense to go beyond the preregistered plan and do more.

  7. I find this analogy quite useful. +1 for (finally) putting this issue in the proper framework for me. Scholars might be less apt to complain about pre-registration from this RCT perspective.

    With randomized-controlled trials, I usually tell students that all of the hard work is done upfront (at the design phase). Calculating average treatment effects is trivial. Pre-registration forces you to think about the data analysis upfront like you would have to do in an experiment. Very cool.
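    As a minimal sketch of that triviality (the variable names are made up, and this assumes complete randomization with no attrition), the estimate is just a difference in group means with a simple standard error:

```python
# Difference-in-means ATE estimate for a completely randomized experiment.
# "outcome" and "treated" are hypothetical numpy arrays; treated is coded 0/1.
import numpy as np

def ate_estimate(outcome, treated):
    y1, y0 = outcome[treated == 1], outcome[treated == 0]
    ate = y1.mean() - y0.mean()
    # Neyman standard error: conservative for the average treatment effect.
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return ate, se
```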

  8. I would love to see honest prereg in social science studies. Imagine: we are going to set up x trials, a number large enough to draw attention but way too small to eliminate ordinary variation, and then we’re going to over-emphasize tenuous relations so we can generate a finding that people can read minds kind of maybe sometimes but definitely within this confidence level.

    On a somewhat related level, I was reading about intelligence “key judgements”. I didn’t realize there are terms of art for that. The neat thing is it’s a way of putting on a label that subsumes and obscures the same kind of noisy guesswork involved in so many studies: they introduce two clear forms of bias to solve the problem of insufficient evidence. That is, they interpret the evidence according to the conclusion they have been asked to evaluate. This resolves noise and power, etc. to a conclusion they label “key judgement”.

    This gets at prereg too: the study design picks a route to a point and that route requires interpretation of each piece of evidence according to the route and its end. So it isn’t a genuine null – and I’m not convinced many claimed nulls are real – because the evaluation of the evidence occurs at an abstracted layer which is “prereg’d” in actuality to generate a route to this end. As I learned about this, I realized why we have so many intelligence failures: we seem to prefer storytellers who can make sense of routes that lead to ends that fit within the desired context.

    I also find it cool that this shows: a) how we generate largely unobservable pathways to a conclusion by interpreting a series of points according to the conclusion, meaning we interpret an abstraction layer away (or more) from points that could also fit in other contexts leading to other conclusions at any point (and generally over swathes of points), and b) the lack of understanding of how coin-flip interpretations/decisions are constrained by context. The last really fascinates me because we see how constraint imposes “down” and how that meets the “up” movement generated internally – meaning coin flips take meaning in context that changes externally from flip to flip – and how we confuse the “storyline” with the “data”, meaning we impose story on data without recognizing the line is a construct drawn on top of (or below, behind, etc.) data points (that themselves tend in complex situations to be abstracted lines). No wonder we get stuff so badly wrong! I was probably unclear in this paragraph and I apologize.

    • I meant to add it blows my mind how often people think they’re working on the actual data when nearly all mathematics and analysis is actually the creation of abstract versions of whatever is underlying. You can take a sea of freaking points and generate set theory and thus all of mathematics and then tell all sorts of stories about the freaking points. Some repeat they are constants. But most are patterns drawn on the sand before the tide rolls over the beach. It’s truly weird to me that people have barely grasped the concepts of set-theoretical limits, or even more basic statements of those ideas like “deconstructionism”, meaning the elimination of objective interpretation and the promotion of subjective and perspective-based narratives (and its unfortunate corollary: the war between perspectives as though one is objective). So much of the social science nonsense you criticise fails this basic concept: it’s a freaking slice and it can’t be objectively true when you have small n and there are data interpretation issues, etc.

  9. I understand the arguments for replication, sort of. I’ve done a few simulations and convinced myself that any summary of inference is off if you aggressively search for patterns in the data and then present an analysis that omits that you did that.

    But why isn’t there a method that is completely invariant to researcher behaviour? This method should take as input the data, assumptions about how the data is generated, and a claim. It then returns a numerical, interpretable measure of the extent to which (data, data generating assumptions) support the claim.

    Is Bayes that? With an applied Bayesian analysis, does it not matter at all whether researchers judiciously explored the space of researcher degrees of freedom but write as if they didn’t? It seems hard to believe. It seems you can probably always first search for a pattern, then work backwards and create an inference story, using frequentist or Bayesian methods, that has it coming out as strong evidence.

    • Bayes is exactly that. The data is input, the assumptions about how the data is generated are those that are explicitly stated (in the Stan code for example), and the claim is typically something about where in the parameter space the numerical values of the parameters used in the model are.

      The posterior probability is exactly the “numerical, interpretable measure of the extent to which (data, data generating assumptions, and pre-data assumptions about the parameters…) support the claim”

      the biggest issue is really the “goodness” of the assumptions. But, if you say “this is my model… and my data… and here is the posterior probability…” provided you didn’t make a computational error, no one should ever say “no that’s not the posterior probability…”

      it’s a mechanical process after the assumptions are made precise.
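      As a toy illustration of that mechanical process (the data and model here are invented, and a simple grid approximation stands in for something like Stan): once the data, likelihood, and prior are written down, the posterior probability of a claim about the parameter follows by computation.

```python
# Toy example of "mechanical once the assumptions are stated": binomial data,
# a uniform prior on theta, and the posterior probability of a claim about theta.
import numpy as np

successes, trials = 7, 20                       # the (invented) data
theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter space
prior = np.ones_like(theta)                     # stated prior: uniform
likelihood = theta**successes * (1 - theta)**(trials - successes)
posterior = prior * likelihood
posterior /= posterior.sum()                    # normalize over the grid

# The claim: theta < 0.5. Its posterior probability is just a sum over the grid.
print("P(theta < 0.5 | data, model) =", round(posterior[theta < 0.5].sum(), 3))
```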

      • While true enough in itself, this does not address anon’s overall concern, because the researcher has choices about the model and priors that can be made after the data are in. In the presence of some “publication filter,” the model will be chosen to enable publication.

    • “But why isn’t there a method that is completely invariant to researcher behaviour? ”

      Because we live in the real world. And because statistical inference is only appropriate when there is uncertainty (which we cannot get rid of.)

  10. When will we see pre-registration hacking? Don’t get me wrong, I’m all for pre-registration, but ambiguous language could also be used on purpose, and while we’re at it, why not pre-register different versions of the study/experiment (I could write a script to vary sub-groups) and thus increase my chances of finding something ‘significant’. But then again, why not simply fake the data?

    On a more serious note, I’ve had a reviewer insisting that I change the analysis I present in a paper to use a different coding scheme; I had worked assuming a linear statistical effect for (years of) education, and post hoc discovered (unanticipated) non-linearity (which I mentioned in the paper). The reviewer suggested that these results are more ‘interesting’ and ‘should’ be reported as the main results; I insisted on reporting my intended analysis and, on top of that, mentioning my post-hoc discovery. (I haven’t heard back from the journal yet.)

      The solution to “pre-registration hacking” is presumably public pre-registration. If you have to file your research protocol and analysis plan on some repository that shows visitors what you pre-registered, does not allow deletion, and tracks any changes, then we would all be able to see if you pre-registered 10 different versions of the same proposal. Even better would of course be writing and posting the programs as part of the final pre-registration, as well as the final data at the end…

        • but this assumes a single repository; not what we have. And I could pre-register studies and then claim I never received the funds to do them… I know, easier to fake the data.

  11. Indeed, random sampling is the great untalked-about problem in social science. The vast majority of studies use convenience samples, and thus do not adequately reflect the population they are studying.

  12. Claire: You’d be cheating and you’d know it. Your collaborators/co-authors might know it. The point is to have a procedure which, if followed correctly, increases confidence in your results and prevents p-hacking by a series of seemingly reasonable and justified steps you take after you’ve seen the data.

    • Of course. (and as I wrote, not what I suggest) I was just thinking that a technical solution like pre-registration can still be circumvented, even though it’s so much better than what we currently do. The key difference is indeed what you mention: you’d necessarily know that you’re cheating, whereas with current p-hacking many researchers don’t seem to understand what the problem is.

  13. > These are methods of design of a study that make analysis less sensitive to assumptions.
    Fisher wrote something late in his career along the lines that in his early work in design he had been a bit of a fool, focusing primarily on efficiency when the true value of design was really to make analysis less sensitive to assumptions (or make the assumptions more likely to be less wrong).

    One more thing for you to write up ;-)
