
I am (somewhat) in agreement with Fritz Strack regarding replications

Fritz Strack read the recent paper of McShane, Gal, Robert, Tackett, and myself and pointed out that our message (abandon statistical significance, consider null hypothesis testing as just one among many pieces of evidence, recognize that all null hypotheses are false, at least in the fields where Strack and I do our research, and don’t use significance testing as a scientific decision rule) is consistent with the holistic message in Strack’s paper, “From Data to Truth in Psychological Science. A Personal Perspective,” where he writes things like, “unlike propositions, effects have no truth values” and “irrelevant factors (e.g., the context, person characteristics) determine the strength of an effect. As a consequence, its size is not a constant characteristic but is codetermined by the contextual and personal influences. Moreover, its size may vary over time.”

This is noteworthy because the last time this paper by Strack was discussed on this blog it was in a negative way (“Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories”). So, as Strack correctly points out, it’s a bit rich to see me (appropriately) criticizing null hypothesis testing today, given that a few months ago I was using hypothesis tests to declare that certain findings did not replicate.

Strack is saying that the reports of non-replication of his study are non-rejections of hypothesis tests, and we shouldn’t overinterpret such non-rejections. A non-rejection tells us that a certain pattern in data is in a class of patterns that could be explained by the random number generator that is the null hypothesis. But who cares, given that we have little interest in the null hypothesis (of zero effect and zero systematic error) in the first place? As Strack puts it, and I agree, just because a pattern in data could be explained by chance, that doesn’t mean it’s so useful to say the pattern is explained by chance.

And, at a rhetorical level, I was endorsing the use of a p-value (in this case, a non-rejection) to make a scientific decision (that Strack’s work could be labeled as non-replicable).

Strack’s point here is similar to Deborah Mayo’s criticism of critics of hypothesis testing, that we are aghast at routine use of p-values in research decisions, but then we seem happy to cite data from hypothesis tests to demonstrate replication problems. Mayo has pointed to instances when critics use hypothesis tests to reject the hypothesis that observed p-values come from some theoretical distribution that would be expected in the absence of selection, but I think her general point holds in this case as well.

I agree with Strack that “non-replication” is a fuzzy idea and that it’s inappropriate to say that non-replication implies that an effect is not “true.” Indeed, even with power pose, which has repeatedly failed to show anticipated effects in replication studies, I don’t say the effect is zero; rather, I think the effects of power pose vary by person and situation. I don’t see any evidence that power pose has a consistent effect of the sort claimed in the original papers on the topic (notoriously, this stunner, which I guess was good enough for a Ted talk and a book contract: “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications.”)

Getting back to Strack’s experiments: I don’t think the non-statistically-significant results in replications imply that there is no effect. What I do think the totality of the evidence shows (based on my general impression, let me emphasize that I’ve not analyzed any raw data here) is that, again, there’s no good evidence for any consistent effects of Strack’s treatments. And, beyond this, I’m skeptical that Strack’s designs and data collections are sufficient to learn much about how these effects vary, for reasons discussed in section 3 of this paper.

In short: high variation in the effect plus high variation in the measurement makes it difficult to discover anything replicable, where by “replicable” I mean predictions about new data, without any particular reference to statistical significance.
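To make this concrete, here is a minimal simulation sketch in Python. It is not an analysis of any real study: the function names, the sample size, and every numeric setting are assumptions chosen only for illustration. It asks how often an original study and a replication even agree on the sign of the estimated effect, first when the effect is small and variable and the measurement is noisy, then when the effect is larger and stable and the measurement is precise.

```python
import random
import statistics


def run_study(rng, n_per_arm, mean_effect, effect_sd, noise_sd):
    """Simulate one two-arm study and return its estimated treatment effect."""
    # The true effect in this particular study differs from the average effect.
    true_effect = rng.gauss(mean_effect, effect_sd)
    control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_arm)]
    treated = [rng.gauss(true_effect, noise_sd) for _ in range(n_per_arm)]
    return statistics.fmean(treated) - statistics.fmean(control)


def sign_agreement(rng, n_pairs, **study_settings):
    """Fraction of (original, replication) pairs whose estimates share a sign."""
    agree = 0
    for _ in range(n_pairs):
        original = run_study(rng, **study_settings)
        replication = run_study(rng, **study_settings)
        agree += (original > 0) == (replication > 0)
    return agree / n_pairs


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical settings: small, variable effect with noisy measurement...
noisy = sign_agreement(rng, 2000, n_per_arm=20, mean_effect=0.05, effect_sd=0.2, noise_sd=1.0)
# ...versus a larger, stable effect with precise measurement.
precise = sign_agreement(rng, 2000, n_per_arm=20, mean_effect=0.5, effect_sd=0.05, noise_sd=0.2)
print(f"noisy design:   sign agreement = {noisy:.2f}")
print(f"precise design: sign agreement = {precise:.2f}")
```

Under these made-up settings the noisy design should agree on the sign only a little more often than a coin flip would, while the precise design should agree nearly every time. No significance test appears anywhere in the sketch; replicability here is only about whether one study predicts the next.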

So: for general reasons of statistical measurement and inference I am skeptical of Strack’s substantive claims; I suspect that the data from his experiments don’t provide useful evidence regarding those claims. Similarly, there is a limit to what can be learned from the replication studies if they were conducted in the same way.

What about replication studies? I’ve been skeptical about replication studies for a long time, on the general principle that if the original study is too noisy to work, the replication is likely to have similar problems. That said, I do see value in replication studies—not so much as a strategy for scientific learning but as a motivator for researchers. Or, maybe I should say, as a disincentive for sloppy studies. Again, consider power pose: That work was flawed in a zillion ways, some of which are clear from reading the published article and others of which were explained later in a widely-circulated recantation written by the first author on that paper. But it’s hard for a lot of people to take seriously the claims of outsiders that a research project is flawed. Non-replication is a convincer. Again, not a convincer that the effect is zero, but a convincer that the original study is too noisy to be useful.

So, replication studies can play a sort of “institutional” role in keeping science on track, even if the particular replication studies that get the most attention don’t actually give us much of any useful scientific information at all.

Strack cites Susan Fiske, Dan Schacter, and Shelley Taylor who “point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects.” Maybe. Or, I should say, yes but only if the studies in question—both the original and the replication—are targeted enough and accurate enough to answer real questions. For example, that ovulation-and-clothing study was hopeless—any consistent signal is just tiny compared to the noise level—and replications of the same study will just supply additional noise (which, as noted above, can be valuable in confirming the point that the data are too noisy to be useful, but is not valuable in helping us learn anything at all about psychology or behavior).

So, in some ideal world there could be some sense in that claim of Fiske, Schacter, and Taylor regarding replications being an opportunity to find limiting conditions and contextual effects. But in the real world of real replication studies, I think that’s way too optimistic, and the lesson from a series of unsuccessful replications is, as with power pose, that it’s time to start over and to rethink what exactly we’re trying to learn and how we want to study it.

Finally, I agree with Strack’s point that theory is useful, that psychology, or the human sciences more generally, is not merely “a collection of effects and phenomena.” The place where I think more work is needed is in designing experiments and taking measurements that are more precise and more closely tied to theory, also doing within-person comparisons as much as possible, really taking seriously the idea of directly measuring the constructs of interest, and tying this to realism in experimental conditions.

53 Comments

  1. Mayo says:

    It is not surprising that those who declare we should abandon the testing of statistical differences (in distinguishing genuine from spurious observed effects) wind up losing the capability to criticize the very inquiries they wish to, and should, criticize–by employing the error statistical reasoning behind statistical tests. That is to lose the central critical tool of science, in my view. Failed replications, coupled with other critiques of inquiry, do permit inferring, reliably, to the lack of a genuine phenomenon. Were that not the case, we could never show effects are absent, nor could we criticize theories or inquiries as unscientific, nor could we statistically falsify. (In many fields we go further and affirm null effects, as with the equivalence principle in general relativity, and many others.) Even if one grants all hypotheses H are false* it doesn’t mean you always have evidence of a genuine anomaly for H. Otherwise, you can readily claim to have evidence of contradictory claims. You’d reject as false both mu ≥ mu’ and that mu < mu’. And if you fail to produce evidence of an effect, despite methods having a high capability of finding it, if it is present, then to claim it is genuine is falsified.

    However, it’s worth noting that the prevalent criticisms of significance tests, and the ones at the heart of today’s brouhaha, depend on a reasonably high prior probability for a point null hypothesis. So it follows from Gelman’s taking them never to be true, that none of those criticisms have weight! And surely, one would need to abandon the goal of finding probable hypotheses–though in this Gelman and I agree.

    *We should not infer from the fact that a null hypothesis may, with large enough sample size, be falsified, IF it’s false, to supposing all are false. I don't know if Gelman's (all false) claim is limited to point nulls.

    (I happen to notice this very late at night; I hope this is not too unclear.)
    My “p-values can’t be trusted except when used to argue that p-values can’t be trusted” post is here:
    https://errorstatistics.com/2013/06/14/p-values-cant-be-trusted-except-when-used-to-argue-that-p-values-cant-be-trusted/

    • Andrew says:

      Mayo:

      1. I agree that failed replications can reveal a problem: at the very least, we have a phenomenon that is not reliably measurable using a particular measurement protocol (taking the entire study as a sort of “measurement”). I don’t think that a failed replication proves, or even implies, that the underlying phenomenon does not exist. Consider the work of Kanazawa on the “generalized Trivers-Willard effect.” Kanazawa’s analyses were based on such noisy data that all his empirical results are essentially useless. And any preregistered replications would be essentially pure noise, meaning that with a simple classical analysis there’d be a 5% probability of getting “statistical significance” just by chance, and with a reasonable Bayesian analysis there’d be essentially zero probability of coming up with a 95% posterior interval that excludes zero. Nonetheless, I don’t think this means that the generalized Trivers-Willard effect doesn’t exist! Actually, I’m pretty sure the generalized Trivers-Willard effect does exist, however I could well imagine the sign of this effect being positive in some settings and negative in others—and if I wanted to say anything much about its magnitude, I’d need to do some biology; no amount of fiddling with Kanazawa’s data or the data from hypothetical replication studies would be of help.

      Or, for another example, suppose you try to measure the speed of light by using Einstein’s formula and burning a cookie on a kitchen scale. It just won’t work: the measurement is too noisy, and if you were to get lucky and find a result, there’s no reason you’d expect it to show up again under a preregistered replication. But that doesn’t mean there’s nothing happening; this is just too noisy an experiment to give any valuable data on the question.

      2. You speak of “today’s brouhaha.” I’d prefer not to use the term “brouhaha,” which is defined as “a noisy and overexcited reaction or response to something.” Replication is an important issue, and I’m writing about this in a measured and serious way, as are you and as are others in the comment thread. I don’t think this post or this thread is “overexcited”; I think we’re giving a serious treatment to an important topic.

      3. I don’t think that point hypotheses are never true; I just don’t find them interesting or appropriate in the problems in social and environmental science that I work on and which we spend a lot of time discussing on this blog. Here’s what I wrote in a recent thread:

      There are some problems where discrete models make sense. You [another commenter] give the example of a physical law; other examples are spell checking (where, at least most of the time, a person was intending to write some particular word) and genetics (to some reasonable approximation). In such problems I recommend fitting a Bayesian model for the different possibilities. I still don’t recommend hypothesis testing as a decision rule, in part because in the examples I’ve seen, the null hypothesis also bundles in a bunch of other assumptions about measurement error etc. which are not so sharply defined.

      4. As I’ve written many times, I do think there can be value in hypothesis tests. Finding out that a particular pattern of data is consistent with a particular random number generator (that is, the “null hypothesis” of zero effect and zero systematic error) can be of interest, even if that random number generator isn’t itself plausible. Here’s an example from a small problem I worked on once. The mistake would be to consider the nonrejection as evidence that the random-number-generator model is true. Rather, the nonrejection reveals to us that the data are not sufficient to rule out that model.

  2. Chris Crandall says:

    However controversial this post may be, the final paragraph of desiderata is surely correct. The combination of well-measured construct, made close to theory, in well-designed and well-powered studies is surely where we should spend most of our time, and should contribute the most to progress.

  3. Imaging guy says:

    “If you’re going to reject a different study that contradicts your original study because its conditions were almost but not quite exactly the same then it takes a bit of chutzpah to say your study has implications for people outside a lab, where there is far, far less control over conditions.” (A comment made by someone on Slate)

    • Tom Passin says:

      This was so well put that it’s hard to add to it, but it can be turned around: If your study has not been repeated with different conditions, you cannot know whether you have found anything universal, or simply limited to your one experiment with its particular conditions and participants. This is even more forcefully the case when your effect is not supported by a reliable theoretical framework.

      • Andrew says:

        Tom:

        Good point, and it’s another way that I agree with Strack on the general point (even if not on some of the implications for his particular studies) that there is a limit to what can be learned from pure replications.

      • Keith O'Rourke says:

        I perhaps should have been more explicit about that in this “anticipating how the assessment of replication (as other studies are done) is to be properly considered”.

        Design the required variation into the ensemble of studies – not too much too soon nor too little too late ;-)

        Some co-operating clinical trial groups do try to do some of that.

      • Wayne says:

        Might replication, in some sense, be a physical-world analog to cross-validation? If so, is some level of variation in the experiment — which is probably inevitable for something done by other than the original researchers in the original environment — actually a good thing, giving us a distribution of outcomes rather than a point?

        • a reader says:

          Wayne:

          I would argue that replication is a much stronger criterion than standard cross-validation. With cross validation, you know that your “out-of-sample” data is truly from the same population. With replication, unless you have way too much faith in your model, you should realize that you are sampling from a somewhat similar population, but there must be some differences. This is generally good: you want to look for effects that are relatively stable across small changes in the population.

          In fact, the recent conversations about making inferences from faces got me thinking that a better version of cross validation might be this: if you have data from, say, 10 different data sources, your cross validation sampling scheme leaves out entire data sources. This is very similar to the concept of “cluster bootstrapping”. But I think I recall that shortly after having that idea, I read some applied paper that was already doing something like this, so this may already be an established method.
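The leave-out-whole-sources idea in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `leave_one_source_out`, `fit_mean`, and `squared_error` are hypothetical names, the "model" is just a mean, and the two-source toy data are invented.

```python
import statistics


def leave_one_source_out(data_by_source, fit, loss):
    """Hold out each entire data source in turn; return the mean test loss per source."""
    results = {}
    for held_out, test_rows in data_by_source.items():
        # Train on every row from every source except the one held out.
        train_rows = [
            row
            for source, rows in data_by_source.items()
            if source != held_out
            for row in rows
        ]
        model = fit(train_rows)
        results[held_out] = statistics.fmean(loss(model, row) for row in test_rows)
    return results


def fit_mean(rows):
    """Toy 'model': predict the training mean for everything."""
    return statistics.fmean(rows)


def squared_error(model, row):
    """Squared prediction error of the toy model on one observation."""
    return (model - row) ** 2


# Invented toy data: two sources with different baselines.
toy = {"lab_a": [1.0, 1.0], "lab_b": [3.0, 3.0]}
losses = leave_one_source_out(toy, fit_mean, squared_error)
print(losses)
```

Because each held-out fold is a whole source, the test loss includes between-source differences that ordinary random folds would mix into the training data. That is the sense in which this scheme is closer to a replication than to standard cross validation.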

        • Keith O'Rourke says:

          As David Cox once put it – you get a sample from the distribution of varying effects but whether the center and spread of that distribution is of scientific interest depends on what is driving the variation. If it’s methodological/quality variation – neither is of much interest; if it’s indeterminate biological variation – mostly just the spread; and if it’s determinate/designed variation – both (this actually goes back to Fisher’s 1935 book).

        • Joachim says:

          I believe so. We did exactly that here: https://osf.io/g84py/

          Bonus: Stan!

          • Keith O'Rourke says:

            Thanks – this does look very promising – definitely a way to go forward in my view.

            Not sure if you are pointing to same material in Fisher’s 1935 book that I was (see excerpt on that material below).
            Also, if you are not already aware, you might wish to look at this new paper from Senn – Sample size considerations for n-of-1 trials. http://journals.sagepub.com/doi/10.1177/0962280217726801

            “Fisher’s discussion was cast in the context of a particular example in a chapter entitled “Comparisons with Interactions — cases in which we compare primary effects with interaction”. In my opinion this is an extraordinarily brilliant and modern (even if somewhat vague at times) discussion of most of the critical issues in meta-analysis of randomized trials. Amongst other things, Fisher pointed out that treatments (often fertilizers) might react differently to different types of soil but that often we want to ascertain that a treatment was not merely good on aggregate of fields used but fields suitable for treatment. Where as in astronomy random effects were used to make allowances for varying amounts of measurement error, Fisher realized that in agricultural trials it was more likely that treatment effect varied. This quite different in that if you were merely interested in the aggregate of fields used – the variation need not be allowed for – were as with varying amounts of measurement error (i.e. “constant day error” or “constant study error”) it always needs to be allowed for. Some recent discussions of random effects seem to have missed this point.” from https://www.researchgate.net/publication/269005091_Meta-analytical_themes_in_the_history_of_statistics_1700_to_1938

            • Martha (Smith) says:

              “Where as in astronomy random effects were used to make allowances for varying amounts of measurement error, Fisher realized that in agricultural trials it was more likely that treatment effect varied.”

              Nice quote making an important point.

  4. Keith O'Rourke says:

    First, I should have read Strack’s paper in the original negative post – nice discussion of the inevitability of time/place variation in research and how the evaluation of replication has been poorly thought through and badly implemented.

    More importantly, I think we need more than the combination of well-measured construct, made close to theory, in well-designed and well-powered studies.

    The meta-analytical thinking needs to be done upfront and applied to the very first study (which primarily should focus on whether another study should be done and, if so, how), anticipating how the assessment of replication (as other studies are done) is to be properly considered.

    The mantra I always tried to get across in my meta-analysis publications was “first assess replication to see if it’s adequate to support the combined inference for something taken to be common between the studies – then take into account the remaining real uncertainties (e.g. as reflected in the time/place variation).” Now exceptionally well designed and implemented RCTs may well support the shrinkage of the time/place variation to zero – but that argument needs to be credibly made.

    Folks can develop better methods (I prefer likelihood plots) but a forest plot of effect estimate confidence/credible intervals usually would give a good indication of replication (those gamed noisy studies Strack worries about would just show up as wide uninformative intervals). Some people must be doing that?
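For readers who have not seen one, a forest plot of the kind mentioned in the comment above can be roughed out as plain text. The sketch below is an assumption-laden illustration: the study names, estimates, and standard errors are invented, the intervals are simple normal-approximation 95% intervals, and `interval_95` and `text_forest_plot` are hypothetical helper names.

```python
def interval_95(estimate, std_error):
    """Normal-approximation 95% interval: estimate plus or minus 1.96 standard errors."""
    half_width = 1.96 * std_error
    return estimate - half_width, estimate + half_width


def text_forest_plot(studies, lo=-2.0, hi=2.0, width=41):
    """One line per study: '-' spans the interval, 'o' is the estimate, '|' is zero."""

    def column(x):
        # Clip to the plotting range, then map to a character column.
        x = min(max(x, lo), hi)
        return round((x - lo) / (hi - lo) * (width - 1))

    lines = []
    for name, estimate, std_error in studies:
        low, high = interval_95(estimate, std_error)
        row = [" "] * width
        for c in range(column(low), column(high) + 1):
            row[c] = "-"
        row[column(0.0)] = "|"
        row[column(estimate)] = "o"
        lines.append(f"{name:<12}" + "".join(row))
    return lines


# Invented numbers: one noisy original study and two tighter replications.
studies = [
    ("original", 0.80, 0.40),
    ("replic. 1", 0.05, 0.15),
    ("replic. 2", -0.10, 0.20),
]
plot_lines = text_forest_plot(studies)
print("\n".join(plot_lines))
```

A noisy study shows up as a long run of dashes, which is the "wide uninformative interval" the comment describes. The plot says nothing about why the estimates differ; that still needs the kind of upfront thinking about time/place variation discussed above.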

  5. Frustrated Fred says:

    I don’t even know where to start.

    How about this: I think there may be a big problem with the way that psychologists conceptualize their research questions. In particular, psychologists who view themselves as “basic scientists” often conceptualize their research questions in terms of discovering whether an effect exists or not (H0=zero effect, H1= any effect size greater than 0). Then, they find “evidence” for their effect (p<.05), but no one else can replicate it, and some even find trends in the opposite direction of the original finding. The basic scientist then argues that the effect is too small and contextually sensitive to be easily replicable, but that this does not mean that the effect doesn't exist (H0). The task then becomes a search and rescue mission, trying to dig a reliable effect out of hopeless numbers of potential moderators.

    But all of this is nonsense. The real problem is that it is impossible to find unambiguous support for H0=0 because there is always going to be some difference, some "effect", that is due to some causal factors that could, potentially, be identified (and hey who knows, one of those causes might actually be the thing that you're interested in). The mind boggling thing about all of this is that there are so many apparently smart people who fail to recognize that the alternative hypothesis "any difference greater than zero" is simply not viable if the goal is to create a cumulative science. Cumulative science (defining science as the generation of knowledge) depends, fundamentally, on robust, replicable effects that allow us to make predictions about the future. This is an impossible task if we are going after effects that are so small that they cannot be reliably distinguished from sampling error. Strack says that going in the direction of trying to find larger and robust effects will be bad for psychology because "it may shift the field into a more applied direction and away from theoretical innovation". But I'm afraid that I disagree deeply with this sentiment. What we need, desperately, is to go in the direction that Strack calls "applied". What exactly is "theoretical innovation" if it is based on experiments for which the null hypothesis can never be accepted? I don't see any point in doing experiments at all if the conclusions are predetermined by the question itself (H0=0 can never be true, so, H1 wins!!–am I famous yet??).

    • I so want to get Gilbert Shelton to do a comic called “Frustrated Freddy’s Cat: A cartoon guide to HARKing”

    • Erikson says:

      Although there are some notable exceptions, I would urge the non-psychologist reader to check exactly what passes for “theory” and “measurement” in most psychological studies.

      For theory, there is hardly any formalization, at all. Meehl (I remember from those videos: http://meehl.umn.edu/recordings/philosophical-psychology-1989) argues that, although we can’t derive precise values for most phenomena in Psychology, we could, at least, derive lower and upper bounds or some sketchy functional form to test a theory with at least some severity. Most of the time, though, it’s pretty far from it: some qualitative intuition is hastily transformed into a statistical hypothesis so the researcher can apply all those canned procedures, as pointed out by Frustrated Fred, without much consideration for all the contextual problems pointed out in the recent paper about demoting p-values.

      For measurements, there are all sorts of cool latent variable and item response theory models used to evaluate psychometric scales, but most scales are so farfetched that we could hardly claim they are a good approximation to the variables of interest. How realistically can we infer about subtle changes in funniness perception from a ten-point ordinal scale? How well can it discriminate between similar phenomena, like “funniness” and “amusement”? Having worked with those not-well-defined scales, I can tell you they usually correlate with all sorts of other variables and usually have a hard time discriminating anything at all.

    • Anonymous says:

      “I don’t even know where to start”

      I have asked another Fred to write an accessible paper about “theory” and such things in the past, and publish it in a “popular” Social Psychology journal so it maximizes the chances of getting a lot of exposure, and attention, and hopefully maximizes the chance of it being useful (also see my wish for something like this in my posts below).

      I don’t know if you are the same Fred, but regardless i think it might be very useful at this point in time for someone to explain and write about the types of issues you mention in an accessible manner.

      If i were smart enough i would start with that, but i am not. I hope you are, and you will.

  6. Seth says:

    “unlike propositions, effects have no truth values” is totally nonsensical when your propositions and/or the evidence supporting them is entirely the size and direction of effects.

    • Anonymous says:

      I have read these and other sentences in the Strack paper multiple times to try and understand them. I have a hard time doing so. The sentence you picked out also stood out for me.

      I also have the feeling that some psychologists, like Strack, use “theory” to 1) try and rescue “failed” replications, and/or 2) try and rescue their conclusions made in past papers. I looked up the word “proposition” and synonyms for it included “theory” and “hypothesis”. I am not smart enough to figure it all out exactly, but i wonder if what you said applies to “theory” as well. More specifically, if a “theory” is largely 1) based on found experimental “effects” and/or 2) is/should be evaluated based on found experimental “effects”, then i don’t think it’s valid to separate “effects” and “theory” in psychology in the way i understand Strack (and others) are currently doing.

      I wish someone would write more about these issues in mainstream psychology journals, and make them as clear and simple as possible.

    • Martha (Smith) says:

      The quoted sentence did make sense to me:

      A proposition (in the usual mathematical/logical use — which is how I interpreted it in the context of the quote) does have a truth value: The proposition is either true or false. (Of course, we might not know which option the truth value is).

      An effect is not a proposition; a proposition needs (by the definition I am used to – presumably because I am a mathematician) to be a statement that can be either true or false; e.g., “The effect is 2” or “The effect is greater than 0” are propositions.

    • Anonymous says:

      @ Martha:

      Thank you for your response! I am still trying to understand. From Strack’s paper:

      “As predicted, the cartoons were rated to be funnier if the pen was held between the teeth than between the lips. The effect was not strong but met the standard criteria of significance.”

      It seems to me that when he talks about “the effect” he means that cartoons were rated to be funnier if the pen was held between the teeth than between the lips. If i understood your definition correctly this is a statement that can be true or false, and can therefore be seen as a “proposition” not an “effect”(?)

      So what exactly does Strack mean when he talks about “propositions” and “effects” in the quote “unlike propositions, effects have no truth values”?

      I have the feeling that he talks about “effects” as being experimental findings based on some sort of hypothesis/theory: in this case “cartoons were rated to be funnier if the pen was held between the teeth than between the lips”. And i have the feeling that he views that hypothesis/theory as a “proposition”: in this case “facial expressions may affect emotional experiences”.

      I have expressed my wish for someone to write an accessible paper about “theory” in psychology, and this could be an example of why i think this might be important. I don’t understand what Strack means when he talks about “effects”, which to me sound really like “propositions”, which in turn sound really like “hypotheses”.

      I have yet to come across a paper which explains these things clearly, and i think this could be very useful at this point in time. What exactly is an “effect”, “hypothesis”, “theory” in psychological science and how do they relate to each other? And how do experiments, published findings, etc. in turn relate to these terms? And how should psychological science be performed in order to align with the meaning/implications of these terms, and maximize the usefulness of these terms with regard to the scientific process and its progress?

      Without further information, I still agree with the poster above when he states that “unlike propositions, effects have no truth values is totally nonsensical when your propositions and/or the evidence supporting them is entirely the size and direction of effects”.

      • I suspect what he means by “the effect” is something like “the amount by which doing X causes Y to change”. In other words, it’s a number on some scale; it has kind of made-up dimensions; it’s in the same class of things as “the length” and “the mass”; and a proposition would be something like “the effect was 2”.

        • Martha (Smith) says:

          I think that what Anonymous’s reply has shown is that “the effect” is interpreted different ways by different people. From the strict statistical point of view, “effect” is a specific number (i.e., a specific value of some parameter.) Hence, what Andrew often says: “The effect is variable.” But I suspect that many people without a thorough understanding of statistical foundations do indeed consider “funnier” as an effect. I’m not saying they are wrong, just that this is not the definition of “effect” used in the reasoning of statistical inference. In particular, I now realize (after this discussion) that this confusion of two meanings of “effect” is something we need to emphasize more in teaching and talking about statistics. So thanks to Anonymous for bringing this up and pressing on it. I’ve learned about a confusion/misunderstanding that I wasn’t previously aware of.

  7. Ulrich Schimmack says:

    A single p > .05 doesn’t prove that there is no effect.

    A credible finding should replicate with p .05 findings and publication bias. This does not mean that the effect is not there (the effect size could be very small, but in the predicted direction), but there is no reason to claim that smiling can make you happy based on the existing evidence.

    https://replicationindex.wordpress.com/2017/09/04/the-power-of-the-pen-paradigm-a-replicability-analysis/

    • Anenoeuoid says:

      A single p [greater than] .05 doesn’t prove that there is no effect.

      Neither does one million of them…

    • Andrew says:

      Ulrich:

      It’s not just that the effect size could be very small, but in the predicted direction. The effect size could also be highly variable, positive in some settings and negative in others, and measured in a noisy way so that an experiment provides very little information about it. In statistics textbooks there’s a lot of talk about randomization, blindness, etc., and recently in the psychology literature there’s been lots of talk about replication—and all these things are important, but if someone is studying a poorly conceptualized, highly variable phenomenon with noisy measurements, then not much can be learned. Hence the final paragraph of my post above.

    • Fritz Strack says:

      We have never claimed that smiling should make you happy. Instead, we have tested the hypothesis that facial actions (in this case smiling) may affect emotional judgments without inferences that are based on the meaning of this action (in this case “smiling”). Our positive finding is one empirical building block in a broader examination of the facial feedback hypothesis. I fully agree that from an applied perspective, the pen procedure is not effective enough to be used as a treatment. From a theoretical perspective, however, it is diagnostic to tell apart different underlying mechanisms.

  8. Guive says:

    I don’t really understand this argument that it is incoherent to say that P-values shouldn’t be our main tool for statistical inference based on evidence from P-values. If it was shown that thermometers allowed an enterprising researcher to publish spurious results that would often fail direct replication, that would be a reason for replacing thermometers with some other tool. Why are P-values different?

    • Suppose people rely on thermometers to diagnose illness via fever. Suppose that there are huge families of illnesses that don’t cause fever. Now, suppose doctors refused to believe you were sick unless you did have a fever. Is this the thermometer’s fault? Should we stop reporting high and low temperatures for the weather because “temperature doesn’t really mean anything”?

      What I think is true is that a p value has a certain use, and in practice this is almost never how people use it. The actual use is to take a well calibrated model based on real data, and determine whether some new data would be unusual to get from that calibrated model. If p is small, then you can proceed to find a different model for that new data. If p is not small, then you can make a decision about whether you want to treat the new data “as if” it came from the original model, or you want to build a new model. A not-small p value doesn’t tell you “this did come from the base model”; it just says “this *could have* come from the base model.”
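As an illustration of the use described in the comment above, here is a minimal sketch, assuming a normal model fitted to some calibration readings. The readings, the function name `normal_tail_p`, and the choice of a normal model are all hypothetical, and the sketch treats the fitted mean and standard deviation as known, which ignores estimation uncertainty.

```python
import math
import statistics

def normal_tail_p(calibration, new_value):
    """Two-sided p-value for new_value under a normal model fit to calibration data."""
    mu = statistics.mean(calibration)
    sigma = statistics.stdev(calibration)
    z = (new_value - mu) / sigma
    # P(|Z| >= |z|) for a standard normal, via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical calibration readings (made-up body temperatures in degrees F)
calibration = [98.1, 98.4, 98.6, 98.7, 98.2, 98.9, 98.5, 98.3, 98.6, 98.8]

p_typical = normal_tail_p(calibration, 98.7)   # could have come from the base model
p_unusual = normal_tail_p(calibration, 101.5)  # unusual under the base model

print(f"p for 98.7:  {p_typical:.3f}")
print(f"p for 101.5: {p_unusual:.2e}")
```

The not-small p for the first reading says only that it could have come from the calibrated model, not that it did.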

      • Guive says:

        I guess I see your point.

        Maybe the argument is more like: if you don’t regard evidence based only on P-values as strong support for an effect, you shouldn’t regard evidence based on P-values as proof that the effect isn’t really there, you have to be consistent about it. That seems fair enough but it doesn’t in practice affect the value of PPNAS or Psychological Science type research. You go from saying that failed replications show an effect isn’t there to saying that failed replications show we have no particular reason to believe that the effect is there. And I think Andrew does sometimes talk about these power pose or whatever in this way.

        • Right, the noisy power-posey / beautiful people sex ratio / fat arms and voting / etc type study doesn’t really provide much information about anything. Just as sitting around in your office and saying “hey I wonder if embodied cognition might affect people’s choice of underwear, let’s roll 3d20 and find out” isn’t very informative.

  9. Anonymous says:

    Another part of Strack’s paper that puzzles me:

    “However, there is an important difference between these two biases in that a positive effect can only be obtained by increasing the systematic variance and/or decreasing the error variance. Typically, this requires experience with the subject matter and some effort in controlling unwanted influences, while this may also create some undesired biases. In contrast, to overturn the original result, it is sufficient to decrease the systematic variance and to increase the error. In other words, it is easier to be successful at non-replications while it takes expertise and diligence to generate a new result in a reliable fashion. If this is the case, it should be reflected in measures of academic achievement, e.g., in the h-index or in the number of previous publications. Although, the last word is not yet spoken, data from Gertler (2016) and Bench et al. (2017) suggest that this asymmetry may be empirically founded.”

    1) Didn’t the “false-positive psychology” -paper make clear that flexibility in data collection, analysis, and reporting dramatically increases the chance of finding a “positive” effect? Based on that paper, it seems to me that finding a “positive” effect does not necessarily have anything to do with “expertise”.

    2) The Gertler (2016) reference refers to a 6-hour long youtube video. I took a look at it, think i found the presentation of Gertler, and it seems to be about “code/analysis” replication in economics. It seems to not be about psychology, and not about the type of replication Strack is talking about in his paper. On top of that, i couldn’t find anything about any relation between “expertise” and (code/analysis-) replicability.

    The Bench et al. (2017) reference seems to me to conclude just about the opposite of what i understand Strack is trying to convey:

    “Using an objective measure of research expertise (number of publications), we found that expertise predicted larger replication effect sizes. The effect sizes selected and obtained by high-expertise replication teams was nearly twice as large as that obtained by low-expertise teams, particularly in replications of social psychology effects. Surprisingly, this effect seemed to be explained by experts choosing studies to replicate that had larger original effect sizes. There was little evidence that expertise predicted avoiding red flags (i.e. the troubling trio) or studies that varied in execution difficulty. However, experts did choose studies that were less context sensitive. Our results suggest that experts achieve greater replication success, in part, because they choose more robust and generalizable studies to replicate.”

    • Ben Prytherch says:

      That’s a great point on 1). Strack seems to be saying that getting significance is tough, but any old fool can get insignificance. As we know, any old fool can get either.
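The point that flexibility alone can produce “positive” findings can be sketched with a small simulation. This is a simplified illustration, not the simulation from the “false-positive psychology” paper: it assumes a true effect of exactly zero, independent outcome measures, and a researcher who reports a finding if any outcome reaches |z| > 1.96. The function name, seed, and number of outcomes are hypothetical.

```python
import random

def false_positive_rate(n_outcomes, n_sims=20000, seed=2):
    """Share of null studies with at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Each z-statistic is drawn under the null of zero effect
        zs = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]
        if any(abs(z) > 1.96 for z in zs):
            hits += 1
    return hits / n_sims

rate_one = false_positive_rate(1)   # one prespecified outcome
rate_four = false_positive_rate(4)  # pick whichever of four outcomes "works"

print(f"one outcome:   {rate_one:.3f}")
print(f"four outcomes: {rate_four:.3f}")
```

Under these assumptions the expected rates are 0.05 for one outcome and 1 - 0.95^4, about 0.185, for four, with no expertise involved.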

      • Anonymous says:

        I find it pretty hard to understand what he is trying to say in the quote i posted above. To me, he seems to be saying that it is easier for “non-experts” to be successful at non-replications, because it is somehow easier for them to decrease the systematic variance and increase the error variance.

        I suck at statistics, but don’t direct replications use (roughly) similar methods, materials, and participants as the original study, and hence can be expected to produce/have (relatively) similar systematic and error variances?

        If this is (roughly) correct, couldn’t you then state the exact opposite of what i understand Strack is trying to convey?

        Don’t direct replications make it harder to intentionally or unintentionally influence systematic and error variances in an experiment, and are therefore relatively unaffected by “expertise”?

        On top of that, couldn’t you say that (within reasonable boundaries, and following what i think Strack is trying to convey) it is actually easier for “experts” relative to “non-experts” to decrease the systematic variance and increase the error variance, and thus be successful at non-replications, because they have the “experience with the subject matter” and would therefore know how to manipulate both variances?

        • I think you have to understand the Psychology field as a whole as for the most part holding a “learned heroic expert initiated in the mysteries” view of themselves. In this view, if you want to “see an effect” it takes “the right kind of experimental mojo” and a random person, even a random psychologist at a different institution, is not going to really be able to “have that mojo” to elicit the effect reliably. Hence, there’s no problem with the fact that results don’t replicate, since they wouldn’t really be expected to without Adam Smithsonian’s Invisible Psychological Hand.

          Of course, the rest of us can look at this as “foolish experimenter is expert in fooling themselves by carefully biasing all their experiments and analyses to get what they want, what they really really want” to paraphrase the Spice Girls.

        • Ben Prytherch says:

          It could be that when Strack says “error variance” he doesn’t mean noise, but rather bias. One of the criticisms of the pen-in-mouth replications is that having cameras in the room makes people self-conscious and this counteracts the effect. And some of the criticisms of the replications from RP:P were along the lines of “the original incorporated this certain element that prevents this form of bias, but the replication didn’t”. I can’t think of a scenario in which someone’s lack of expertise in implementing a particular type of study would increase error variance in the sense of noise.

          • Anonymous says:

            “It could be that when Strack says “error variance” he doesn’t mean noise, but rather bias. One of the criticisms of the pen-in-mouth replications is that having cameras in the room makes people self-conscious and this counteracts the effect.”

            I tried to find a definition of “error variance” (like i said, i am really bad at statistics) and i found the following definition: “In statistics, the portion of the variance in a set of scores that is due to extraneous variables and measurement error.”.

            I then looked up a definition of “extraneous variables” and found the following: “Extraneous Variables are undesirable variables that influence the relationship between the variables that an experimenter is examining. Another way to think of this, is that these are variables that influence the outcome of an experiment, though they are not the variables that are actually of interest.”

            If these definitions are correct, and i am interpreting them correctly, i agree with your example of the (replication-) paper in question where the camera could have influenced the “error variance” (to me the camera could be an “extraneous variable”).

            However, while i reason that this could be a valid criticism of this specific replication study (which could perhaps be seen as not a strict “direct replication”), this does not change my more general reasoning/questions written above concerning what i think Strack is saying about “expertise” and (i assume “direct”) “replications”:

            I reason that direct replications use (roughly) similar methods, materials, and participants as the original study, and hence can be expected to produce/have (relatively) similar systematic and error variances.

            If this is (roughly) correct, i would then state the exact opposite of what i understand Strack is trying to convey.

            Direct replications make it harder to intentionally or unintentionally influence systematic and error variances in an experiment, and are therefore actually relatively unaffected by “expertise”.

    • Anonymous says:

      Another thing that puzzles me in Strack’s paper:

      “In other words, it is easier to be successful at non-replications while it takes expertise and diligence to generate a new result in a reliable fashion. If this is the case, it should be reflected in measures of academic achievement, e.g., in the h-index or in the number of previous publications. Although, the last word is not yet spoken, data from Gertler (2016) and Bench et al. (2017) suggest that this asymmetry may be empirically founded.”

      1) The reasoning in this sentence does not make sense to me. He states that “it takes expertise and diligence to generate a *new* result in a reliable fashion”, followed by “if this is the case, it should be reflected in measures of academic achievement, e.g. in the h-index or in the number of previous publications”. First of all, this to me does not necessarily make sense, i highly question the validity of number of publications as a measure of “expertise”, if only for the possibility that previous publications may have nothing to do with the current publication.

      More importantly however, he refers to 2 sources “which suggest that this asymmetry may be empirically founded”. I already expressed my concerns about these sources in relation to Strack’s claims above, but they don’t even make sense to me in this quote because he starts talking about how it takes “expertise” to generate a *new* result, but the rest of his writing (including the “evidence” that is supposed to back up his reasoning) seems to be about “successful” *replications*, not *new* results. His reasoning doesn’t make any sense to me.

      2) I find it remarkable that he states these type of things when it seems that his own “pen-in-mouth” -paper is partly based on a study by a student, who i gather at the time did not have a lot of “expertise” (as measured by H-index or previous publications). From his paper: “Study 2 is based on Sabine Stepper’s Diploma thesis at the University of Mannheim.”.

      Furthermore, i reason that there is a very high chance that “non-experts” (as measured by H-index or previous publications) helped with data-collection (served as the only experimenters?). From the paper: “Appreciation is extended to Paul Kodros and Gerlinde Willy for their assistance in data collection.”

      If a) a “non-expert” (as measured by H-index or previous publications) was able to “generate a new result” in his own paper, and b) “non-experts” (as measured by H-index or previous publications) very likely conducted the actual research of his own paper, does this not largely invalidate his own reasoning?

    • Anonymous says:

      https://psyarxiv.com/4vzfs/

      “To test the hypothesis that reproducibility is a function of researcher ‘caliber’, we collected 79 replications that had been conducted of four studies. We used the four published Registered Replication Reports (RRRs), investigations where a dozen or more individual research teams all attempt to replicate the same study”

      “We used the h-index—a metric of researcher caliber/impact/experience (Hirsch, 2005)—of the researchers who undertook replications of the various effects. The h-index for a given scientist is a function of the number of papers that author has published and the number of times those papers have been cited”

      “Our results showed no evidence whatsoever in favor of the researcher ‘caliber’ and reproducibility hypothesis. In three of the four RRRs, there was no association between obtained effect size and the ‘caliber’ of the researcher conducting the replication. In one of the RRRs, we actually saw evidence that more experienced researchers were closer to returning the overall meta-analytic effect of zero, with less experienced researchers being the ones who found evidence for ego depletion. Collapsing across all four RRRs, the relationship was zero (b = .00003, p > .992)”
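For readers unfamiliar with the metric used in the quotes above, the h-index is conventionally defined as the largest h such that an author has h papers with at least h citations each. A minimal sketch of that standard definition follows; the function name and the citation counts are hypothetical, and this is not the preprint’s own code.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only fall from here, so no larger h is possible
    return h

# Hypothetical citation counts for two authors
print(h_index([10, 8, 5, 4, 3]))  # 4
print(h_index([25, 8, 5, 3, 3]))  # 3
```

The second example shows why one heavily cited paper does little for the index, which is part of why its validity as a measure of “expertise” is questioned earlier in this thread.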

      • Anonymous says:

        Hmmm, it looks like the pre-print was posted on January 30, 2017 and Strack’s paper was received March 6, 2017.

        Too bad he didn’t see and/or include this pre-print in his paper before submitting it. I wonder if that would have changed his position.

  10. Fritz Strack says:

    Can a replication study serve as “a convincer that the original study is too noisy to be useful”?
    I’m not so sure. Noise reflects ignorance about “erroneous” influences. But this is typical for the beginning of a new research program.
    As we move along, we accumulate knowledge about relevant dispositional and contextual determinants that we want to measure or control, and thereby reduce the noise. Simply dismissing a “noisy” finding may therefore carry the risk of failing to pursue a promising research program.

    • Andrew says:

      Fritz:

      Lots of research programs are potentially worth pursuing, and statistical methods can be helpful in assessing the quality of measurements. For example, I think evolutionary psychology is worth pursuing, and one might even argue that the generalized Trivers-Willard hypothesis is worth pursuing—but pursuing it by studying patterns in sex ratios based on samples of a few thousand people is, I think, a complete waste of time, in the same way that it’s a waste of time to try to measure the speed of light by using Einstein’s formula and burning a cookie on a kitchen scale. It just won’t work: the measurement is too noisy, and if you were to get lucky and find a result, there’s no reason you’d expect it to show up again under a preregistered replication. But that doesn’t mean there’s nothing happening; this is just too noisy an experiment to give any valuable data on the question.

      I agree with you that a failed replication is not alone enough evidence to demonstrate that a particular experimental design is “too noisy to be useful.” Some theoretical understanding helps in these cases too.
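The claim that a lucky result from a very noisy design should not be expected to show up again can be illustrated with a small simulation. This is a sketch under stated assumptions: the true effect (0.1) and standard error (1.0) are made-up numbers chosen to represent a measurement that is noisy relative to the effect, and the function name is hypothetical.

```python
import random
import statistics

def simulate_noisy_studies(true_effect, se, n_sims=20000, seed=1):
    """Simulate many studies and summarize the ones reaching |estimate| > 1.96 * se."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    # Share of studies that reach significance
    power = len(significant) / n_sims
    # Average factor by which significant estimates overstate the true effect
    exaggeration = statistics.mean([abs(e) for e in significant]) / true_effect
    # Share of significant estimates pointing in the wrong direction
    wrong_sign = sum(e < 0 for e in significant) / len(significant)
    return power, exaggeration, wrong_sign

power, exaggeration, wrong_sign = simulate_noisy_studies(true_effect=0.1, se=1.0)

print(f"share significant:      {power:.3f}")
print(f"exaggeration factor:    {exaggeration:.1f}")
print(f"wrong sign if signif.:  {wrong_sign:.3f}")
```

Under these assumptions only about 5% of studies reach significance, and those that do overstate the effect many times over and often have the wrong sign, so a significant result carries little information about the underlying effect.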

      • Fritz Strack says:

        “I think evolutionary psychology is worth pursuing”

        Andrew:

        Given that facial (and bodily) feedback was originally proposed by Darwin under an evolutionary perspective, it is not surprising that subsequent research on embodiment has explicitly addressed this issue…. e.g. http://onlinelibrary.wiley.com/doi/10.1002/ejsp.664/full

        • Andrew says:

          Fritz:

          Yes, the topic is interesting. Two ways that statisticians can contribute to this sort of work are (a) addressing design and measurement issues, and (b) analyzing and presenting data that have been collected. The whole p-value thing is all about item (b), but my point in this thread is that item (a) is crucial. Without sufficiently high-quality data, researchers are mere Kanazawas, jumping up and down based on patterns that are essentially pure noise.

    • Martha (Smith) says:

      I cringe at phrases like “dispositional and contextual determinants”. They sound so, well, deterministic! Sure, there are potential influences, but I think that calling them “potential influences” helps us keep our minds on uncertainty, whereas “determinants” suggests thinking (unscientifically) in terms of certainty.

      • Fritz Strack says:

        The word “determinant” was deliberately chosen. Uncertainty means ignorance about the determinants, not their absence. Even if the outcome of a roll of dice is uncertain, it has been determined by the laws of mechanics.

        • Allan Cousins says:

          This is correct for most (pretty much all) lines of scientific inquiry (quantum mechanics aside).

          However, due to the complex nature of real world phenomena it usually is much more appropriate to view these phenomena from a probabilistic perspective. It’s not that one can’t go back and forth between the two perspectives (probabilistic and deterministic); it’s that when people do they usually try to impart more deterministic thinking than is usually warranted.

          I believe this is what Martha means when she says she cringes at people working in a deterministic framework. I mostly agree with her. If you stay in probability space less can go wrong.
