As if the 2010s never happened

E. J. writes:

I’m sure I’m not the first to send you this beauty.

Actually, E. J., you’re the only one who sent me this!

It’s a news article, “Can the fear of death instantly make you a better athlete?”, reporting on a psychology experiment:

For the first study, 31 male undergraduates who liked basketball and valued being good at it were recruited for what they were told was a personality and sports study. The subjects were asked to play two games of one-on-one basketball against a person they thought was another subject but who was actually one of the researchers.

In between the two games, the participants were asked to fill out a questionnaire. Half of the subjects were randomly assigned questions that probed them to think about a neutral topic (playing basketball); the other half were prompted to think about their mortality with questions such as, “Please briefly describe the thoughts and emotions that the thought of your own death arouses in you” . . .

That’s right, priming! What could be more retro than that?

The news article continues:

The researchers hypothesized that according to terror management theory, those who answered the mortality questions should show an improvement in their second game. When the results of the experiment, which was videotaped, were analyzed, the researchers found out the subjects’ responses exceeded their expectations: The performance in the second game for those who had received a memento mori increased 40 percent, while the other group’s performance was unchanged.

They quoted one of the researchers as saying, “What we were surprised at was the magnitude of the effect, the size in which we saw the increases from baseline.”

I have a feeling that nobody told them about type M errors.
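
For readers who haven't run into the term: a type M (magnitude) error is what happens when a noisy study gets written up only if it clears the statistical-significance bar, so the published estimate is necessarily exaggerated. Here's a minimal simulation sketch; the numbers (a small true effect and a standard error roughly in the ballpark of a 31-person study) are made up for illustration and are not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect = 2.0     # hypothetical small true improvement, in points scored
    se = 5.0              # hypothetical standard error for a study of ~31 people
    reps = 100_000

    estimates = rng.normal(true_effect, se, reps)   # what each hypothetical replication would report
    significant = np.abs(estimates / se) > 1.96     # the ones that would make it into a paper

    print("power:", significant.mean())             # roughly 0.07 with these made-up numbers
    print("type M exaggeration factor:",
          np.abs(estimates[significant]).mean() / true_effect)   # roughly 5-6x

The point: conditional on reaching statistical significance, the reported effect is several times the true effect, which is one reason a 40 percent improvement from a questionnaire prime should set off alarm bells rather than impress.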

There’s more at the link, if you’re interested.

I feel bad for everyone involved in this one. Understanding of researcher degrees of freedom and selection bias has only gradually percolated through psychology research, and it stands to reason that there are still lots of people, young and old, left behind, still doing old-style noise-mining, tea-leaf-reading research. I can only assume these researchers are doing their best, as is the journalist reporting these results, with none of them realizing that they’re doing little more than shuffling random numbers.

One recommendation that’s sometimes given in these settings is to do preregistered replication. I don’t always like to advise this because, realistically, I expect that the replication won’t work. But preregistration can help to convince. I refer you to the famous 50 shades of gray study.

62 thoughts on "As if the 2010s never happened"

  1. The reference for the study in question, which is oddly not given in the news article, is:

    Zestcott*, C. A., Lifshin*, U., Helm, P., & Greenberg, J. (2016). He dies, he scores: Evidence that reminders of death motivate improved performance in basketball. Journal of Sport & Exercise Psychology, 38, 470-480.

    PDF: https://www.researchgate.net/profile/Colin_Zestcott/publication/309144536_He_Dies_He_Scores_Evidence_that_Reminders_of_Death_Motivate_Improved_Performance_in_Basketball/links/5877ea9608aebf17d3bbc9c1.pdf

  2. I can only assume these researchers are doing their best, as is the journalist reporting these results, with none of them realizing that they’re doing little more than shuffling random numbers.

    That was the most disheartening aspect for me. Otherwise good people get chewed up by the system and put on a path of pseudoscience. There should be NHST rehab, retreats, and support groups.

    Apparently AA has ~1.4 million members and spends ~$10 million each year, so it only costs them about $7-8 per member per year (this is surprisingly low to me…).[1] To get an estimate of the number of addicts, around 90k people applied for NIH grants in 2015,[2] so that would be about $720k per year, or the cost of funding a single relatively big lab. This seems totally doable.

    The question is whether these people *want* to be helped though. In my experience the hardest part is getting them to admit there is a problem, especially when they see all their colleagues and coworkers doing the same thing.

    [1] https://www.aa.org/assets/en_US/en_gsofinancialinfo.pdf
    [2] https://nexus.od.nih.gov/all/2016/05/31/how-many-researchers/

  3. There is a weird non-disclosure in the paper (unless I missed it). In study 1, participants played basketball against a confederate, and their scores represent the DV (post-manipulation) and control variable (pre-manipulation). It seems from at least one account of this study that the confederate was actually the lead researcher! Now, the paper claims that the confederate/researcher was blind to which condition each participant was in, but this is still a really weird thing not to mention in the paper, especially since he probably got better at playing one-on-one basketball as the study went on.
    https://arstechnica.com/science/2016/11/youre-all-going-to-die-a-scientifically-proven-pep-talk-for-winning/

    Of course there are so many other issues here (low power, garden of forking paths) that this particular one hardly seems worth mentioning. The number of possible alternative analyses is particularly dizzying. A multiverse analysis would have been nice.
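
    For anyone who hasn't seen one: a multiverse analysis just re-runs the same comparison under every defensible combination of processing choices and reports the whole set of results rather than a single p-value. A toy sketch below, with simulated data and hypothetical choices (which outcome to use, whether to drop an outlier, one- or two-sided test) – these are not the authors' actual decision points, just an illustration:

      import itertools
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      n = 31
      pre = rng.normal(size=n)                  # simulated pre-manipulation scores
      post = 0.5 * pre + rng.normal(size=n)     # simulated post scores; no treatment effect built in
      group = rng.integers(0, 2, size=n)        # simulated condition labels

      def one_analysis(outcome, drop_lowest_pre, alternative):
          keep = np.ones(n, bool)
          if drop_lowest_pre:
              keep[np.argmin(pre)] = False      # a hypothetical exclusion rule
          y = post - pre if outcome == "change" else post
          return stats.ttest_ind(y[keep & (group == 1)], y[keep & (group == 0)],
                                 alternative=alternative).pvalue

      choices = itertools.product(["post", "change"], [False, True], ["two-sided", "greater"])
      print(sorted(round(one_analysis(*c), 3) for c in choices))   # eight p-values, not one

    Seeing the whole spread of p-values across reasonable choices tells you a lot more than the single one that made it into the paper.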

    • “especially since he probably got better at playing one-on-one basketball as the study went on”

      Ah! — so the study needed to include a graph of his score vs time of game!

  4. “In a second, subtler experiment, participants took part in a timed basket-shooting challenge where the instructions and rules were given by a researcher wearing a T-shirt with a large skull and the word “Death” on the front. The T-shirt was visible to half of the participants (randomly selected) and was covered by a zipped jacket for the others. … Subjects who viewed the skull-and-death shirt outperformed those who did not by 30 percent.”

    Clearly, viewing a guy wearing a jacket suppresses athletic performance, as was apparently known to the Greeks (hence the nude athletes at the Olympics).

    • “Participants were randomly assigned to the MS or the control condition. Participants in the MS condition were asked the prototypical open-ended questions regarding their mortality: ‘Please briefly describe the emotions that the thought of your own death arouses in you’ and ‘Jot down as specifically as you can, what you think will happen to you as you physically die and once you are physically dead.’ […] In the control condition, participants responded to parallel questions about playing basketball: ‘Please briefly describe the thoughts and emotions that the thought of playing basketball arouses in you’ and ‘Jot down as specifically as you can, what you think will happen to you as you play basketball.'”

      Clearly, being primed to be aware of your basketball-induced emotions and what’s about to happen to you as you play basketball is going to be pretty distracting the next time you try to play basketball. They said they discarded the actual survey responses but I’d love to see what participants said they thought was going to happen to them.

    • Does eating blood pancakes remind us of our mortality? And perhaps make us perform better that way… ah, they knew what they were doing when they fed those to us in elementary school. Bloody pancakes, death metal, and Sartre – that's the cocktail for a productive day. Or at least for becoming a Hot Topic customer.

      Well, anyway. This might've been brought up ad nauseam, but it never ceases to amaze me how researchers seem to have such an optimistic view of how easily they can just "induce" emotional states in people. Let us have a guy wear a t-shirt with a skull – and BAM: people have now been induced to tremble before their mortality. Let us play some stock sad music and BAM – sadness has been induced into the poor participant's brain.

      • Blood:

        I have no problem with the idea that the pancake or the shirt could have this effect on some people at some times. My problem is with the idea that such an intervention would have the large and consistent effect which is what’s being assumed.

    • If you read on you will see that the experimenter with the shirt left the location after the prime. I would love to hear an alternative explanation for this effect (so you think that a priming effect that was supported in hundreds of studies is not plausible, but that telepathy by the experimenter is possible?). By the way, when you have a 2-study package, study 1 is your precious preregistration. And also, if you are doing an experiment like this, you can be sure that you have a very clear hypothesis.

      • Please:

        Without getting into the details of this particular study, let me just say that, when statistically significant comparisons are reported in the absence of preregistration, it’s not necessary for there to be “an alternative explanation.” It’s enough to say that the published finding could’ve been capitalizing on chance.

        Regarding your other point: preregistration is what it is, it's not "precious." It requires clear plans ahead of time regarding data collection, data processing, and data analysis. Merely being study #1 or #2 or #3 or whatever in a paper is not the same as preregistration.

        • The whole problem is that you are not getting into the details of the study and you are reiterating things that we all know about. But if you are derogating a study in a public forum, and you are a professional scientist, then perhaps you should "get into the details" of the study. That was my whole point from the beginning. You think that if somebody does not follow your orders then their study is automatically not good, regardless of the details. There is no flexibility and no consideration of the actual article that we are talking about. You keep bringing up quasi-experimental designs, or designs that do not have a premeasure, and you keep avoiding the main point – that an experiment with a premeasure, one that was able to determine in preliminary analysis that random assignment had worked, is not the same as any other experiment, and therefore should not be judged by the same criteria. You are picking and choosing which arguments to address – while you have such dismay for people who do this in their research. Anyone who does not go by the guidelines that have been enforced on our field by the new authority figures who are in positions of power is now automatically wrong and discredited. I don't like this approach. I know you mean well, and that your goal is just, but at the end of the day I think that you are hurting rather than helping the advancement of SOCIAL PSYCHOLOGY as a science.

          Part of the problem is that unlike medicine or clinical psychology, in social psychology we are able to explore and investigate subtle things about human behavior where, if we get it wrong at first, there are no direct negative consequences. So if this study finds that thinking about death improves athletic performance, and the authors recommend that more studies be conducted here, no one gets hurt, and real scientific discovery can take place, if people who, for example, may have more resources, can replicate the effect. Cognitive dissonance theory, for example, was discovered in similar ways (the classic study had about 21 people per cell). If the effect is not replicated then we know that the effect is not true. However, if people must blindly obey your commands to have, say, 50 or 100 participants per cell (whatever the people in power now decide in their blogs, societies, and editorial addresses), then these initial studies would not be conducted. Further, in other cases people might not be willing to explore interactions in their data and come up with post hoc theories – THAT MAY BE CONFIRMED IN SUBSEQUENT STUDIES (by the way, people would never do a highly controlled lab experiment with more than 2 factors under the current demands – unless they are at Columbia or some other tier 1 university). And so these studies would not be attempted and then not be supported or refuted. And yes, I've just said a taboo word – post hoc theory – it has value in science as well! Sometimes you need to change your manipulation (when you are doing it with real people), sometimes you see things that you did not think about before and you modify your theory – as long as you confirm it in another study, I think that helps the confidence that you are not merely finding false positives. The bottom line is that in science, and especially in social psychology, progress is made over long periods of time – and we should not expect any one publication to be perfect – especially when we are merely talking about basketball performance and this is a pioneering experiment.

          Did people go ahead and survey what other people in their field think about the idea that one now MUST preregister their study, and analysis, and whatever, before it was decided to be the new norm across all sciences? No. Of course, if you are now going to decide whether a medicine or a treatment works, or if you want to determine that the climate is changing because of greenhouse gases and we must partake in a geoengineering project that would pollute the earth and kill millions of people, you should be damn sure about your results (and even then I wouldn't do it). So not everything should be judged by the same criteria…

        • Think:

          You can feel free to believe that the above-linked paper provides convincing evidence of a large effect. Indeed, you could’ve believed in the effect without the data in the paper at all! That’s fine; I’m not telling you what to believe. For more specific comments on the paper, see this comment thread.

          You also say, “if people must blindly obey your commands to have say 50 or 100 participants per cell…” Huh? I’m not giving any commands. You can do what you want; I’m just telling you that following the forking-paths-follow-the-statistical-significance approach is a recipe for chasing noise.

          I never said that the authors of the above paper, or anyone else, must (or, as you put it, MUST) pre-register their study. I just don’t find p-values to be much evidence at all in the presence of so many possible forking paths.

          You write that I’m “reiterating things that we all know about.” Unfortunately, we (if that is taken to mean the scientific profession, or practicing statisticians, or social scientists) don’t all know these things. Top journals such as PNAS continue to publish papers whose empirical content is little more than noise.

        • Thanks. Yes, I can believe what I want, and you certainly are believing what you want without even looking at the data. You still, after all these posts, have not responded to my main claim about the difference between an experiment with a strong premeasure (and preliminary analyses that show that random assignment worked on this measure) and an experiment without it. And I still did not see your examples about the "many forks" of this study. As if I was the one who put this study on the stand… But you really can just keep doing what you want, addressing only the points that fit your claims or that you can respond to automatically, and keep preaching what you believe. Thank you again for all your responses; I do appreciate them.

        • Think:

          I did give examples of forking paths; see here and also this from another commenter.

          Regarding your other point, just about all of the studies we've ever seen, including those on ESP, beauty and sex ratios, 50 shades of gray, etc., were believed to have strong premeasures. The trouble is that there is a one-to-many mapping from scientific hypotheses to statistical hypotheses.

          Science is harder than you have been led to believe: having a scientific hypothesis + randomization + some data + statistical significance is not enough. I apologize on behalf of the statistics profession that we sent the wrong message for many years, leaving lots of researchers in the lurch. I'm the bearer of bad tidings, and I can understand that this is frustrating to you. It would be much easier to hear the message that everything's ok, or that once you tick off a few boxes (randomization, some theory, statistical significance, multiple studies) you're fine. But I just can't send that message, as it's not correct.

        • Response to the response below. Yes, you're right, you did bring examples of possible forks, so I apologize for missing that. But as I wrote above (http://statmodeling.stat.columbia.edu/2017/09/19/2010s-never-happened/#comment-684714) I don't see this as a valid alternative in this case. The other thing is that I don't think that everything is "OK" and that this effect is "true," and in fact I don't think that I would ever know if it is. But I think that you are the one who is too certain that this effect, and all priming effects, are not true. So to me you are the one who is clinging to certainty without reservations… I know that you probably don't really think that this is 100% false, and that you feel like we can't know – but I think that your statements don't really reflect that, and so people who read what you say might get the wrong impression. For example, they might get the impression that all priming studies are false or that all TMT studies are false. I just think that everyone should be a bit more modest – the experimenters (true) and the critics. You can't expect that from journalists necessarily, but we should do this as scientists.

        • Think:

          You write, “you are the one who is too certain that this effect, and all priming effects, are not true.”

          No, I don’t think that, nor have I ever said it.

          I do think that the statistical procedures used to estimate these effects cause overestimates, thus I think that effects are much smaller, and much more context-dependent, than are often presented.

        • OK, thanks for clearing that up. Yes, some effects may be overestimated, and probably all effects are context-dependent. That does not mean that any effect found in a small sample in a given context is false, or that these things are not worth doing if the study may be "noisy" (whatever that means). Also, not all effects should be small. If you slap a person in the face and then measure their emotional response, I'm sure it would be an extremely large effect, and I'm sure that this may be obtained with a very small n. But I agree with your general points. I still think that each study needs to be looked at seriously before it's mocked so callously (especially by a distinguished professor). And again, I'm sorry that you didn't get the point of the difference between random assignment into condition when there is a premeasure that can help ensure that random assignment had worked, and random assignment into condition without such measures. I'm not sure why you are not able to compute this, and why you keep referring to experiments that don't do this as an example, but never mind. Maybe it's the mortality salience effect here in the background that is not allowing you to think that your statistical worldview may have a flaw in it. Or making you a bit more reluctant to admit that perhaps you didn't think of something very carefully and that you might have been wrong (self-esteem defense). Maybe, as you say, I'm doing it and I'm not willing to think that statistics as I understand it is wrong. I'll take the time to learn some Bayesian stats; maybe I'm missing something (although this was not part of the conversation).

        • Perhaps I misrepresented what you said – so I apologize. I guess this was my impression from the entire post and the "what could be more retro" statement in the beginning. Thank you for all the time, and I apologize if I was giving you a hard time here.

  5. I just skimmed the paper. Absolutely everything is in there… tiny sample size, participants removed for questionable reasons (“one participant showed up with his girlfriend”), regression analyses using one of many potential DVs and different subsets of a large pool of IVs, difference-between-significant-and-not-significant-is-significant, post-hoc theorizing about why one of their many potential DVs gave a significant result while the others didn’t, post-hoc power analyses showing high power for tests that produced small p-values, heavy use of causal language in the discussion, interpreting p > 0.05 as “no effect”, and unexplained and strange data analysis choices (combinations of bootstrapping w/fully parametric methods; 5,000 bootstrap samples for study 1; 10,000 bootstrap samples for study 2).

  6. FWIW, most of the terror management theory stuff I've ever seen was far more plausible… things like being reminded of crime in your community making you more intolerant of outgroups, or being reminded of your mortality making you cling to local traditions (e.g., less likely to accept birth control or condoms in cultures that have social taboos about those things).

    • Really? To me, it has always seemed that to explain the results, they (TMT researchers) would not actually need the construct of "mortality salience" at all – that all results could be explained by feelings of uncertainty, lack of control, or simply negative affect (I know they often control for negative affect, but not always). And second, this research line has often used very small samples and complex manipulations, and also really different manipulations which are considered to have the same effect. For instance, I remember one study in which looking at oneself in the mirror was supposed to cause mortality salience via first causing "existential salience" and then, via association between existence and non-existence, mortality salience. These methodological issues have made me somewhat doubtful of this research line. I mean, I can believe that making people feel sort of bad/uncertain may make them more judgmental and, perhaps, cling to their values and become more communal/less individualistic, but I'm not sure the TMT research has shown that it's mortality salience that causes these things.

      Granted, I'm a bit out of the "loop" now – I used to follow the TMT research pretty closely 5-10 years ago. It's possible that the researchers have since responded to these criticisms. And I'm definitely not saying they did anything wrong, just that I don't find most of the (older) results very plausible.

      • I remember one study in which looking at oneself in the mirror was supposed to cause mortality salience via first causing "existential salience" and then, via association between existence and non-existence, mortality salience.

        So looking in the mirror is treated the same as recently being beat up and mugged (or in a car accident, or nearly starving lost in the woods, etc)? Surely they do not just say mortality salience is present or not?

        • To my shame, I remembered the mirror thing wrong. It's a separate manipulation; there is a series of studies examining mortality salience x self-awareness interaction effects on outcomes, and looking into a mirror (vs. not) is the self-awareness manipulation.

      • ” To me, it has always seemed that to explain the results, they (TMT researchers) would not actually need the construct of “mortality salience” at all – that all results could be explained by feelings of uncertainty, lack of control, or simply negative affect ”

        That is basically the consensus outside of the core TMT group.

      • I would agree that there are often plausible alternative explanations for the results that otherwise seem reasonably well-founded. I can’t characterize the literature whatsoever since my knowledge of it is mostly limited to a few job talks and presentations from colleagues. In general I would say it seems difficult to isolate “mortality salience” from other, related constructs and I’m not sure whether/how this area has done so.

        • TMT research has focused more heavily nowadays on the DTA (death-thought accessibility) concept – which looks at the number of death-related constructs that become accessible after threat. These findings typically show that the MS effects that are commonly reported are the result of increases in DTA. That is, for example, defence of worldviews or bolstering self-esteem decreases DTA after an MS manipulation. In addition, threats to self-esteem and worldviews increase DTA. Some research has been done exploring why DTA increases when exposed to these threats, and the evidence again suggests it is not to do with an increase in negative thoughts or anger.

  7. With all due respect, I think that perhaps you should also take a minute to consider the fact that there is a very big difference between a correlational study and an experiment, and there is also a big difference between an experiment with a premeasure that allows the researchers to test whether random assignment has worked, and an experiment without a premeasure. When you have a premeasure, and you see that random assignment has worked, your chance of a type 1 error is reduced dramatically (something that non-experimenters rarely think of or acknowledge). I argue that the result of study 1 – which was based on a clear theory and previous findings, and was clearly pre-hypothesized (what else could one expect from this study?!), in which there was no difference in the pre-manipulation score but there was a difference in the post-manipulation score – is very solid. I don't see any serious alternative explanations proposed here. The fact that this study had another experiment after it, which supported the same finding (replication), makes the case here even much stronger. I challenge you to provide a valid and well thought-out alternative explanation (one that holds considering both studies).

    I also wonder if you generally think that these types of experiments should not even be attempted at all? The point here is that people like you, or perhaps someone who conducts actual experiments, could read about these effects and then also try to replicate this, and then we can all see if it's really true or not. Scientific progress is made over time – over multiple studies – and a single study is never going to be perfect. If the first study were never conducted, then nobody could ever attempt to replicate these things. Have you ever thought about that? Have you ever conducted an experiment to test a hypothesis with real people? (A real experiment with people takes a lot of effort – it's much more than "shuffling numbers".) When you conduct an experiment based on a theory like TMT you ALWAYS have to have a hypothesis beforehand, unless you really don't know what the hell you are doing and you want to just waste lots of time and resources. And guess what – you also need to have more than one study in the package (in this case Study 1 is the preregistration of Study 2!). Or maybe you think that we should all just stay in the office and look at correlations instead of testing theories and trying new things.

    I wonder if you have ever thought critically about your own arguments, and if you have ever thought critically about the poorly conducted replication studies (your 2010 reference?) that had brought forward this replication "crisis" (e.g., the "many labs" study that had 17 experiments in one link in a row, and consequently more than 30 cells and perhaps the lowest power of any study that was ever conducted in psychology). Getting a null result by conducting a poorly designed study is really not hard at all…

    • Think:

      1. No, testing if random assignment has worked is not the same as preregistration of data collection, data processing, and data analysis choices.

      2. No, having a clear theory and previous findings is not the same as preregistration of data collection, data processing, and data analysis choices. A single theory can map to many many many different decisions in data processing and analysis.

      3. The second study may seem like a replication but it’s not preregistered. Remember: Bem’s ESP paper had 9 different studies, but none of them were preregistered replications, and they had lots and lots of forking paths.

      4. Having a hypothesis in hand is not the same as preregistration of data collection, data processing, and data analysis choices. See, for example, the wonderful paper discussed here.

      5. A real experiment can indeed take a lot of effort; I agree that it’s much more than “shuffling numbers.” Nonetheless, if you take a real experiment and you have an analysis plan that has many researcher degrees of freedom, you can get statistical significance out of pure noise.

      6. No, having more than one study in a paper is not the same as preregistration.

      7. No, of course I don’t think we should all just stay in the office and look at correlations instead of testing theories and trying new things. I also don’t think that trying new things will help much if our measurements are too noisy.

      8. Yes, I’ve often thought critically about my own arguments, as you’ll see if you read some of my published papers.

      Points 1-7 above are subtle. That's why the title of this post is "As if the 2010s never happened." As of fifteen years ago, lots of researchers—including me—weren't aware of how much published work was compromised because of uncontrolled researcher degrees of freedom (to use the term of Simmons, Nelson, and Simonsohn, 2011). And lots of researchers—including me—weren't so aware of the importance of accurate measurement. We had this naive view that, as long as you had causal identification (e.g., randomized treatment assignment), then all was cool. But we gradually realized that wasn't enough; some of the history is here. What I'm saying is: your view might sound reasonable, and it's how a lot of people used to think, but it just turns out to be wrong. Or, to be more precise, your view is correct in a context of large consistent effects and precise measurements—but it falls apart when effects are small or highly variable and when measurements are noisy. That's what it took us all a decade to understand.

    • Think:

      Just to elaborate: I blame the statistics profession for many of these misunderstandings. We’ve written textbooks emphasizing randomization but with very little about measurement; we’ve written textbooks in which examples culminate in statistical significance and success; we haven’t made it clear how these ideas can go wrong.

      I don’t think the psychologists who design and analyze these studies are fools, or that they’re bad people. I think they’re doing their best, and they’re following the instructions they’ve been given in their statistics and research methods training, which focuses on randomization and statistical significance, with not much attention to measurement or the problems of statistical significance in the context of small or highly variable effects and noisy measurements.

      There’s been a change in statistical research methods, but that change has come first to the blogs and to the journals (for example here and here, just to mention a couple of my papers). But it hasn’t made its way into the textbooks, nor has it completely made its way into training and practice.

      I appreciate that you and others present your views here on the blog, as it gives us an opportunity to explore and explain these ideas. For your own research projects, I hope you can draw the lessons of careful measurement and within-person comparisons. Preregistration is fine but it’s not a solution in itself. As discussed in some recent blog posts, all the preregistration in the world won’t save you if you’re studying small or highly variable effects with noisy measurements.

  8. I really appreciate your fast response. But I wish that you had taken some more time before you responded… I'll respond to your points below.

    1. I did not say that random assignment has anything to do with preregistration. My point was that criticism about the power of a study should in fact take these factors into account – but people do not do this, unfortunately. Statisticians like yourself and many others (usually personality psychologists) do not even consider whether a study is experimental or correlational when thinking about power (which is a shame), and furthermore, they also do not consider whether there was a premeasure or not (and the strength of the relationship between the premeasure and the DV can also help estimate the degree of noise). To make this clearer for you, think of the difference between an experiment measuring some physiological response with a baseline and without a baseline… it makes a big difference.

    2. Yes, having EXPERIMENT 1 based on a theory and previous related studies (e.g., Peters et al., 2005), with one manipulation, does make a very clear case that this was pre-hypothesized. Do you think that this study was testing anything else? What could it possibly be, I wonder.

    3. Yes, having EXPERIMENT 1 in the same package as EXPERIMENT 2 is like having a preregistration if the design is this close. The point is that asking for preregistration in this case is absurd. Of course you are forcing people to do this, but the point is that it's not always necessary – not when you have a clear EXPERIMENT testing a theory with a replication. Note that a related effect was also found in a previous study.

    4. See point 2

    5. The chance of getting an effect in a study where random assignment worked (meaning it's not a type 1 error) based on noise is very low (p was actually less than .005 in this case). I wonder why, as a statistician, you are not more modest when you take an argument that has a 95% chance or less to be false. At least do this more carefully… Furthermore, the chance of having noise create this (large) effect is less than 5% × 5% = 0.25% (and it was lower in this case). So I wonder who is making unsubstantiated claims here.

    6. See point 3 (not sure why you repeated yourself).

    7. You don't think that trying new things is beneficial? Seriously? Take a few minutes or perhaps have a cool drink and rethink this. Again, study 1 was not noisy; there was a lot of control with the premeasure. And again, if no one takes the first step, who will replicate it? I know that you have written many books and articles, and perhaps you don't want to bother yourself with the idea that you may have been wrong in regard to something. But considering that your whole point is that someone else is usually wrong – you owe it to yourself, you owe it to your readers, you owe it to the field, and you owe it to all the people that you so carelessly criticize and whose hard work you use your power and authority to discredit. If you truly care about science, then take a minute or two to really think about this point. I'm sure you can do better than this.

    8. OK, I'll read some more of your papers. I don't see much critical thinking here, though – especially in the last point, which would seriously discourage scientists from trying to explore new things and do interesting stuff with real people out in the field. I actually did read a 2010 article of yours where you applauded such efforts…

    9 (the second response). This study actually does have a within-subject comparison (which produced a very large effect that was highly statistically significant). I suggest that you carefully read the studies before you butcher them. But within-subject designs also have many weaknesses when you are conducting experiments with deception… These are things that experimenters always think about, but statisticians don't. If you want to criticize a field (so harshly) you may want to actually know all the ins and outs. I think that it could be a good experience for you to actually try these things for yourself. After you train an RA for 3 months to be able to run the study properly, you will perhaps better understand how much thought is put into these things, and how absurd it is to think that people do these things without having an a priori hypothesis (I'm talking about a clear design like in this study). Not everyone agrees with the changes that were forced on the field. There was a recent issue in JESP on the subject that highlighted diverse views that may be helpful: Volume 66, Pages 1-166 (September 2016), and the publication in Science by Gilbert, King, Pettigrew, and Wilson (2016) also seems important to me. Perhaps everyone should be more responsible – the experimenters AND the critics…

    By the way, Nosek's work in the replication efforts is below criticism; again, take a look at the "many labs" papers and their data and read how in the methods they report that they HAD ALL THE EXPERIMENTS IN ONE LINK! IT'S SIMPLY INSANE! So you do a prejudice study and then a flag prime and then… all of these in different environments, and then you find a null result and you claim that the original studies are false. PLEASE download the "many labs" data file and calculate how many people they have per cell if you take into account all the independent variables and environments (e.g., in lab, online), and this is not even talking about the 13 different orders. I think that their n would be around 1 or 2 people per cell. Please do this for yourself.

    So I do appreciate your response, but I am not impressed by your proposed alternative explanation — that noise caused the effect, and the fact that you can back your claim with 0.25% confidence is actually pretty disturbing to me. With all due respect, I hope that you will consider these things very seriously, as you are a person with much influence. We learn from our mistakes just as much as we learn from our successes. But if you don't acknowledge that you might be wrong, you won't learn.

    I wish you all the very best and thank you for your responses!

    works cited
    Peters, H. J., Greenberg, J., Williams, J. M., & Schneider, N. R. (2005). Applying terror management theory to performance: Can reminding individuals of their mortality increase strength output? Journal of Sport & Exercise Psychology, 27, 111-116.

    • Think:

      I’ll just respond to your key points:

      5. No no no. As Simmons, Nelson, and Simonsohn (2011) put it, "Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." Randomization has nothing to do with it. You further write, "the chance of having noise create this (large) effect is less than 5% × 5% = 0.25%." No, that's not the case if there are forking paths. (A small simulation after point 8 below makes this concrete.) There have been many many demonstrations of this point, both theoretical and practical. One compelling example is the paper by Bem that purports to present strong evidence for ESP, but really just demonstrates that it's possible to repeatedly attain statistical significance by taking advantage of researcher degrees of freedom in data processing and analysis.

      7. You write, "You don't think that trying new things is beneficial?" I never said that I don't think trying new things is beneficial. Of course trying new things can be beneficial. It depends; sometimes yes, not always. If your measurements are too noisy then I don't think there's much benefit. Science is about measurement, not just trying things.

      8. You write, “I am not impressed by your proposed alternative explanation — that noise caused the effect, and the fact that you can back your claim with 0.25% confidence…” You’re misunderstanding. I’m not saying “noise caused the effect” and I’m certainly not saying I can back a claim with 0.25% confidence, whatever that means. I’m saying that, just as in the ESP experiments, the beauty-and-sex-ratio study, and many many many other examples, it’s possible to get these patterns by chance alone. I’m not saying there’s nothing there; I’m saying there’s no good evidence for anything.
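
      To make the arithmetic concrete, here is a minimal simulation sketch. The setup is made up (two candidate outcome measures and an optional exclusion rule, with no true effect anywhere); the point is only that even a modest amount of analytic flexibility pushes the per-study false-positive rate well above 5%, and the two-studies-in-a-row rate well above 0.25%:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        n_per_group, n_sims = 16, 10_000         # roughly the scale of a 31-person study

        def one_null_study():
            a = rng.normal(size=(n_per_group, 2))    # group A: two candidate outcomes, pure noise
            b = rng.normal(size=(n_per_group, 2))    # group B: same
            pvals = []
            for outcome in (0, 1):                   # fork: which outcome to report
                for drop_extreme in (False, True):   # fork: drop the most extreme score per group?
                    xa, xb = a[:, outcome], b[:, outcome]
                    if drop_extreme:
                        xa = np.delete(xa, np.argmax(np.abs(xa)))
                        xb = np.delete(xb, np.argmax(np.abs(xb)))
                    pvals.append(stats.ttest_ind(xa, xb).pvalue)
            return min(pvals) < 0.05                 # "success" if any variant is significant

        success_rate = np.mean([one_null_study() for _ in range(n_sims)])
        print("per-study 'significance' rate under the null:", success_rate)        # well above 0.05
        print("chance of two 'successful' studies in a row :", success_rate ** 2)   # well above 0.0025

      And this is with only four analysis variants per study; the real garden of forking paths (exclusions, covariates, interactions, outcome codings) is much larger.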

        • Think:

          Lots of forking paths here, including coding of data, decisions of what data to exclude, decisions of what outcomes to look at, decisions on what to control for, decisions on what interactions to look at, decisions on what to include in the path analysis, probably some other things I didn’t notice. Simmons, Nelson, and Simonsohn talk about “p-hacking” but I think this is misleading in that it suggests that researchers are deliberately trying out lots of different specifications in order to find statistical significance. I think the problems are often more subtle, that researchers are doing reasonable-seeming analyses of their data, but these analyses don’t have the statistical properties their users think they have.

          It's nothing special about this particular paper. We've learned from a decade of bitter experience that lots of research claims which seem airtight because of low p-values don't make sense or don't replicate. The ESP and 50-shades-of-gray studies are just two examples out of many.

          I'm not saying the claims in this particular paper are wrong, just that I have no idea. That's why we don't believe these sorts of claims in the absence of external replication. It took us a while to get to this point, but that's where we are now. Again, I know it's a hard pill to swallow, but it's something we've learned, that even a seemingly clear scientific hypothesis can map to many many different choices in data processing and analysis.

        • I do agree with the claim that there could have been other ways to analyze the data. But given that this effect is not marginal, I don't think it would have made a big difference. Also, if you don't support null hypothesis testing and are more flexible, then what do you find so concerning? How would you analyze these data? How would that be different?
          Thank you again. I see your issue with the many forks now, but I don't agree that it's a problem in this case. Also, there were not so many forks in the second study here. And again, I think that time will tell if the effect is right or not. Overly conservative analyses may also induce more type 2 errors, and that's not ideal either. I know you like other types of analyses, and I admit that I need to learn more about these, but I just did not see any convincing and concrete alternative explanations proposed here.

      • Who gets to decide what's noisy and what is not? I'm sure that the error variance in this design, which has pre- and post-manipulation measurements that were highly intercorrelated, could be construed as much less noisy than in some of the high-powered studies out there – especially those which are correlational or quasi-experimental.

  9. Ah, and one more thing. The 50 shades of gray study is NOT an experiment with random assignment. So I can easily think of 50 alternative explanations for why it did not replicate beyond that sample (third-variable problem). So no, it's not a good example.
    Thank you again for your time. I really do appreciate it.

      • I'm not saying it's never a problem. I'm saying it's not always a problem. Not in this case, where there is a very clear theory and only one manipulation and 2 studies. You are judging everything with the same criteria and refusing to address the specific circumstances of each study. There is already a lot of bureaucracy and there are a lot of hoops experimenters have to go through. The fact that someone is not dancing to your flute does not automatically make them wrong. Not all cases are the same. Yes, if you are doing a one-shot study or have multiple forks, I agree that you should preregister. But I think that you have killed this study here with prejudice – without really looking into it. And your arguments are framed with excessive confidence and non-specific claims. Research suffers from these attitudes. I'm not going to go into the possible disadvantages of preregistration here, and there are a few. I'm just noticing that you don't have a case here. I also don't think it would even matter to you if this was in fact preregistered. And you suggested this above.
        Thanks

        • I apologize for the terrible typos above – I'm using my phone here… Please excuse me.

    • Think again: Sorry, but this phrase of yours, "study where random assignment worked (meaning it's not a type 1 error)," suggests you are not understanding how random assignment works – it _works_ by making the type 1 error rate equal to the nominal level, say 5%.

      Now, the oversight Andrew is bringing up here is that it involves more than just random assignment but also things like ensuring compliance, blinding participants and assessors, and other procedural maneuvers in the experiment (which many are aware of). What was less appreciated is the whole process of data collection, recording, analysis, reporting/publication, and then locating and reassembling all relevant findings completely and accurately. And importantly (and very subtly) it's not just the whole process as it did unfold but how it could have unfolded.

      It's all reasoning about collectives – in distribution, not in occurrence.

      • I'm sorry for not explaining myself clearly enough. If you have a premeasure you can check whether random assignment has worked by conducting preliminary analyses (as was done in this study) and ensuring that there are no differences between the groups in the pre-manipulation measure. If your premeasure is highly related to the DV (as it was in this case) then your chance of getting a false positive due to type 1 error (or type S error) is reduced. My point is that this reduces your chances of type 1 error further, and that a study that does this should not be judged in the same fashion as a study that does not do this (like the ESP study). And this is all at a very different level than the 50 shades study. Maybe you are not getting this because you are not experimentalists. Maybe I'm not explaining myself clearly enough.
        In any case, thank you for your time.

        • Think again: This is all tricky stuff and I have been working with and training clinical trialists for a long time.

          In particular, https://en.wikipedia.org/wiki/Alvan_Feinstein had similar insights and passed them on to his research fellows, who then became my responsibility with regard to adequately grasping statistical methods – especially in randomized clinical trials. It took a number of face-to-face meetings.

          It's subtle, and it's true that differences between highly related premeasures will affect the type 1 error rate if you do the analysis incorrectly. In fact, that the type 1 error rate varies predictably tells you the analysis is wrong (technically called relevant subsets).

          To correct for this deficiency one should do an analysis of covariance (which adjusts away the observed baseline imbalance) to get a fixed type 1 error rate (google Stephen Senn on randomization and analysis of covariance); a small simulation sketch at the end of this comment illustrates this. Now, if you want a different error rate, just choose a different error rate.

          Now Alvan had thought that if he got _all_ the premeasures related to the DV and adjusted for those, the type 1 error would be close to zero. They had tried to simulate this, using analysis of covariance, and were puzzled why they always got close to 5% significant effects. That was the dilemma I was asked to sort out – what is the problem with our simulation program?
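
          Here, to illustrate the point about relevant subsets and the covariance adjustment, is a minimal simulation sketch. The numbers are made up (a premeasure correlated 0.8 with the DV, two groups of 16, no true treatment effect), and it leans on numpy, scipy, and statsmodels rather than anything from the paper: the unadjusted comparison's error rate depends on whether the observed baseline happened to be balanced, while the ANCOVA error rate stays at the nominal level either way.

            import numpy as np
            from scipy import stats
            import statsmodels.api as sm

            rng = np.random.default_rng(3)
            n_per_group, rho, n_sims = 16, 0.8, 10_000     # assumed premeasure-DV correlation of 0.8

            results = []
            for _ in range(n_sims):
                base = rng.normal(size=2 * n_per_group)                                      # premeasure
                post = rho * base + np.sqrt(1 - rho**2) * rng.normal(size=2 * n_per_group)   # DV; true effect is zero
                treat = np.repeat([0, 1], n_per_group)                                       # two groups of 16
                balanced = stats.ttest_ind(base[treat == 1], base[treat == 0]).pvalue > 0.05   # "randomization worked"
                p_unadj = stats.ttest_ind(post[treat == 1], post[treat == 0]).pvalue           # unadjusted comparison
                X = sm.add_constant(np.column_stack([treat, base]))
                p_ancova = sm.OLS(post, X).fit().pvalues[1]        # treatment effect, adjusted for the premeasure
                results.append((balanced, p_unadj < 0.05, p_ancova < 0.05))

            results = np.array(results, dtype=float)
            bal = results[:, 0] == 1
            print("unadjusted test, 'balanced' baselines  :", results[bal, 1].mean())    # below 0.05
            print("unadjusted test, 'imbalanced' baselines:", results[~bal, 1].mean())   # well above 0.05
            print("ANCOVA, 'balanced' baselines           :", results[bal, 2].mean())    # about 0.05
            print("ANCOVA, 'imbalanced' baselines         :", results[~bal, 2].mean())   # about 0.05

          So a passed balance check does change the conditional error rate of the unadjusted test (that is exactly the relevant-subsets problem), but the clean fix is to adjust for the premeasure, which holds the rate at the nominal level whatever the observed baseline difference is.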

  10. Thanks Keith. Do you mind if I email you my response to your question? I’m currently working on this myself and would be honored to collaborate with you if you would be interested. I did not identify myself by name in this blog because I felt that the posts here were rather offensive and disrespectful and it was late at night and I had a drink and might have also responded in an inconsiderate manner – and I apologize to everyone here if I had done so.

    • Hopefully this will not seem inconsiderate but there are reasons for me not wanting to do that.

      1. I comment here as it's public and everyone gets an opportunity to learn – especially me – I feel this public practice space is a key resource for me.

      2. Any collaboration would have to go through a conflict-of-interest vetting of some sort, which errs on the side of avoiding them.

      3. It's not a topic I am motivated enough to spend scarce time on.

      I doubt if anyone here needs an apology.

      • OK Keith, thank you very much for your honest response. The key is to toss out the data if random assignment does not work, or to collect more data until it does, and only then continue to hypothesis testing (and possibly using statistical control). It's a two-phase process; that's the point.
