Too Linear To Be True: The curious case of Jens Forster

Yup, another social psychology researcher from northwestern Europe who got results that people just don’t believe.

I’m a fan of Retraction Watch but not a regular reader so I actually heard about this one indirectly, via this email from Baruch Eitam which contained the above link and the following note:

Of the latest troubles in Social psychology you probably heard. Now, about the others I wasn’t surprised, I “grew up” in this climate. I know these practices and in a real sense the manner my lab works is molded as a negative to these practices (mostly, shared scripts and datafiles and replications, replications, replications). I personally wasted ~1.5 years of work and 600(!) participants trying to replicate one of Bargh’s experiments (at Columbia btw). Jens’ case surprised me a great deal. I met the guy and he seemed to me the most passionate and honest person. Also not one of these hotshots or ideologists that treat the data as ornamentation to their great ideas. Orthogonally to my prior I am also disgusted by the manner of the discussion. I find that the whistleblowers have turned into a bunch of bloodthirsty hunters that fail to show even a shred of doubt when at stake (with all due respect to science) is a person’s life’s work, integrity and livelihood. I know you pitched in the discussion and we had an excellent lab around your recent paper on the non-fishing version of “researcher degrees of freedom”. I strongly resonate with this rigor squeeze and have intuitively applied many of these procedures, without much guidance, as early as my second year of doctoral studies. Still, to what degree of certainty can we be sure about the conclusions based on the (weird, I must admit) linear pattern found in Forster’s data? Can we quantify the certainty of the conclusion itself? Shouldn’t at least two different statistical approaches be applied when the future of a person is at stake here? I also asked a collaborator of mine, Zoltan Dienes, to look into the “statistical accusations” but I would love to hear your opinion and would be more than glad if you address this on your blog, as I feel it is an ethical matter at least as much as it is a professional one (giving the dude a fair hearing, that is).

I have tried to voice this opinion on retraction watch but was attacked by a bunch of angry men and lost interest (I am from Israel, there are more serious battles to fight here…)

My reply: Figure 3 on page 9 of this report looks pretty convincing, no?

The only weird thing is, why would someone make such data so linear on purpose? It makes me suspect that perhaps what Forster reported as raw data were not raw data but rather were derived quantities based on a fitted model (perhaps with the rare departures from perfect linearity arising from roundoff issues). I could imagine this happening in a collaborative project where data and analyses are passed around. Forster wrote that he “never manipulated data” but later he said “if data manipulation took place, something that cannot even be decided on the basis of the available data, it cannot be said who did it and how it was done” which makes me wonder whether there were some holes in the collaboration.

At this point, one possible scenario for Forster seems to be that something is very wrong in these papers, he knows it, and he’s trying to give denials that are misleading but literally true. I don’t know for sure, but it does seem like (a) the data are fishy and (b) he must know that these aren’t the raw data. Another possibility is that, if the raw data ever were recovered, the ultimate research conclusions might not change much. I can’t really tell, having not looked at the original papers and indeed having only skimmed the report from the university committee.

It’s hard for me to picture someone going to the trouble of making up data to look exactly linear, but I could easily see a research team, through a mix of sloppiness and misunderstanding of statistics, taking the estimates from a linear model and then plugging them into a later analysis. Sometimes this sort of post-processing makes sense in an analysis; other times it’s not the right thing to do, but people do it without realizing the problem.

71 thoughts on “Too Linear To Be True: The curious case of Jens Forster”

        • Yeah, second stringers only do “duplication”, as they appear unable to make any new theoretical and empirical discovery on their own.

          Moreover, if a new and innovative study does not replicate then this is probably because it is such a new and innovative study.

          There is another unintended effect of the replication movement, namely that it places too much emphasis on duplication and not enough on discovering new and interesting things about human behavior, which is, after all, why most of us got into the field in the first place. As noted by Jim Coan, the field has become preoccupied with prevention and error detection—negative psychology—at the expense of exploration and discovery. The biggest scientific advances are usually made by researchers who pursue unorthodox ideas, invent new methods, and take chances. Almost by definition, researchers who adopt this approach will produce findings that are less replicable than ones by researchers who conduct small extensions of established methodologies, at least at first, because the moderator variables and causal mechanisms of novel phenomena are not as well understood. I fear that in the current atmosphere, many researchers will gravitate to safe, easily replicable projects and away from novel, creative ones that may not be easily replicable at first but could lead to revolutionary advances.

          Source

          But for some nasty bullies all is fine and good in the social sciences.

          PS I wonder if the replicators don’t feel bullied by so many “bullying” epithets thrown at them.

    • I’m guessing it’s the Dutch scientific integrity board (LOWI), per this: http://retractionwatch.com/2014/05/07/forster-report-cites-unavoidable-conclusion-of-data-manipulation/

      Though on second glance, that specific document might be the complaint that triggered the investigation, which perhaps was made anonymously (or at least anonymously to outside observers).

      Regardless, here’s another look at one of the papers in question: http://datacolada.org/2014/05/08/21-fake-data-colada/

      • Maybe. On closer reading I was a bit surprised by the date on this report: September 3, 2012.

        What’s been going on for almost two years? Some of these things really move at a glacial pace. They’d put the Indian court system to shame.

  1. Edward Tufte has an example in one of his books (VDoQD I think) in which a plot shows up with the raw data as little pinpricks, almost unnoticeable beneath a heavy grid of lines and a line that shows the (linear) model fit, and in a subsequent version of the plot the data are altogether missing and all that remains is the perfectly straight model fit line.

    I can see how that kind of thing can happen — it could be simply a matter of referring to the wrong column of a table, thereby plotting (or even analyzing) the predicted values rather than the actual values. I’m not saying this is what happened here, but it could be. You’d think someone would notice, though.

    We are faced with two possibilities, each seemingly very implausible: (1) someone deliberately faked the data, but in a way that looks way too good to be true, or (2) someone made some sort of extremely boneheaded mistake and didn’t take a closer look when the results seemed too good to be true.

    In a funny way I would consider each to be extremely unlikely, but if I knew it had to be one or the other I’d rate them as about equally (un)likely. But maybe there are other possibilities, one of which is plausible.

    • I find your (2) more likely. I’ve done something similar in the past (by referring to the right sequence of Excel cells, ending up with a trivial tautological straight line that one mistakes for a fantastic correlation) but luckily discovered it pretty early.

      • I was once wildly impressed with an r = 1.0 until I realised I was regressing the line number of the data file against a sequential id number for the subjects. Oops.

    • Phil: “We are faced with two possibilities, each seemingly very implausible: ”

      I think this misreads simple naive Bayes. You have a bunch of very straight lines across a bunch of studies. Either Nature did it or not. If not, it could be, say, manipulation or error.

      Now, the tests suggest that the chance that it was Nature is pretty small. If so, “explaining away” (cf. probabilistic reasoning) should lead us to update and place stronger belief on “Nature did not do it.”

      And if Nature is not responsible then it could be fabrication or mistakes. But note that mistakes would have needed to be made across several studies in exactly the same way. I think that is very unlikely so, by explaining away, we get to the most likely outcome.

      Even so, this conclusion is not certain. The issue is whether it lies within reasonable doubt.

      • To be clear: Conditional on the published straight lines, and the tests, the probability that Nature did not do it is very high, presumably more than 50% (by “explaining away”). By implication, if “error” or “fabrication” is a partition of the “Nature did not do it” set, then each has a pretty high probability. So I would not say each is “seemingly very implausible”.
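
A toy version of this “explaining away” update, just to make the mechanics concrete. Every number below (the priors and the likelihood of such straight lines under each scenario) is invented for illustration; nothing is calibrated to the actual case.

```python
# Toy Bayes update for the "explaining away" argument above.
# All priors and likelihoods are made up purely to illustrate the mechanics.
scenarios = {
    "Nature did it":         (0.95, 1e-6),  # (prior, P(this much linearity | scenario))
    "same mistake repeated": (0.04, 0.05),  # an innocent error recurring across studies
    "fabrication":           (0.01, 0.50),  # deliberate adjustment toward a linear template
}

evidence = sum(prior * lik for prior, lik in scenarios.values())
for name, (prior, lik) in scenarios.items():
    print(f"{name}: posterior = {prior * lik / evidence:.4f}")
```

Even with a prior heavily favoring “Nature did it,” the tiny likelihood of such straight lines under honest sampling pushes nearly all the posterior mass onto the other two scenarios, which is the update described above.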

        • You clearly misread me. The idea “nature did it” is not one of my two options. I think we can dismiss that one altogether. There are two remaining possibilities that I can think of, each of which is very unlikely.

  2. There are two questions here:
    1) Are the data real?
    2) Did JF manipulate or fabricate the data?

    The answer to 1) is clear.
    Answering 2) is difficult, given the possibility of alternative explanations.
    This leads to a third question:
    3) Is there enough evidence to seriously damage JF’s career?

    • Anon,

      I disagree. The only question is whether the results are replicated by others. I see little difference between making up data and incompetent p-hacking when it comes to diminishing the “integrity” of scientific literature. The latter is pretty much socially acceptable these days (and seems to have been for decades). Of course making up data is bad, but if we are honest with ourselves it is a drop in the bucket compared to other problems. I don’t think it is really worth developing other methods of addressing outright fraud when we already have replication to weed this stuff out.

      • Also, a good filter is the presence of dynamite plots*, which are the way the data were presented in the offending paper**. Personally, I don’t bother reading past the abstract once I see those unless the paper is closely related to my own work or the raw data is made available.

        *http://biostat.mc.vanderbilt.edu/twiki/pub/Main/TatsukiRcode/Poster3.pdf
        **http://spp.sagepub.com/content/3/1/108.abstract

  3. I like the sentence “The combined left-tailed p-value of the entire set is p = 1.96 * 10^-21, which corresponds to finding such consistent results (or more consistent results) in one out of 508 trillion (508,000,000,000,000,000,000).”

    Are there people in the target readership who think ‘OMG! Wow! Now that they write it out, a trillion is a LOT!’

    Though, they should have written ‘(…) entire set is p = 1.96 * 10^-21 (0.00000000000000000000196), which (…)’

    • Martin,

      Look at their figure 2. It appears 3/16 (Lerouge_2, Malkoc, and Smith_4) of the “control” results were very close to a linear trend. From this we could estimate that ~18.75% of the Forster results would be close to linear if each experiment was independent of the last. For 42 outcomes we would expect ~8 to be linear. Of course the experiments were measuring similar phenomena so that number should be somewhat higher if the linear effect did exist. I suspect the 1 in 508 trillion claim is overblown, maybe someone can figure out the right stats to do using that line of reasoning.

      I didn’t look at the critique that closely but figured I’d throw some criticism their way.
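
For what it’s worth, here is that back-of-envelope reasoning written out. The 18.75% base rate is the 3/16 figure from the comment above; the “observed” count of linear-looking outcomes is a placeholder, since the report does not state one in that form, and treating the 42 outcomes as independent is itself doubtful, as the comment notes.

```python
# Sketch of the reasoning above: if ~18.75% of honest results happen to look
# close to linear, how surprising is a large number of linear results out of 42?
# The observed count is hypothetical, for illustration only.
from scipy import stats

base_rate = 3 / 16      # share of "control" results judged close to linear
n_outcomes = 42
print("expected linear-looking outcomes:", n_outcomes * base_rate)   # about 7.9

observed = 40           # hypothetical count, NOT taken from the report
print("P(at least this many by chance):",
      stats.binom.sf(observed - 1, n_outcomes, base_rate))
```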

      • question,

        My “criticism” wasn’t aimed at the substance of the linked PDF. It just seems a bit odd to me to write out such a large number in full. I first thought they did it for clarification because there might be some confusion between the British and American “trillion”, but then they could just have used scientific notation, as they did with the directly preceding number.

        But again, this is not serious, or nothing concerning substance, at all. It just made me giggle to see such a large number written out, as it is usually seen in popular outlets or even introductory textbooks to get across some sort of “feeling” just “how” big these numbers are – which is probably not an issue with the target audience here.

        • Martin,

          Sorry for the confusion. I gathered your intent. I believe the written out number was an example of rhetorical flair, which suggested to me there may be some problem with what they did to calculate it.

        • Ah, OK. Yes, I agree that looking for another explanation than my automatic assumption that they just wanted to awe their readership with a big number is probably a more intelligent approach…

  4. Eitam: “I find that the whistleblowers have turned into a bunch of bloodthirsty hunters that fail to show even a shred of doubt when at stake (with all due respect to science) is a person’s life’s work, integrity and livelihood”

    Jens Forster’s is not the only career at stake. The published record, including bogus findings, affects journal publication decisions, allocation of grants, promotions, etc.

    Whether by error or fabrication, I am sure Forster’s behavior has had any number of unintended consequences. Please spare a moment to recognize the silent victims.

    • It isn’t just random chance that so many of the recent scandals are coming from certain narrow areas. My suspicion is that some of these fields (e.g. priming) have systematic and endemic problems.

      So the spillover victims may not all be that innocent.

  5. That report AG cites keeps harping on the fact that the sex distribution of the sample (slightly more women) does not exactly match the sex distribution of psychology students (72% women). But who says that propensity to participate in experiments is orthogonal to gender? Why is this the default assumption?
    If the gender ratios were always 50/50 or nearly that, it would be one thing, but the fact that this is point #1 in all the additional comments in the report gives me pause.
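
For reference, the kind of comparison the report’s point implies is a simple binomial check against the 72% figure; the sketch below uses the Study 1 count (39 women of 60) quoted later in this thread. The test is only meaningful under exactly the assumption being questioned here, namely that volunteers are a random draw from the psychology-student pool.

```python
# Binomial check of an observed sex ratio against a 72%-female reference rate.
# Counts are the Study 1 figures quoted elsewhere in this thread; the test assumes
# participants are a random sample of the student population, which is the very
# assumption the comment above disputes.
from scipy import stats

n_women, n_total = 39, 60
result = stats.binomtest(n_women, n_total, p=0.72)
print(result.pvalue)
```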

    • Yes, that’s a weak argument. It’s possible that there were other experiments run that semester looking for volunteers that wanted entirely females, so the remaining pool was skewed. I don’t know why there should be any expectation of random sampling from this pool.

      If I take the most innocent possible interpretation, I see someone running a linear model on the data, saving the predicted values of the data, and then running subsequent analysis on the predicted values. This might occur if there were handoffs from one research assistant to the next, neither knowing the basics of describing what data is in what column.

      This shaves the issue with Hanlon’s razor: “Never attribute to malice that which is adequately explained by stupidity.”
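
A minimal sketch of that innocent interpretation: fit a straight line on the (-1, 0, 1) codes, hand the fitted values (rather than the raw ones) to the next stage, and the three cell means come out exactly evenly spaced. The numbers are simulated; nothing here is taken from the actual data or pipeline.

```python
# Sketch: analyzing fitted values instead of raw values yields perfectly linear cell means.
import numpy as np

rng = np.random.default_rng(1)
codes = np.repeat([-1, 0, 1], 20)                  # low / medium / high, n = 20 per cell
y = 5 + 0.8 * codes + rng.normal(0, 2, codes.size) # noisy data with a roughly linear truth

raw_means = [y[codes == c].mean() for c in (-1, 0, 1)]

slope, intercept = np.polyfit(codes, y, 1)         # fit a straight line on the codes
y_hat = intercept + slope * codes                  # the fitted ("predicted") values

fitted_means = [y_hat[codes == c].mean() for c in (-1, 0, 1)]
print(raw_means)      # roughly increasing, but not evenly spaced
print(fitted_means)   # the middle mean sits exactly halfway between the other two
```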

      • Yes, I think the authors realize that the graphs were SO absurdly contrived, that they had to go to extra lengths to show this was fraud, and not stupidity/programming incompetence. And the route they chose seems to be accusations that the experiments did not take place, hence their highlighting of weak innuendo about gender distributions. But Forster’s replies to accusations of not actually running the experiments are much stronger than his explanations for why the data look so crazy. But since he has nothing to say about the latter other than ‘lottery winners aren’t fraudsters’, I’d put my money on some dose of funny business mixed in with gross incompetence.

  6. Low = -1; Medium = 0; High = 1. I get that M should be between L/H in theory – but honestly, I can’t believe that M falls between L/H for ALL the papers they look at, including the controls. That seems really unlikely to me. I’m beginning to wonder if we should trust ANY of these papers, including the “controls.”

    The only thing I can think of is that somehow someone re-normalized L/M/H in some way that forces linearity (and/or forces M to be between L/H). Because I’m pretty sure that there is no sense in which L/M/H are all “the same distance apart” on any naturally occurring metric. The Forster results are “more” perfect than the other results, but they all still seem way too good to me, and it makes me wonder if we’re missing some part of the data manipulation in the analysis (original research papers or this current report).
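
One way to put a number on that intuition is a small simulation: with n = 20 per cell, how often does the medium-group mean land between the low and high means, and how often would that happen in 21 independent control results in a row (the count mentioned further down the thread)? The true means and SD below are invented for illustration.

```python
# Simulation: how often does the "medium" cell mean fall between "low" and "high"?
# n = 20 per cell; the true means and SD are illustrative, not taken from any paper.
import numpy as np

rng = np.random.default_rng(0)
n, sims = 20, 50_000

def prob_middle_between(true_means, sd=1.0):
    hits = 0
    for _ in range(sims):
        lo, md, hi = (rng.normal(m, sd, n).mean() for m in true_means)
        hits += (lo < md < hi) or (hi < md < lo)
    return hits / sims

p_null = prob_middle_between((0.0, 0.0, 0.0))   # no true effect: about 1/3
p_true = prob_middle_between((0.0, 0.5, 1.0))   # a real monotone effect: often, not always
print(p_null, p_true, p_true ** 21)             # chance of the pattern in all 21 results
```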

      • Question,

        That would make sense, but the place I see them discuss it is on page 3 where they write:

        “in which low, medium, and high levels of the independent variable are coded as (-1, 0, 1)”

        So that says to me that they are determining L/M/H according to the treatment, and not the outcome. Where is the part I missed?

        • I see what you are saying regarding the Forster papers, but what about the other ones? Just glancing at the abstracts, either they were doing what you say – ranking outcomes and then assigning L/M/H – in which case the “linearity” of those things is totally meaningless; or they are providing evidence that the entire field is bad (because they are too lucky to always get M between L and H) and focusing on the “worst offender”.

          I still think something is fishy about this analysis. Here’s a link to a few of the “control” papers, none of which seem to have obvious L/M/H treatment levels (even less obvious than “local”, “control”, “global”.)

          http://www.jstor.org/discover/10.1086/599047?uid=2&uid=4&sid=21104363964953

          http://www.ncbi.nlm.nih.gov/pubmed/21807953

          http://psycnet.apa.org/psycinfo/2006-05169-004

          http://www.researchgate.net/publication/251472609_The_effect_of_construal_level_on_predictions_of_task_duration

          Just to be clear – I’m not trying to say that I think this guy is being framed, I’m just saying that the analysis provided in that document doesn’t make tons of sense to me, and is at least somewhat misleading in the whole L/M/H thing and why anyone should believe there would be linearity, or what linearity could possibly mean in these cases. I don’t know if it is a data manipulation thing (your suggestion of stratifying on Y ex post), a normalization of some kind, a statistical artifact, proof of data-manipulation in the entire field (M always between L/H), or just some combination of luck (L<M<H), disciplinary insight (they were right so that's why L<M<H), and a cheating professor who used a lame linear model to generate the results they wanted – which is sorta what it looks like, even in the raw data on the published article where you can basically predict the third number by looking at any other two.

        • JRC,

          If I understand you correctly I think we are in agreement that something is odd about the analysis. If you force the three categories into low/medium/high, and then fit a line, it seems like the “data” is much more likely to appear linear. However, I admit that I did not look closely at what they actually did. Perhaps it took this into account.

        • Isn’t the key surprise that the response variable for M is always *exactly* halfway between L & H? To me that’s the crux.

          The ordering of L < M < H is to be expected I thought?

        • Rahul,

          Not in every single study. No way. Not unless they are choosing what constitutes L/M/H “treatment” by stratifying on the outcome. If that’s the case, I don’t know what the “placebo” or “comparison” tests are really telling us.

        • The analysis of the linear anomaly takes as its starting point that the true effect is linear (i.e., giving maximum benefit to Forster). If you do this either using Fisher’s method or by simulation methods the degree of linearity is too perfect in the data sets examined. This is discussed in more detail by Neuroskeptic. See:

          http://blogs.discovermagazine.com/neuroskeptic/2014/05/28/explaining-jens-data/

          A simpler approach is to just look at the reported F ratios (this isn’t a distinct anomaly, as the excess linearity naturally produces fairly similar F ratios if the SDs of the groups are fairly constant, given that n is constant). They are F(2,57) = 8.93, 9.15, 10.02, 9.85 and 9.52 respectively. This clustering around 9.5 or so is too tight to be plausible without something very fishy going on. The discussion at Neuroskeptic’s blog also considered QRPs, but there is no obvious mechanism for QRPs to produce the result (given that linearity isn’t selected for by p-hacking etc.). Thus the most plausible account so far is that one or more people adjusted the data in some way.
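
For readers who want to see the machinery, Fisher’s method combines k independent p-values via -2 * sum(log p_i), which is chi-square distributed with 2k degrees of freedom under the null. A minimal sketch with placeholder p-values (not the ones from the report or the papers):

```python
# Fisher's method for combining independent p-values (placeholder inputs only).
import numpy as np
from scipy import stats

p_values = np.array([0.04, 0.01, 0.03, 0.02, 0.05])        # hypothetical per-test p-values

chi2_stat = -2 * np.log(p_values).sum()                     # ~ chi-square with 2k df under H0
combined_p = stats.chi2.sf(chi2_stat, df=2 * p_values.size)
print(chi2_stat, combined_p)

# scipy has the same test built in, as a cross-check:
print(stats.combine_pvalues(p_values, method="fisher"))
```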

        • Thom,

          I get that there are many ways to show that those results are “too linear”, but I still don’t understand how anything can be linear in anything when the X-variable has no meaningful units. Furthermore, I don’t understand how, in apparently every single experiment in the placebo group, the effect of “medium level” treatment is always between the effect of “low” and “high” treatment. Sure, if experiments are careful and the theories are true, MOST M group effects should be between L and H group effects, but no way they all are (not with N =20 per cell).

          Furthermore, if this guy is cheating, then he had to think to himself: OK – I have my regular control, I’ll think of that as 0. Then I want a “good” treatment, so I’ll give that an effect of E. And so then I’ll do a “bad” treatment and give it – well, hell, why not -E? It just seems insane. There is no underlying continuous or discretized X variable (say treatment level) on which something *could be linear even in theory*. So there is no inherent symmetry in the experimental design – why would someone introduce fake symmetry?

          Is this idea of thinking of L as -1, M as 0, and H as 1 something inherent in the field, such that a poorly trained researcher would think to themselves that that would be a good way to fake their data? Otherwise, I don’t understand why anyone even went looking for linearity along this dimension.

          Ultimately, I just don’t understand the placebo test (due to why L<M<H for all) nor how someone thought "linear over domain of -1 to 1" was something to go looking for. I mean, other than because they looked at those results and thought "hmmm.. the middle effect is always halfway between the L/H effects, that's odd." In which case, the Placebo test teaches us nothing, and all we need to know is "there is this perfect linearity where none should exist, because there is no reason to think there even could be linearity along this dimension."

          I know, I know: tl;dr. Sorry – I should just let this go.

        • @jrc

          Here is how I picture your argument (caveat: not having read any of the underlying articles, tl;dr)

          We want to estimate the effect of X ➝ Y but cannot observe/manipulate X directly. Instead we manipulate it indirectly using stimuli S s.t. S ➝ X ➝ Y, and where S ➝ X ← U; U an error term. S only has an ordinal scale s.t. S in (L,M,H) where H > M > L.

          One simplistic way to make sense of the linearity is to assume X is cardinal and has three levels, say 1, 2, 3; and that P(X=1|S=L) = 1 and so on. Alternatively X is ordinal but also P(X=1|S=L) = 1, and so on. In both these cases there is a one-to-one correspondence. If we relax the 1-to-1 correspondence then we have a 2 x 2 table with the following implied associations:

          (X ordinal; 1-to-1) — linearity does not make sense, but strict monotonicity does;
          (X ordinal; not 1-to-1) — neither linearity nor strict monotonicity can be expected;
          (X cardinal; 1-to-1) — linearity and strict monotonicity make sense; linearity is possible but not necessary (presumably this is the test);
          (X cardinal; not 1-to-1) — neither linearity nor strict monotonicity can be expected;

          From this I think the replicator is giving Jens Forster the benefit of the doubt by making the strongest possible assumption: X is cardinal, has three levels, and we can manipulate it 1-to-1. But even under those assumptions the results are unlikely (in effect, the reported results imply a similarly strong assumption about the effect of X on Y). Relaxing the manipulation assumptions makes the results even more unlikely.

          At the same time I think you also have a point about the placebos, for strict monotonicity implies a 1-to-1 manipulation. Possible, but in my view unlikely.

        • PS In the above I assume the effect of X on Y is monotonic and, where it makes sense, linear. If we relax this then the implications of the 2 x 2 table need not follow and the analysis becomes too long for a blog.

  7. Watching all these guys’ careers crash and burn makes me think that we should be explicitly teaching our students not to stand for any one thing. If you end up in a position where you have to “take a stance” and defend a particular idea for the rest of your life, you will eventually succumb to the temptation of faking your result one way or another.

    It has always surprised me that so many researchers get through decades of research without ever, not once, finding evidence against one of their pet theories. This always makes me suspicious. How can anyone be right all the time?

  8. “In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.”

    Meehl, Paul E. (1967). “Theory-Testing in Psychology and Physics: A Methodological Paradox”. Philosophy of Science 34 (2): 103–115. doi:10.1086/288135
    http://mres.gmu.edu/pmwiki/uploads/Main/Meehl1967.pdf

    • [My comment is only about experimentally driven work in linguistics and psychology; I have no knowledge of how things work in the hard sciences]

      So what does this entail? It seems to me to be a near certainty that anyone with a clear scientific “vision” and consistent/unequivocal support for their position over decades of work is engaged in faking stuff at some level.

      Also (perhaps more importantly?), if you are going to do it right and not make the mistakes Meehl points to, you will probably not get the kind of funding that Jens Foerster got (a 5 million Euro award from the Humboldt Foundation in Germany—which I think they have put on hold since), because your career will look like a hodge-podge of results that sometimes pan out and sometimes don’t, and are sometimes (oftentimes?) simply wrong.

      If someone is ambitious and aiming for these high-profile awards, is it a logical consequence that one has to engage in faking stuff at some level? I should mention that there is no danger of my getting a Humboldt, since I already live and work in Germany and am therefore worthless in the eyes of German funding agencies ;)

      • My assessment is that intelligent people with integrity are being driven out of “science”. This has been going on since the 1940s. It doesn’t take much conspiracy to make it happen, just over-fund the research and the incompetent will take care of the rest. This idea is similar to the Laffer curve. It may be for the best, IDK.

  9. For me this blogpost by Neuroskeptic made a lot of sense:
    http://blogs.discovermagazine.com/neuroskeptic/2014/05/28/explaining-jens-data/#.U6rPMo2Swah

    It concludes that fabrication of the data is the most likely scenario and explains how data fabrication can cause the superlinearity as a by-product. It also says that an independent expert checked the data, which would (I assume) rule out the possibility of the raw data not actually being raw data.
    Still, this does not mean Forster was the one who fabricated the data.

    • That blog post doesn’t exactly show, in any convincing way, that a human typing in random numbers would get such extreme linearity, and that this would somehow not show up in subgroups. It is certainly a possibility, but that is a lot of numbers to type in.

      • It seems unlikely that people would type in all the numbers. (And even if they did, humans can’t generate random number sequences very easily.) All the studies have the same n with no missing data. The most likely way to adjust this is to take one or more real data sets and tweak them. For example, a real data set with by chance linear-ish results might be cut and pasted and then tweaked to have slightly different means. Alternatively a few real subjects could be cut and pasted multiple times and then tweaked to look slightly different. It is quite a bit of work but not actually much more time consuming than data entry for a few hundred questionnaires. A more sophisticated faker could also automate some parts of the process.

        If such a cut, paste and tweak method were used it might explain the linearity if the original data being cut and pasted were (accidentally) fairly linear.

        • “All the studies have the same n with no missing data”

          Why would you expect a study like this to contain missing data? It seems pretty straightforward to get the sample size you want:

          “For each of the 10 main studies, 60 different undergraduate students (number of females in the studies: Study 1: 39; Study 2: 30; Study 3: 29; Study 4: 26; Study 5: 38; Study 6: 32; Study 7: 30; Study 8: 30; Study 9a: 35; and Study 10a: 28) were recruited for a 1-hour experimental session including “diverse psychological tasks.” In Studies 9b (31 females) and 10b (25 females), 45 undergraduates took part. Gender had no effects. Participants were paid 7 Euros or received course credit.”

        • The point about equal n was merely that if you have data from 60 participants it is trivial to paste and edit 60 rows of a data sheet rather than create new data one observation at a time.

          In terms of missing data, it depends precisely how the data were collected, but it isn’t unusual to have missing responses if questionnaires are used or if there is drop-out (but I didn’t mean to imply it was particularly important here). Apparently the data were collected by numerous volunteers so I wouldn’t be surprised if they sometimes ended up with unequal n under these circumstances.

        • @b, I agree that the blog doesn’t provide convincing evidence that all numbers were typed in manually, but I did find the arguments against other possible scenarios convincing, which leaves fabrication as the most likely scenario.

          @Thom, the fact that humans are not very good random number generators is in the blog actually used as possible explanation for the superlinearity. However, I agree with you that it is more likely that existing data sets were tweaked than that all numbers were typed in manually.

        • That, “humans are not very good random number generators” is a great point. There’s an online Rock Paper Scissors contest & the best engines consistently defeat a human opponent even if he tries to play “randomly”, a strategy that, in theory, can never lose, on average in the long run.

        • “the fact that humans are not very good random number generators is in the blog actually used as possible explanation for the superlinearity” – I think we’d have to reject that based on what is known about random generation tasks. If people do it quickly or under cognitive load they have lots of repeats and close numbers (too many), but typically deliberate random number generation produces numbers that are too evenly spread (because people monitor for “non-random” patterns and suppress them).

        • I did not know that. It does make me wonder, if someone would deliberately tweak an existing data set, wouldn’t he or she try to monitor the resulting data for non-random patterns and be able to suppress those patterns in that case as well?

  10. This is a reply to jrc above.

    The analysis of Forster’s data is really a variant on the Simonsohn-style anomaly detection. The idea is always to look for properties that real data from the context in question have that are not explicitly being selected for by p-hacking (e.g., SDs that are too similar across groups). As you note, the linearity here is a label for a pattern where the control group lies more or less exactly between the other groups, and the label is a bit confusing. (I presume they borrow the label from a linear contrast in ANOVA where the groups are coded -1, 0 and 1 to detect a linear trend in means.)

    The puzzling thing is that it is hard to imagine a questionable research process that selects for this pattern, and it is also not obvious what procedure would be used to fake data with these properties (as opposed to faking data with SDs that are too similar).

    As to why the original complainants did this particular analysis, I am not sure. I did not notice the pattern straight off. I am guessing that they were alerted by the five very similar F ratios – which did strike me as odd when I first looked at the paper (a highly unusual pattern if you are familiar with these statistics). On the other hand when you re-order the conditions and plot them as a line plot (as on the Mayo and the Neuroskeptic blog) the “linearity” does leap out at you.
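
For readers unfamiliar with the term, the (-1, 0, 1) linear contrast is the standard one-way ANOVA trend test on three group means. A minimal sketch on simulated data (nothing here comes from the papers in question):

```python
# Linear contrast (-1, 0, 1) across three groups in a one-way ANOVA, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
groups = [rng.normal(loc, 1.0, 20) for loc in (4.0, 5.0, 6.0)]   # low, medium, high
weights = np.array([-1.0, 0.0, 1.0])                             # linear contrast weights

means = np.array([g.mean() for g in groups])
sizes = np.array([g.size for g in groups])
df_error = sizes.sum() - len(groups)
mse = sum((g.size - 1) * g.var(ddof=1) for g in groups) / df_error  # pooled within-group variance

contrast = weights @ means
se = np.sqrt(mse * np.sum(weights**2 / sizes))
t = contrast / se
p = 2 * stats.t.sf(abs(t), df_error)
print(f"contrast = {contrast:.2f}, t({df_error}) = {t:.2f}, p = {p:.4f}")
```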

    • @Fernando and @Thom,

      @Fernando: I agree with everything you write above where there is no more room to comment, and that was a really helpful formulation of the problem. But I want to push on why I think that either the placebo test is bad, or there is something else going on.

      @BothOfYou

      A little nomenclature: Consider some stimuli called S+, S-, So.
      These stimuli induce “treatments” of T+, T-, To (@F’s “X”). These treatments have effects of size Beta+, Beta-, Beta0 on outcome Y. Points in order of importance (to me):

      First – We agree on the authors’ metaphysical/meta-statistical perspective: they are assuming T+ = 1, To = 0, T- = -1. And we agree that, in reality, the treatments are at best ordinal (if that). The authors’ entire exercise seems convincing because -1, 0, and 1 seem like natural numbers to use. But, given our agreements above, we can always pick *some* x-scaling such that three points appear to be on a line – the fact that it is -1, 0, and 1 for that paper is weird (the kind of weird that statistical bugs tend to generate, or that really, really lousy cheaters might generate). I could see someone accidentally scaling the x-axis such that it generated a perfect line through the three observations. This could have been in the review article, and that’s why I wondered about “normalization” earlier – but it doesn’t seem like that’s the case. Or it could also be a problem from the original analysis, say by regressing on a linear trend instead of on dummy variables (“reg Y T” v. “reg Y i.T”) and graphing the predicted values. That could happen if this whole -1, 0, 1 thing were a real part of people’s thinking in this field and someone’s (repeatedly re-used) source code caught a bug at some point.

      Second – it is not clear to me that the authors actually look at which arm is S+, S-, and So in the original papers, and by consequence don’t know T. I think they only look at Y and group assignment (without knowing the +/- part). So I think what they are doing is, experiment by experiment (within studies), choosing S+ as the highest Y group, and S- as the lowest, and redefining as they go. If they did in fact know S and they are representing cuts on S in their graphs, I think we could reasonably wonder if all of the results shown for all of the studies are “too linear”. Assuming they didn’t, why not? It makes their case stronger if they use S instead of Y, because using Y makes the placebo group itself “too linear”. But the fact that they seem to stratify on Y while representing as though the cut were on S/T gives me more pause than it seems to give other people here.

      I think the contribution of this paper is mostly to be like: Hey, look here, these results are totally linear in a totally weird way that doesn’t make any sense. But after that the whole metaphysics of the paper just sort of falls apart.

      • I’m not sure quite what you are referring to by “the paper”. Foerster’s paper doesn’t rely on linearity at all. The original complaint (it isn’t a paper) uses linearity as a label for the situation where the means of the conditions are evenly spaced. I can sort of see your argument about the scaling of x, but the -1, 0, 1 coding is the obvious coding to use for this situation given that it is the default for a linear contrast in ANOVA (anything else would be a bizarre choice based on the background of the authors). On reflection, one aspect of the pattern that might be appealing in this instance is that it maximizes the chance of significance of the two crucial pairwise tests (local vs. control and global vs. control). These tests weren’t reported (but are implicit in the plots given the size of the SEs, and are often reported in similar studies or requested by reviewers).

        I’m not sure I understand your second point. The complainants and subsequent analyses by others use the ordering of the predictions in the original study (which are also the observed orderings of means): “Global processing increased category breadth and creative relative to analytic performance, whereas for local processing the opposite was true.” Are you saying that they could derive the optimum spacing and use that? I think that might be better statistically, but would be harder to explain and justify to a non-statistical audience. They already lost people who assumed that their analysis was biased against Foerster by assuming that the true means were evenly spaced.

        • Thom,

          First – All of my comments are just about the report linked by Andrew (and pasted below), sorry about the confusion:

          http://retractionwatch.files.wordpress.com/2014/04/report_foerster.pdf

          And mostly I’m referring to the “placebo” test part, where they show us what is supposed to be a counter-factual: “here is how ‘linear’ these means ‘should’ be.” That exercise is supposed to give us some empirical contrast to the patterns found in Forester. I’m arguing it provides a bad counter-factual.

          It provides a bad counter-factual for one of two reasons: 1) for the non-Forester papers, they choose what experimental assignment is L/M/H based on ranking the outcome means in each experiment. If this is the case, then -1, 0, 1 might be different for two experiments, even if the experiments had the same “treatment” or very similar “experimental manipulations.” It just isn’t showing us what results really “should” look like.

          Or, 2) They ranked L/M/H by some “a priori” criteria in the original articles (as they did for Forester). In this case, the entire field appears to me to be far, far too linear. There is no reason that 21 “independent samples from control papers” should all get results where the mean of the “medium treatment” group falls between the means for the “low treatment” and “high treatment” group. Not with a cell size of 20.

          Regarding your point 2 (which I basically concede – they do use Forester’s “a priori” treatment group classifications), I do not want them to manipulate the x-axis by scaling. I want them to think about how useful the placebo test is given the fact that it doesn’t seem to make any sense at all as a compelling counter-factual.

          Finally – Did Forester use a “linear contrast ANOVA” to do his analysis? If so, doesn’t that seem like a place to look for the problem, at least to rule it out? If he had it coded as -1,0,1, and made a coding error, is there any way it could have produced his tables? I’m not trying to be a conspiracy theorist – maybe I’m in the minority of people here, but I believe that quantitative researchers at good universities should probably be better at faking analyses or data (even those who would do so for evil – I mean, a lot of us fake data as part of our legitimate research (simulations)).

          If he’s guilty of inventing data, he’s a blight on the profession and he’s also an almost unbelievably ignorant quantitative researcher. If he’s guilty of a coding error, he’s incredibly sloppy, but not a fraud. If he used linear projections knowingly pretending they were real means because they made his results “look a little better” – then he’s somewhere in between.

        • Interesting point about the control papers – it is possible they are a biased subset. Foerster did a regular ANOVA (with 2 df in the numerator of the F ratio). These F ratios are very similar (all around 9 or 10) – though that is confounded with the linearity issue. In papers I normally read the F ratios jump around much more. I don’t see how this could conceivably be a coding error.

          Faking and simulating is very different (except in very simple cases). If I collect data with more than three variables I can’t even begin to accurately guess the structure of the covariance matrix (and three would be hard). It becomes easier if you have the real data, but then you don’t need to fake it. Plus I don’t think that the fraudsters are detail-oriented people who bothered to learn statistical modeling in UG and PG classes – they probably faked their way through. Responses by Smeesters et al. to queries about their data suggest this.
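
On the F-ratio point, a quick simulation shows how much F(2, 57) normally bounces around across exact replications of the same three-group design, which is why five reported values all sitting near 9-10 look so strange. The effect size below is invented, chosen only so the typical F lands roughly in that neighborhood.

```python
# How variable is F(2, 57) across exact replications of the same three-group study?
# True means and SD are invented so the average F is roughly in the 9-10 range.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_means, sd, n = (0.0, 0.65, 1.30), 1.0, 20

fs = []
for _ in range(10_000):
    samples = [rng.normal(m, sd, n) for m in true_means]
    f, _ = stats.f_oneway(*samples)
    fs.append(f)

print(np.percentile(fs, [10, 50, 90]))   # a wide spread, not a tight cluster near one value
```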

  11. Sara: “I did not know that. It does make me wonder, if someone would deliberately tweak an existing data set, wouldn’t he or she try to monitor the resulting data for non-random patterns and be able to suppress those patterns in that case as well?”

    I don’t think we know enough about data faking. Anecdotally, a common method among project students is to run oneself multiple times (reported by a senior colleague at a previous institution). This always struck me as rather hard work (why not just collect the data from real participants?). I think the only thing we know for sure is that real data have complex microstructure that is hard to fake (in the absence of the real data) except in very simple designs.

  12. New Stuff – a report looking at “super linearity” in more Jens Forster articles: https://drive.google.com/file/d/0B5Lm6NdvGIQbamlhVlpESmQwZTA/view

    I maintain that coding “low” as “-1”, “medium” as “0”, and “high” as “1” makes no sense at all, and any “linearity” in that is a bizarre metric to analyze. Without getting into what is/isn’t real in Forster’s work, let me just plead with empirical researchers to never, ever code what should be 3 dummy variables for group membership into -1, 0 and 1. It makes no sense at all and just leaves you open to accidentally displaying “predicted” values treating L/M/H as a continuous variable instead of placeholders for three separate groups.

    I also maintain that since “low”, “medium” and “high” treatments ALWAYS correspond with increasing measures of Y, either the entire field should be suspect or the file drawer problem is bigger than anyone ever expected. Because that seems to happen in all the “control” papers too from the previous round of Forster-investigating, and probability, variation, and the real world never yield anything so tidy.
