
Jeremy Freese was ahead of the curve

Here’s sociologist Jeremy Freese writing, back in 2008:

Key findings in quantitative social science are often interaction effects in which the estimated “effect” of a continuous variable on an outcome for one group is found to differ from the estimated effect for another group. An example I use when teaching is that the relationship between high school test scores and earnings is stronger for men than for women. Interaction effects are notorious for being much easier to publish than to replicate, partly because it is easy for researchers to forget (?) how they tested many dozens of possible interactions before finding one that is statistically significant and can be presented as though it was hypothesized by the researchers all along.

Various things ought to heighten suspicion that a statistically significant interaction effect has a strong likelihood of not being “real.” Results that imply a plot like the one above practically scream “THIS RESULT WILL NOT REPLICATE.” There are so many ways of dividing a sample into subgroups, and there are so many variables in a typical dataset that have low correlation with an outcome, that it is inevitable that there will be all kinds of little pockets for high correlation for some subgroup just by chance.

Examples of such findings in the published literature are left as an exercise for the reader.

Interesting to see this awareness expressed so clearly way back when, at the very beginning of what we now call the replication crisis in quantitative research. I noticed Freese’s post when it appeared but at the time I didn’t fully recognize the importance of his points.
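To put a rough number on the "many dozens of possible interactions" problem, here is a minimal sketch. It assumes every interaction tested is truly null and that the tests are independent, which real subgroup analyses are not; the function name and the 0.05 threshold are illustrative choices, not anything from Freese's post.

```python
def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent tests of true nulls
    comes out 'significant' at level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Hypothetical numbers of interactions a researcher might try.
    for k in (1, 12, 24, 48):
        print(f"{k:>2} tests: {prob_at_least_one_false_positive(k):.3f}")
```

Under these assumptions, a dozen tries already gives close to even odds of at least one spurious "finding," and four dozen makes it very likely.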


  1. Dean Eckles says:

    It is too bad Freese has seemingly stopped blogging.

    Theorizing about non-crossover (i.e., eliminable) interactions is often premature. Sometimes these interaction effects are just best explained by, say, the model being multiplicative. Cf.

  2. Jonathan says:

    Think about the words ‘interaction effects’ and how this has become the larger, even less valid concept of ‘intersectionality’. By this I mean the idea that you can pick a connection, an interaction or intersection, and have that stand as a true portrait of complexity is absurd. In both cases, they take a point – that it may be disputed or even false is not necessarily material – and then they do the scientifically and logically unpardonable: they take one interpretation that fits one entrance or path to this point and extrapolate that it applies to the other roads in and out of the point. This kind of idea becomes a means by which you assert your state of belief as applying across an interaction or intersection, thus allowing you to characterize way beyond the intersection. In many hands, it becomes a weird sort of reversal of causality: we believe this and we see this intersection, so therefore what’s down this other road coming at us is actually causing the effects we believe exist. It’s not mathematically sound. It isn’t logically sound. Imagine you come to a street crossing your street. You may know the intersection so you know what’s down each street. But that’s actual knowledge. These methods of interaction and intersectionality say I have a set of beliefs about my street and at this intersection I’ll impose that set of beliefs to tell you what’s down the other street, though I have no idea what’s actually there. It’s like you come to a road crossing in the dark and the roads disappear quickly into darkness. You could say I have no idea where to go or you could say I hereby determine that my reading at this point tells me accurately what the meaning of this road is down its length. You don’t know that it ends around the bend. Or that in the morning it’s a major truck route. But you believe you know and that’s what matters to you. 
    Or like in My Cousin Vinny when he asks if the train comes through often and the answer is no but of course it comes through in the middle of the night, which is the only thing important to sleep. Or in Rocket Boys when the kids are pulling up rails to sell to buy better steel and they see a train coming and they run to warn it only to find they’re on an unused spur and the train isn’t coming at them at all. The intersection or interaction in both cases was totally erroneous in the crucial ways.

    • Martha (Smith) says:

      This sounds pretty fuzzy to me.

    • Elin says:

      Intersectionality is different than interaction, even though I know why on the surface they seem similar. The underlying idea is that, at least in the US, being a Black woman or being a White man is associated with experiences and outcomes that are distinct from those associated with a specific gender or race alone (and similarly for all the other groups). They are so entwined from the day we are born until the day we die that they cannot be separated. So it is the combinations that should be at the forefront and the single variables are the secondary ones. From this perspective the thing to do is to always start with the combination. Of course among scholars there are serious arguments about this, but basically none of them are interested in the kind of causal analysis you are discussing. However it has nothing to do with streets intersecting.

      • Martha (Smith) says:

        (However, I think there is a terminological/etymological connection with streets intersecting — the connection being with Venn diagrams, and the set-theoretic meaning of intersection.)

  3. Thomas says:

    Agreed with Dean: interactions are scale-dependent. In epidemiology it is quite common to find interactions that go in opposite directions on an absolute scale (e.g., incidence rates of heart disease increase more with age in men than in women, in cases per population*time) and on a relative scale (relative risks of heart disease for older age are lower in men than in women). It only means that the data don’t fit either model.
    For a worked example, years ago I wrote a comment about the risk of bladder cancer associated with smoking in women and men, here: .
    The relative risk of smoking was greater in women than in men (positive interaction), the absolute attributable risk was lower in women than in men (negative interaction), and with a power transformation of 0.12 the interaction was eliminated and the model fit.
    With any data, one could play with power transformations until something significant occurred, and then misinterpret it as a biological or social phenomenon.
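A minimal sketch of how such an interaction-eliminating power can be hunted down. It does not use the bladder cancer data (so it will not reproduce the 0.12 above); the four cell rates are the made-up heart disease numbers given further down this thread, and the Box-Cox form of the power transformation, the function names, and the bisection search are all illustrative assumptions.

```python
# Made-up rates per 100 person-years (from the heart disease illustration
# later in this thread), not real data.
RATES = {"young_women": 1.0, "young_men": 3.0, "old_women": 4.0, "old_men": 8.0}


def interaction(cells, power):
    """2x2 interaction contrast after a Box-Cox transform (y**power - 1) / power.
    power = 1 is the additive scale; power near 0 approaches the log scale."""
    def g(y):
        return (y ** power - 1.0) / power
    return (g(cells["old_men"]) - g(cells["young_men"])
            - g(cells["old_women"]) + g(cells["young_women"]))


def eliminating_power(cells, lo=0.01, hi=1.0, tol=1e-9):
    """Bisection search for a power at which the interaction contrast is zero."""
    f_lo = interaction(cells, lo)
    if f_lo * interaction(cells, hi) > 0:
        raise ValueError("interaction has the same sign at both ends of the range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = interaction(cells, mid)
        if f_lo * f_mid <= 0:
            hi = mid          # sign change is in the lower half
        else:
            lo, f_lo = mid, f_mid  # sign change is in the upper half
    return (lo + hi) / 2.0


if __name__ == "__main__":
    p = eliminating_power(RATES)
    print("additive-scale interaction:", interaction(RATES, 1.0))
    print("near-log-scale interaction:", interaction(RATES, 0.01))
    print("power that removes the interaction:", p)
```

The interaction is positive at one end of the range and negative at the other, so some power in between makes it vanish. That power is a property of the four numbers, not evidence of any mechanism.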

    • I’m not entirely clear on your heart disease example, but isn’t that actually just the expected consequences of survivorship bias? Heart disease kills men earlier, so if you look at older people, it’s the women who have heart disease?

      • Thomas says:

        It’s not survivorship, these are rates of disease among survivors.
        Say among the young (the numbers are made up):
        1 per 100 person-years in women
        3 per 100 person-years in men

        among the elderly (made up):
        4 per 100 person-years in women
        8 per 100 person-years in men

        On an absolute scale, the effect of being elderly is
        +3 per 100 person-years in women
        +5 per 100 person-years in men

        so age has a stronger effect in men, right?

        On a relative scale, being elderly
        Quadruples the risk among women (4/1)
        Less than triples the risk among men (8/3)

        so age has a stronger effect in women, right?

        It depends on the scale.

        • Thomas says:

          …esprit d’escalier

          it works the other way too.

          Among the young, male sex adds 2 per 100 person-years
          Among the elderly, male sex adds 4 per 100 person-years
          So the effect of male sex is stronger in the elderly.

          Among the young, male sex triples the risk of heart disease (3/1)
          Among the elderly, male sex doubles the risk of heart disease (8/4)
          So the effect of male sex is weaker in the elderly.
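The arithmetic in the two comments above can be checked with a short sketch. The rates are the made-up numbers from the comment (per 100 person-years); the variable names are illustrative.

```python
# Made-up rates per 100 person-years, copied from the comment above.
RATES = {
    ("women", "young"): 1, ("men", "young"): 3,
    ("women", "elderly"): 4, ("men", "elderly"): 8,
}

# Effect of being elderly, within each sex.
age_diff = {s: RATES[(s, "elderly")] - RATES[(s, "young")] for s in ("women", "men")}
age_ratio = {s: RATES[(s, "elderly")] / RATES[(s, "young")] for s in ("women", "men")}

# Effect of male sex, within each age group.
sex_diff = {a: RATES[("men", a)] - RATES[("women", a)] for a in ("young", "elderly")}
sex_ratio = {a: RATES[("men", a)] / RATES[("women", a)] for a in ("young", "elderly")}

if __name__ == "__main__":
    print("age effect, absolute:", age_diff)    # larger in men
    print("age effect, relative:", age_ratio)   # larger in women
    print("sex effect, absolute:", sex_diff)    # larger in the elderly
    print("sex effect, relative:", sex_ratio)   # larger in the young
```

Same four numbers, and the direction of the "interaction" flips depending on whether differences or ratios are compared.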

        • “It’s not survivorship, these are rates of disease among survivors.”

          Exactly, so when you look at survivors, they’re more likely to have … survived. If they survived, then they didn’t die of heart disease earlier, so having survived to say age 70 means automatically that you’re less likely to have heart disease than the average person born 70 years ago (many of whom may have died at age say 50 or 60 or 65 of heart disease).

          If instead of calculating “among men (surviving today)” you calculated “among men born 70 years ago living or dead” your mysterious confusion of numbers might well disappear.

          • Or put another way, which is maybe more relevant and might be related to your point: heart disease is a progressive time-dependent survivorship process, treating it as a two-by-two cross-tab (women/men x young/elderly) isn’t even approximately a good model and so the examples you’re giving are more about how poorly people, especially commonly in medicine, think and analyze data rather than any kind of truly confusing facts about the world.

            • Anoneuoid says:

              Yea shouldn’t it be more like disease (d) is some function (f) of age (a)? Then you draw a curve of a vs d and see what known functions it looks like, then you try to figure out how to derive one of those functions or a new one of similar shape in a way that makes theoretical sense? Then once you have that you see what else you can derive from the same assumptions and collect data to check that?

          • Thomas says:

            The elderly still have higher rates of disease, despite having survived (and that is the reality).
            But age/survivorship isn’t the point. Replace “elderly” by “smoker” and see if it makes sense.
            The point is that you can have an interaction in one direction, or in the other, using the same data, depending on the model used for the analysis.

    • Martha (Smith) says:

      The interpretation/modeling of “interaction” via a cross-term in a linear model has inherent problems — since one can sometimes “eliminate” the interaction by working log scale, while working log scale may not best represent what is of interest in the study (i.e., the intuitive idea of “interaction” in context). The area is really problematical in deciding on an appropriate model to study what is scientifically/practically of interest/importance.

      • Jens Åström says:


        I think this is relevant to ecology, where you often model both abundance (population size) and species richness (number of distinct species) of the same biological community. One or the other of these models might be additive, multiplicative, non-transformed or log-transformed, depending on the researcher’s tradition and the particular sample. I wonder if the two variables might show different interaction effects, where the difference is just an artifact of the choice of models or pre-modelling data treatment. Such a discrepancy and the found interactions may be a major point of a paper.

        I guess your point is related to the comment by Dean Eckles above.

        • Martha (Smith) says:

          “One or the other of these models might be additive, multiplicative, non-transformed or log-transformed, depending on the researcher’s tradition and the particular sample.”

          I would hope that decisions on whether or not to use an additive or multiplicative model (or non-transformed or log-transformed) are made on the basis of careful consideration of the variables involved, and how they arise in nature, rather than on “the researcher’s tradition and the particular sample.”

  4. Zad Chow says:

    This contrasts nicely with this recent post about the humanities,

    “The humanities should take responsibility for quality in the same way the sciences do, argue Rik Peels and Lex Bouter, through the pursuit and institutionalization of replicability. We disagree: quality criteria are crucially different in the humanities and the sciences.

    The humanities pursue meaning beyond truth. Confirming that Van Gogh painted Sunset at Montmajour (truth) is only the beginning. Unearthing the cultural meaning of the work requires historical context and theorizing on its message, style, aesthetics — and what the work can tell us about the artist and his world (view). The coexistence of multiple valid answers and the value of their interaction disqualify replication as a viable quality criterion.”

    • Martha (Smith) says:

      I’ve never understood phrases like “the cultural meaning of the work”. I can see a concept of what something “means” to an individual, but “cultural meaning” mystifies me (although I think I can see how an event or custom might have “meaning” in a culture).

      • Guive says:

        I think it’s pretty similar to the cultural meaning of an event or custom. A big part of humanities scholarship is recovering context for cultural products from the distant past that is not obvious to people anymore but would have been obvious to the author or the intended audience. For example, when Rousseau talked about “the general will,” he was referencing a concept that was previously understood to be theological, the idea that the general will of god expressed in natural laws differed from his particular will expressed through miracles. Rousseau made this a political concept, but clearly he was intentionally using the theological resonances. This should change how we read Rousseau, but nobody today uses the phrase general will in a theological sense, so some historian had to go dig up this reference for us to understand the cultural meaning of Rousseau properly.

        About Zad’s quote:
        wouldn’t the fact that humanities scholarship deals mainly with unrepeatable events that have already happened “disqualify replication as a viable quality criterion”? You can’t re-run WWI with perturbed conditions to see if your theories about it still hold, so it is unclear to me what replication would even mean in a humanities context.

  5. Elin says:

    I went to sociology graduate school way, way earlier than that and caution about interactions and not cherry picking dummy variables is something that was really emphasized at least in my research group, although maybe not with the same vocabulary. This is an even older article that is relevant: Zeisel, Hans. “Disagreement over the Evaluation of a Controlled Experiment.” American Journal of Sociology 88, no. 2 (1982): 378-89. I do think that the usage of the term “reproducible” (and variations) is newer.

    • Andrew says:


      I remember the name Zeisel because he had an intro statistics book that had an interesting example, so interesting that I followed the references and read the original article, and it turned out that Zeisel had completely bungled the description of the study: the data looked nothing like what Zeisel had graphed. So this makes me distrust anything he writes, in that he really didn’t seem to care about the details on an example he bothered to include in his book.
