“Beyond Heterogeneity of Effect Sizes”

Piers Steel writes:

One of the primary benefits of meta-analytic syntheses of research findings is that researchers are provided with an estimate of the heterogeneity of effect sizes. . . . Low values for this estimate are typically interpreted as indicating that the strength of an effect generalizes across situations . . .

Some have argued that many of the relationships studied in I-O [industrial-organizational] psychology are characterized by little or no residual heterogeneity. . . . Schmidt writes “The evidence suggests that moderators are often solipsistic: they exist in the minds of researchers but not in real populations” . . . On the other hand, as Tett, Hundley, and Christiansen’s article emphasizes, there are sound reasons for suspecting that many relationships studied in I-O psychology are, in fact, characterized by non-trivial amounts of residual effect size heterogeneity. . . .

Generalizability is context dependent . . . you can comfortably generalize when extending the findings to the range of samples, settings and measures comprised in the meta-analysis itself. However, as one tries to extend these findings further, to dissimilar measures or participants, confidence in generalizability should wane. . . . Meta-analysis perpetually struggles with the issue of external validity . . . most studies in management science don’t have a clear description of what the population of interest is . . .

To more adeptly and precisely argue for generalizability, it would help if we accepted that moderators and heterogeneity are the rule, not the exception. . . .

Steel continues:

It is important to reflect what an incredibly bold statement it is to profess perfect generalizability. It is essentially putting a social science relationship on par with the speed of light, as both would be considered constants of the universe. And yet we do this often and with a straight face.

We often see researchers playing both sides of the street on this one. When selling their results, they downplay variation and they imply or even explicitly state that their claims generalize broadly. But then, when these results fail to replicate, there’s a tendency to reach for heterogeneity as an explanation: the power pose replication failed because of a change in the length of time that people were standing; the ovulation and clothing replication failed because the weather was different; etc. Psychology researcher John Bargh illustrated this principle in an even starker way by framing failed replications as successes because they revealed interesting interactions.

Don’t get me wrong. I agree with Steel that interactions are important and that generality is a key concern. I just wish that researchers would recognize this right away, rather than using it as an alibi for failure.

Back to Steel:

Once we accept the inherent complexity of the world, that findings are not hermetically sealed from outside influences, our science can progress. We encourage researchers to be more thorough in their exploration of potential moderating variables and to properly contextualize their studies. Too often, when looking for potential moderators, we are limited to common demographic variables, such as sex or age, as these are the only characteristics that are recorded. If we could describe our studies more completely, what strides we could make in understanding variation. At a minimum, we should be commonly describing jobs with ONET codes and industries with SEC codes. . . .

I like that he got specific there.

I would add just one thing to Steel’s article, and it’s an important thing. Interactions are important, and they’re real, but they’re hard to estimate.

Here’s a simple calculation demonstrating the difficulty of estimating interactions:

Suppose for simplicity you’re studying a binary outcome with a between-person experiment with 400 participants, 200 getting the treatment and 200 getting the control. And also assume that the frequency of yes and no responses is close to 1/2. Then the standard error of the simple treatment effect estimate is simply sqrt(0.5^2/200 + 0.5^2/200) = 0.05. The experiment allows you to estimate the treatment effect to within an accuracy of about 0.05.

Now suppose you’re looking at an interaction between treatment effect and job category, and each of your two groups has 100 people in supervisory roles and 100 in non-supervisory roles. Then the standard error of the interaction estimate is sqrt(0.5^2/100 + 0.5^2/100 + 0.5^2/100 + 0.5^2/100) = 0.10.

That’s right—the standard error for the interaction is twice that for the main effect.
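Here's a minimal Python sketch of the two calculations above, in case you want to plug in other sample sizes. The function name `se_of_contrast` is my own invention for illustration, and it assumes independent cells of equal size, each with the same response proportion p:

```python
import math

def se_of_contrast(p, n_per_cell, n_cells):
    """Standard error of a sum or difference of `n_cells` independent
    sample proportions, each based on `n_per_cell` observations with
    true proportion `p`. Each cell contributes variance p*(1-p)/n."""
    return math.sqrt(n_cells * p * (1 - p) / n_per_cell)

# Main effect: treatment mean minus control mean, 200 people per arm.
se_main = se_of_contrast(p=0.5, n_per_cell=200, n_cells=2)

# Interaction: (treated - control among supervisors) minus
# (treated - control among non-supervisors), 100 people per cell.
se_interaction = se_of_contrast(p=0.5, n_per_cell=100, n_cells=4)

print(f"SE of main effect:  {se_main:.3f}")
print(f"SE of interaction:  {se_interaction:.3f}")
print(f"ratio:              {se_interaction / se_main:.1f}")
```

The factor of 2 comes from two places: the interaction contrast involves four cell means rather than two, and each of those cells has half as many people.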

Now suppose that your study has been powered to just be able to estimate the main effect, and suppose that the interaction, while important, is not quite as large as the main effect. Then you’re in big trouble. Really the only way to estimate this interaction from data is if you have a lot of data. This connects to Steel’s theme of meta-analysis.
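To put a rough number on "big trouble," here is a sketch of the power calculation under some assumptions that are mine, not part of the example above: the study has 80% power for the main effect (a true effect of about 2.8 standard errors, using a two-sided test at the 5% level and a normal approximation), and the true interaction is half the size of the main effect. The function `power_two_sided` is a hypothetical helper written for this illustration:

```python
from statistics import NormalDist

def power_two_sided(true_effect, se, alpha=0.05):
    """Probability that a normally distributed estimate with the given
    standard error is statistically significant in a two-sided test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    # Significant in the right direction, plus (rarely) in the wrong one.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

se_main, se_interaction = 0.05, 0.10   # from the calculation in the post

# Assumption: the study was designed for about 80% power on the main effect.
main_effect = 2.8 * se_main
# Assumption: the interaction is half as large as the main effect.
interaction = 0.5 * main_effect

power_main = power_two_sided(main_effect, se_main)
power_interaction = power_two_sided(interaction, se_interaction)

print(f"power for main effect: {power_main:.2f}")
print(f"power for interaction: {power_interaction:.2f}")
```

Under these assumptions the power for the interaction comes out at roughly 10%, compared with roughly 80% for the main effect. A different assumed interaction size would change the number but not the basic picture.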

Steel’s paper linked above is a draft reply to the article “From Handmaidens to POSH Humanitarians: The Case for Making Human Capabilities the Business of I-O Psychology,” by Alexander Gloss, Stuart Carr, Walter Reichman, Inusah Abdul-Nasiru, and W. Trevor Oestereich.


  1. Marcus says:

    I am in broad agreement with Piers about the need to better explore the influence of situational characteristics on effect sizes (and I’ve tried to do this in a couple of papers), but it is worth remembering that the typical meta-analysis in IO psychology accounts for three sources of variability in effect size estimates across studies: sampling error and unreliability in the measurement of the dependent and independent variables. Other study artifacts such as source effects (e.g., all data coming from self-reports versus more objective data or other-reports) can also be examined when data is split out. Additional study artifacts such as differential range restriction and differences in the construct validity of scores on measures can also cause effect size heterogeneity, but most meta-analyses don’t take these into account because the information necessary to make these corrections is typically unavailable. The field is also still learning how to estimate and deal with heterogeneity due to file-drawer effects. So, a meta-analyst has to take these unaccounted-for sources of variability into account in some way when interpreting heterogeneity statistics such as SDrho estimates.

    • Keith O'Rourke says:

      Agree it is a mess and has been for some time.

      It’s less challenging in clinical research, at least when all studies directly assessed the same effects, but the challenges have been known for some time and little progress seems to have been made.

      Maybe this is the reason many continually downplay the need to thoroughly investigate replication of results prior to any thought about pooling or partial-pooling anything. It is most common for folks to talk about pooling results as the objective and then to address apparent heterogeneity afterwards as a check on assumptions.

      (Of course, meta-analysis is now meta-Big-Business, especially in China – The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-analyses.)

      • Martha (Smith) says:

        Thanks for the links, especially the PubMed one. This seems in large part to be one more instance of people wanting and trusting simple “guidelines” or “checklists” that make them feel they are doing good science. But that’s not what good science is.
