Piers Steel writes:
One of the primary benefits of meta-analytic syntheses of research findings is that researchers are provided with an estimate of the heterogeneity of effect sizes. . . . Low values for this estimate are typically interpreted as indicating that the strength of an effect generalizes across situations . . .
Some have argued that many of the relationships studied in I-O [industrial-organizational] psychology are characterized by little or no residual heterogeneity. . . . Schmidt writes “The evidence suggests that moderators are often solipsistic: they exist in the minds of researchers but not in real populations” . . . On the other hand, as Tett, Hundley, and Christiansen’s article emphasizes, there are sound reasons for suspecting that many relationships studied in I-O psychology are, in fact, characterized by non-trivial amounts of residual effect size heterogeneity. . . .
Generalizability is context dependent . . . you can comfortably generalize when extending the findings to the range of samples, settings and measures comprised in the meta-analysis itself. However, as one tries to extend these findings further, to dissimilar measures or participants, confidence in generalizability should wane. . . . Meta-analysis perpetually struggles with the issue of external validity . . . most studies in management science don’t have a clear description of what the population of interest is . . .
To more adeptly and precisely argue for generalizability, it would help if we accepted that moderators and heterogeneity are the rule, not the exception. . . .
It is important to reflect on what an incredibly bold statement it is to profess perfect generalizability. It is essentially putting a social science relationship on par with the speed of light, as both would be considered constants of the universe. And yet we do this often and with a straight face.
We often see researchers playing both sides of the street on this one. When selling their results, they downplay variation and imply or even explicitly state that their claims generalize broadly. But when these results fail to replicate, there’s a tendency to reach for heterogeneity as an explanation: the power pose replication failed because of a change in the length of time that people were standing; the ovulation and clothing replication failed because the weather was different; etc. Psychology researcher John Bargh illustrated this principle in an even more stark way by framing failed replications as successes because they revealed interesting interactions.
Don’t get me wrong. I agree with Steel that interactions are important and that generality is a key concern. I just wish that researchers would recognize this right away, rather than using it as an alibi for failure.
Back to Steel:
Once we accept the inherent complexity of the world, that findings are not hermetically sealed from outside influences, our science can progress. We encourage researchers to be more thorough in their exploration of potential moderating variables and to properly contextualize their studies. Too often, when looking for potential moderators, we are limited to common demographic variables, such as sex or age, as these are the only characteristics that are recorded. If we could describe our studies more completely, what strides we could make in understanding variation. At a minimum, we should be commonly describing jobs with O*NET codes and industries with SIC codes. . . .
I like that he got specific there.
I would add just one thing to Steel’s article, and it’s an important thing. Interactions are important, and they’re real, but they’re hard to estimate.
Here’s a simple calculation demonstrating the difficulty of estimating interactions:
Suppose for simplicity you’re studying a binary outcome with a between-person experiment with 400 participants: 200 getting the treatment and 200 getting the control. Also assume that the frequency of yes and no responses is close to 1/2. Then the standard error of the simple treatment effect estimate is sqrt(0.5^2/200 + 0.5^2/200) = 0.05. That is, the experiment allows you to estimate the treatment effect to within an accuracy of about 0.05.
Now suppose you’re looking at an interaction between treatment effect and job category, and each of your two groups has 100 people in supervisory roles and 100 in non-supervisory roles. Then the standard error of the interaction estimate is sqrt(0.5^2/100 + 0.5^2/100 + 0.5^2/100 + 0.5^2/100) = 0.10.
That’s right—the standard error for the interaction is twice that for the main effect.
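The arithmetic above is easy to check directly. Here’s a minimal sketch using only the numbers from the example (a residual standard deviation of 0.5 and the stated cell sizes):

```python
import math

# Main effect: difference of two arm means, 200 per arm,
# binary outcome with standard deviation ~ 0.5
se_main = math.sqrt(0.5**2 / 200 + 0.5**2 / 200)

# Interaction: difference of two treatment effects, each
# estimated from cells of 100 (four cells total)
se_interaction = math.sqrt(0.5**2 / 100 + 0.5**2 / 100 +
                           0.5**2 / 100 + 0.5**2 / 100)

print(round(se_main, 4))         # 0.05
print(round(se_interaction, 4))  # 0.1
print(se_interaction / se_main)  # 2.0
```

The factor of 2 falls out of the formula: the interaction is a difference of differences, so its variance is the sum of four cell variances rather than two, each computed on half the sample.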
Now suppose that your study has been powered to just be able to estimate the main effect, and suppose that the interaction, while important, is not quite as large as the main effect. Then you’re in big trouble. Really the only way to estimate this interaction from data is if you have a lot of data. This connects to Steel’s theme of meta-analysis.
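To put a number on “big trouble,” here’s a rough power calculation under hypothetical effect sizes of my choosing: a main effect of 0.14 (2.8 standard errors, which is roughly the conventional 80% power), and an interaction half that size, carrying the doubled standard error from the calculation above:

```python
from statistics import NormalDist

def power(effect, se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `effect`."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)  # critical value, ~1.96
    return nd.cdf(-z - effect / se) + 1 - nd.cdf(z - effect / se)

# Hypothetical numbers: study powered for the main effect;
# interaction is half as large but has twice the standard error
print(round(power(effect=0.14, se=0.05), 2))  # 0.8
print(round(power(effect=0.07, se=0.10), 2))  # 0.11
```

Under these assumed numbers, a study with 80% power for the main effect has only about 11% power for the interaction. To bring the interaction up to 80% power its standard error would have to shrink from 0.10 to 0.025, a factor of 4, which means 16 times the sample size. Hence the need for lots of data, and the connection to meta-analysis.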
Steel’s paper linked above is a draft reply to the article, From Handmaidens to POSH Humanitarians: The Case for Making Human Capabilities the Business of I-O Psychology, by Alexander Gloss, Stuart Carr, Walter Reichman, Inusah Abdul-Nasiru, and W. Trevor Oestereich.