Why big effects are more important than small effects

The title of this post is silly but I have an important point to make, regarding an implicit model which I think many people assume even though it does not really make sense.

Following a link from Sanjay Srivastava, I came across a post from David Funder saying that it’s useful to talk about the sizes of effects (I actually prefer the term “comparisons” so as to avoid the causal baggage) rather than just their signs. I agree, and I wanted to elaborate a bit on a point that comes up in Funder’s discussion. He quotes an (unnamed) prominent social psychologist as writing:

The key to our research . . . [is not] to accurately estimate effect size. If I were testing an advertisement for a marketing research firm and wanted to be sure that the cost of the ad would produce enough sales to make it worthwhile, effect size would be crucial. But when I am testing a theory about whether, say, positive mood reduces information processing in comparison with negative mood, I am worried about the direction of the effect, not the size (indeed, I could likely change the size by using a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research — such as field versus lab). But if the results of such studies consistently produce a direction of effect where positive mood reduces processing in comparison with negative mood, I would not at all worry about whether the effect sizes are the same across studies or not, and I would not worry about the sheer size of the effects across studies. . . .

I’ve added the emphasis in the quote above to point to what I see as its key mistake, which is an implicit model in which effects are additive and interactions are multiplicative. My impression is that people think this way all the time: an effect is positive, negative, or zero, and if it’s positive, it will have different degrees of positivity depending on conditions (with a “pure” measurement having larger effects than an “attenuated” measurement). You can see this attitude in the above quote. There seems to be an idea, when considering true effects or population comparisons (that is, forgetting for a moment about sampling or estimation uncertainty), that there is a high fence at zero, stopping positive effects from becoming negative or vice versa.

This high fence doesn’t make sense to me. If main effects can be additive, so can interactions, and an additive interaction of the same order as a small main effect can push the effect past zero. If “a different manipulation of mood, a different set of informational stimuli, a different contextual setting for the research” can change the magnitude of an effect, I think it can shift the sign as well. One reason not to trust effects of magnitude 0.001 is that they can be fragile; there’s no guarantee the effect won’t be -0.002 next time around. And I’m not talking about sampling variability here; I’m talking about interactions, that is, real variability in the underlying effect or comparison. This idea is familiar to those of us who use multilevel models, but it can be missing in some standard presentations of statistics in which parameters are estimated one at a time, with no interest in their variation.
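To make this concrete, here is a minimal simulation sketch (Python, with invented numbers): a comparison that averages +0.001 across contexts, with context-level interactions of standard deviation 0.003, is negative in a large share of contexts even with no sampling error at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented numbers: a tiny average effect plus real variation across contexts
# (manipulations, stimuli, settings); no sampling noise anywhere.
average_effect = 0.001
interaction_sd = 0.003   # real variability in the underlying comparison
n_contexts = 10_000

true_effect_by_context = average_effect + interaction_sd * rng.standard_normal(n_contexts)

share_negative = np.mean(true_effect_by_context < 0)
print(f"share of contexts where the true effect is negative: {share_negative:.2f}")
# With these made-up numbers, roughly 37% of contexts have a negative true effect,
# even though the average effect is positive and there is no sampling variability.
```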

P.S. Funder’s post is fine too; he focuses on a different point, which is how to assess the relevance of correlations such as 0.3 which are too large to be noise but too small to be overwhelming.

27 thoughts on “Why big effects are more important than small effects”

  1. Having read this… I must say that treating zero as a special effect size or threshold and looking at the sign of the effect can be valuable.

    Responses often have a kink at zero, and human responses especially. In fact, human responses are discretized to a certain extent (do action B or not), and no-treatment often yields no discernible response; thus a prior (yes, this is Bayesian) for effect size would include a small finite mixture of effects. The posterior would be well characterized by zero/non-zero values or a sign (or the mixture element most associated with treatment).
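    A minimal sketch of what such a prior could look like (hypothetical numbers and standard conjugate-normal algebra, not anything from the comment above): a point mass at zero mixed with a normal “slab”, so the posterior reports the probability of a zero effect alongside the probable sign.

    ```python
    from scipy.stats import norm

    # Hypothetical spike-and-slab prior on the effect: with probability pi0 the
    # effect is exactly zero (no discernible response), otherwise it comes from
    # a normal "slab".
    pi0, slab_sd = 0.5, 0.5

    def posterior_summary(estimate, se):
        """Posterior P(effect == 0) and P(effect > 0 | effect != 0) from one noisy estimate."""
        m_spike = norm.pdf(estimate, loc=0.0, scale=se)                          # marginal under the spike
        m_slab = norm.pdf(estimate, loc=0.0, scale=(se**2 + slab_sd**2) ** 0.5)  # marginal under the slab
        p_zero = pi0 * m_spike / (pi0 * m_spike + (1 - pi0) * m_slab)
        # Conjugate normal update within the slab component:
        post_var = 1.0 / (1.0 / slab_sd**2 + 1.0 / se**2)
        post_mean = post_var * estimate / se**2
        p_positive_given_nonzero = norm.cdf(post_mean / post_var**0.5)
        return p_zero, p_positive_given_nonzero

    print(posterior_summary(estimate=0.40, se=0.1))  # clear effect: the spike gets little posterior weight
    print(posterior_summary(estimate=0.02, se=0.1))  # indistinguishable from zero: the spike dominates
    ```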

  2. Thanks for this post — I like the notion that small effects could flip sign for reasons other than sampling variability — i.e. because the relationship that (tenuously) held in the past might not hold in the future.

    A first reaction is that this could be driven by treatment response heterogeneity — and the category of people for whom the small effect was true no longer exists in the same form (for example, people mildly upset by Sarah Palin specifically).

    A second reaction is that “treatments” themselves are not necessarily stable things. The regressors we include in our models are often bundles of causal pathways. For example, if we’re exploring the “effects” of having two children in America on some outcome, we might be concerned that the whole world of things implicated by parenthood is different in 2013 than it was in 2003. This is more than just a “time effect”: parenthood might be a (slightly) different thing in and of itself.

    Both might be ways in which true (small) effects could flip signs!

  3. Interesting, although my first reaction was to think the post would be about ESP experiments with many trials but small effects, for example as done at PEAR and often reported in my favorite dog astrology journal, JSE. Search for Jahn in the list for examples.

  4. If I understand you (and Alex) you’re not saying that circumstance can flip an effect direction because of a change in the underlying causal mechanism, but just because the operationalization of those mechanisms can change depending on context. Right?

    So I discover that water ice makes my coffee colder and claim that water ice makes things cold. But you counter that if my coffee were on the surface of Io, water ice would warm it up. You are right, and the direction of the effect is context-specific. The problem is our focus on a specific operationalization (the effect of water ice on the change in coffee temperature), which is just an aspect of what should be a larger theory of thermodynamics.

    • Brent:

      Indeed. Or, for another example, suppose, in some country in some time period, that beautiful people are 0.1 percentage point more likely than non-beautiful people to have boys. And maybe in another country during another period, that difference is 0.2 percentage points in the other direction. In this case we’re talking about an observed or hypothesized difference that could have various causal mechanisms. My point is that, even if the sampling and measurement were clean enough, and the sample were large enough, that we could be highly certain that this was a true difference in the population being measured, this would not necessarily mean that the finding would generalize to other countries in other years. A very small difference could well be the product of several interacting and variable factors.

    • We don’t need to go to Io or change the experimental conditions in any way. Here on Earth, dropping an ice cube in coffee even from a small height will ADD heat (from kinetic energy). That’s a real effect, albeit tiny, working in the opposite direction. If this effect were of the same order of magnitude as your initial observation, one might get a different result just depending on how high the spoon was that dropped the ice!
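      A back-of-the-envelope check of those magnitudes (rounded constants, and an invented cube size and drop height, just to illustrate the point above):

      ```python
      # Back-of-the-envelope magnitudes; the 10 g cube and 10 cm drop are invented.
      m = 0.010                        # kg, a ~10 g ice cube
      g = 9.81                         # m/s^2
      h = 0.10                         # m, dropped from about 10 cm
      latent_heat_fusion = 334_000.0   # J/kg for water ice

      heat_added_by_fall = m * g * h                      # ~0.01 J turned into heat on impact
      heat_absorbed_by_melting = m * latent_heat_fusion   # ~3340 J drawn out of the coffee

      print(heat_added_by_fall, heat_absorbed_by_melting)
      # The warming effect is real but roughly five orders of magnitude smaller than
      # the cooling effect, so it only matters when the effect of interest is comparably tiny.
      ```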

  5. This is why I care about testing theory, e.g., theories about the causes of the outcome of interest.

    Small (reduced form) effects can be big effects in different contexts (e.g. in the presence of a catalyst). So I care about the causal structure, and only then about the effect sizes.

    In my view the parametrization of the structure is secondary.

    • Fernando:

      I agree, but the question remains: what to do with interesting patterns that are discovered atheoretically. For example, everything in our Red State Blue State book. Ultimately I agree that causal explanation is important—and is, indeed, a (largely unstated) thread running throughout the book—but, before we get there (and we will never get there!), the sizes of observed comparisons are important. Rich people voting 15% more Republican than poor people is much different than a gap of 45% or of 5%.

      One of my problems with studies such as Kanazawa’s is that he implicitly accepts a particular causal story for what is, at best, a population difference, and that motivates him and similar researchers to declare victory upon seeing statistical significance. Even if his work had no statistical errors, and even if his sample size were large enough to discover an unambiguous correlation in the population he is studying, I still think that if the difference is 0.1 percentage point, I won’t take it so seriously, as this could come from all sorts of causal paths, many of which are contingent on temporary and local socioeconomic conditions. A difference of 8 percentage points (or, as notoriously reported by Freakonomics, 36%) would be something interesting, but of course completely implausible in this case. My point is, in the absence of a good causal story, the effect size can be relevant in considering importance and generalizability.

      • Andrew:

        You state: “(and we will never get there!)”

        I start there!! I start there, and then check whether data knock me off my pedestal.

        Causal inference starts with a guess, descriptive inference with exploration.

        Admittedly there is a grey area where jumps are taken from one to the other, but in this Dervish dance it is easy to get intoxicated if one is not careful.

      • Andrew: “what to do with interesting patterns that are discovered atheoretically”

        You do what Galileo did: experiment. And note he did not have Stata, R, Stan, or OLS. He practiced what I call “naked causal inference”, or good design. No wonder the Church ran amok.

        PS And yes, experimentation is hard in social sciences, but still.

        • Fernando:

          Different people have different skills. You may be good at social science experiments and that’s fine. I’m good at data analysis, which is why I wrote Red State Blue State. I think there’s room in the world for more than one kind of scientist.

        • I thought this post was about effects, which I interpret to be causal, so I shared my opinion.

          Nothing against descriptive inference or prediction. They are both very valuable.

          PS. I’ll be the last person to try to homogenize scientists.

        • Fernando:

          Jennifer would argue (and I might agree) that the ultimate goal in nearly all research (including that of Red State Blue State) is to identify causal effects. Even there, however, I think it can be useful for statisticians such as myself to blaze a trail with data by studying patterns descriptively, even while others start from first principles and conduct experiments and natural experiments.

        • Fernando: about the “what to do with interesting patterns that are discovered atheoretically”

          In my career, I have encountered some who seem to _know_ what to do with them: just ignore them.

          For instance, JG Gardin, who published his pattern-finding algorithm in archaeology only in protest.

          His explanation was: “You cannot rule out a hypothesis by the way it was generated, BUT you surely can choose what to spend your time on.”

          So, some of us simply choose not to spend _our_ time working on patterns that are discovered atheoretically.

          p.s.
          Peirce spent a lot of effort trying to come up with logical support for this but in the end admitted it came down to a superstition that humans had evolved to be better than undirected pattern searches.
          (no spell checker available)

    • Why would you assume that the dependence on the context wouldn’t change the sign of the effect (especially if the effect is minuscule to begin with)? Given this, what good is a causal effect if you don’t get the sign right? You could literally be doing harm with your result.

      Also, most study designs (and pretty much any observational study) will introduce some small bias in the translation from the study question to the comparison populations which are analyzed. If you don’t care about effect size, every association study will (incorrectly) infer causal structure given a large enough sample size. Accounting for effect size is necessary to correctly assess the strength of evidence for causal structure.

      • Revo11

        Not sure if your comment was addressed to me. I care about effect size, but first I care about testing theory. Testing theory and estimation are different, sometimes overlapping, endeavors. The latter requires additional assumptions. I like to proceed in steps and be deliberate about what I am doing at each stage. But above all I don’t consider this an either/or approach. As I said, I do testing, and estimation, and judgement.

        I also don’t start by assuming parametric forms, so I allow for any type of interactions, changes in sign, etc. I do typically assume faithfulness: namely, that if X affects Y through two different pathways, say, the parametrization is not so finely tuned that the effects exactly cancel out. (Note that as this assumption breaks down, huge effects may appear as small effects in reduced form; see the sketch at the end of this comment.)

        Moreover, even asymptotically I see no reason why large effects cannot change signs. Some medicines help a lot but kill a few. In any case, this is not a very productive conversation if we do not define more clearly what sort of effects or interventions we have in mind (e.g. ceteris paribus or not, within the same sample, across populations, etc.).
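        A minimal numerical sketch of that faithfulness point (invented coefficients, simulated data): a large direct effect nearly cancelled by an indirect pathway shows up as a tiny reduced-form effect.

        ```python
        import numpy as np

        rng = np.random.default_rng(1)
        n = 100_000

        # Invented linear system: X -> Y directly, and X -> M -> Y indirectly.
        x = rng.standard_normal(n)
        m = -0.98 * x + 0.2 * rng.standard_normal(n)             # X -> M
        y = 0.50 * x + 0.50 * m + 0.2 * rng.standard_normal(n)   # direct 0.50, indirect 0.50 * (-0.98)

        # Reduced-form (total) effect of X on Y: 0.50 + 0.50 * (-0.98) = 0.01,
        # tiny even though both pathways are large.
        reduced_form_slope = np.polyfit(x, y, 1)[0]
        print(f"reduced-form slope: {reduced_form_slope:.3f}")
        ```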

        • I’m responding to the point regarding “I care about the causal structure, and only then about the effect sizes.” I think the latter is essential to getting at the former, but perhaps this falls under the category of what you’re calling “judgement”.

          I agree that one can’t get much deeper into this discussion without domain specifics which affect what constitutes a reasonable approach. I would say that within the domain of social science, a dose of humility regarding one’s theories is warranted even with good study design, given the degree of underdetermination and confounding intrinsic to the discipline. “Naked causal inference” is not feasible for all but small problems, although the allure of it probably explains why methods like IV get abused.

        • revo11

          I disagree with you on a number of points:

          First, you can test theories using non-parametric tests of an effect, leaving the effect itself mostly unmodeled. Then you can plot the data. Then you can impose a functional form and make stochastic assumptions to estimate the size and compute a CI. Often it will be sensible to fit a hierarchical model or whatever to summarize your inference. This is not too different from model checking, except that I start with a very general model. (See the sketch at the end of this comment.)

          Second, physics too is non-deterministic, at least at the quantum level. And social scientists have an advantage over physicists: physicists don’t know what it is like to be a subatomic particle, but we know what it is like to be a human and interact in a society. We can come up with theories of why people vote more easily, I think, than Feynman came up with quantum electrodynamics. The line of thinking that social science is “different” has done more damage than good.

          Third, your statement that naked causal inference is not feasible for all but small problems is wrong, and I can formally prove that. Naked causal inference can be used to uncover almost any causal structure, no matter how complicated the mechanism. You just have to write down the DAG and then come up with a research design to answer one specific question.

          More generally, I espouse naked causal inference because in my (limited) experience many students I come across appear to believe that science *is* running a regression, and that good social scientists are those that use the most complicated estimators and models. For this reason I am now taking a stand at the antipodes of that paradigm. My goal is to bring attention to what I see as a fundamental problem in current social science practice and teaching.

          PS Understanding gravity or planetary motion is no small problem.

          PPS This is not to say complicated models should not be used. I agree with Andrew that sometimes we just want to go into the data to uncover interesting patterns: all science begins with observation. But on occasions when we want to test theories of cause and effect, I would proceed differently. And there, simplicity and clarity are virtues.
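          To illustrate the sequence sketched above (test first, then plot, then impose a functional form), here is a minimal example on simulated two-group data; the particular test (Mann-Whitney) and estimator (OLS) are just one possible instantiation, not a prescription.

          ```python
          import numpy as np
          import scipy.stats as stats
          import statsmodels.api as sm

          rng = np.random.default_rng(3)

          # Invented two-group data standing in for "treatment" and "control".
          control = rng.normal(0.0, 1.0, size=150)
          treated = rng.normal(0.3, 1.0, size=150)

          # Step 1: a non-parametric test of whether there is any effect at all.
          u_stat, p_value = stats.mannwhitneyu(treated, control, alternative="two-sided")
          print(f"Mann-Whitney p-value: {p_value:.3f}")

          # Step 2 (plotting the raw data) is omitted here but comes before any modeling.

          # Step 3: only then impose a functional form to get a size estimate and a CI.
          y = np.concatenate([control, treated])
          x = sm.add_constant(np.concatenate([np.zeros(150), np.ones(150)]))
          fit = sm.OLS(y, x).fit()
          print("effect estimate:", fit.params[1], "95% CI:", fit.conf_int()[1])
          ```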

        • “test theories using non-parametric tests of an effect” – you can check that your theory is consistent with the data, but whether you end up with a theory that’s likely to capture the true causal structure depends on how badly underdetermined the set of models consistent with the data is. Within social science, I’m not optimistic about the underdetermination problem. This is not to say that I’m against making progress with theory – I am pro hypothesis-driven research. However, I also assume that most theories will be false even though they claim to be “tested”. As you pointed out earlier, I don’t think we can get much further beyond this vague disagreement without a concrete problem.

          The comparison between probability in quantum physics and difficulties in social science is pretty far off. The difficulty with establishing “correct theories” in social science usually has less to do with sampling error associated with a well-defined distribution (and the distributions that arise in quantum mechanics are quite well defined) and more to do with confounding and underdetermination – issues that tend to be outside of the model.

          Regarding naked causal inference – I agree that you can come up with a hypothetical research design to address any question of causality. However, often the set of research designs that achieve naked causal inference and the set of research designs that are feasible don’t intersect, or worse, people find natural experiments that aren’t.

          We may not disagree all that much – I support furthering research through hypotheses (i.e. theory), regardless of the difficulty of the domain. I’m just skeptical when people push a theory as right based on having “tested” it. Given the underdetermination intrinsic to the domain, most reasonable-sounding theories which are consistent with the data will be wrong from a purely entropic standpoint.

          To attempt to bring this long tangent back to the initial point, the virtue of incorporating effect size is that it helps to regularize the solution space of theories “supported” by data, by culling away those theories which are only “supported” through a small amount of confounding bias corrupting the signal.

        • revo11

          Underdetermination is a fundamental problem of identification that has little to do with effect size. For example, the claim that X causes Y is compatible with X causing M, which causes Y, and so on indefinitely. We will never know. So we start with a guess, and if the data are consistent with the guess we stick with it, perhaps appealing to Occam’s razor.

          Confounding can go in any direction. It can make small effects seem big, and big effects seem small. Which is to say, confounding is confounding is confounding. Also, confounding is an identification issue which for the most part has little to do with parametric assumptions. We can keep the discussion simple. (Unless, of course, the quantity of interest is not identified non-parametrically, in which case you need to make parametric assumptions.)

          All theories are false, we agree. Some are useful. Also note that I am not proposing we only test, but rather that we start with tests, then graphs, then models and estimation.

        • From a theoretical standpoint, identification and confounding are defined differently, but in the real world, they cannot be considered separately.

          Every comparison will be confounded a _little_, even controlled experiments. There’s always some tiny bit of temporal, spatial, or sampling variation biasing any comparison.

          Now a careful experimentalist might protest – “but you’re arguing about something that doesn’t matter – the .1 degree temperature (or whatever else) difference between my experimental conditions doesn’t affect my main result.” They’d be right, but that means that there’s an implicit definition of scale they have in mind, below which bias is not practically important. It’s not productive to say their experiment was confounded and that’s the end of the story.

          Can confounding occur in any direction, at any scale? Yes. However, just because it _can_ happen at any scale doesn’t mean that the risk of it is independent of the scale. For one thing, there’s an inherent asymmetry in the definition – confounding at an effect of 10 is still confounding at an effect of 1. Secondly, for any given comparison of interest, the number of ways confounding can occur in practice tends to increase as one considers smaller and smaller scales of differences.

        • Revo11 March 3:

          Really like your points, especially this one “more to do with confounding and underdetermination – issues that tend to be outside of the model”

          An example from the past where I leaned heavily on size – http://www.ncbi.nlm.nih.gov/pubmed/10825042

          Here, looking at all possible logistic regression models, the minimum odds ratio was more than 4.
          (I had wanted to show all the possible adjusted odds ratios in a histogram.)

          Years later someone tried to do an RCT but it failed to recruit patients.

        • PS Where complicated models are really very useful is in measurement e.g. ideal point estimation, natural language processing, remote sensing, etc…

          But for causal inference I think simplicity is key, and simplicity can be “nudged” by requiring the design rely on a simple test.

  6. I had to deal with genetic data to predict a phenotype given thousands of gene expressions (as in y = a + sum_j b_j x_j). A recurrent problem is selecting those genes x_j that contribute to the observed phenotype y. One way is to measure the effect size b_j of each gene and take those whose value is above a fixed threshold. Most of the time, a large number of genes is observed to contribute to the particular phenotype.
    How about that?
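    One toy version of that setup (all numbers invented) illustrates why a fixed threshold on per-gene effect sizes tends to pass many genes when there are thousands of noisy estimates:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Toy phenotype model: y = a + sum_j b_j * x_j + noise, with only ten truly nonzero b_j.
    n_samples, n_genes = 200, 5_000
    true_b = np.zeros(n_genes)
    true_b[:10] = 0.5
    expression = rng.standard_normal((n_samples, n_genes))
    phenotype = 1.0 + expression @ true_b + rng.standard_normal(n_samples)

    # Per-gene estimated effect sizes (marginal regression slopes; x is standardized).
    estimated_b = expression.T @ (phenotype - phenotype.mean()) / n_samples

    threshold = 0.3
    selected = np.flatnonzero(np.abs(estimated_b) > threshold)
    print(f"{selected.size} genes pass the threshold (only 10 are truly nonzero)")
    # With thousands of noisy estimates, many genes clear a fixed threshold even when
    # their true effects are zero -- the small-effect fragility discussed in the post.
    ```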
