
Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

At Bank Underground:

When studying the effects of interventions on individual behavior, the experimental research template is typically: Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes. If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression. But in this post I argue for an alternative approach—study individual subjects using repeated measures of performance, with each one serving as their own control.

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification. And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses. Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement . . .
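To make the contrast between the two designs concrete, here is a minimal simulation sketch. It is not from the Bank Underground post; every number and name in it (the effect size, the two standard deviations, the sample sizes, the function names) is an assumption chosen for illustration. It also assumes a constant treatment effect, stable person-level baselines, and no carryover between conditions, which are exactly the things a real within-subject study would have to worry about.

```python
import random
import statistics


def between_subject_estimate(rng, n_subjects, effect, sd_person, sd_noise):
    """One simulated between-subject experiment: difference in group means.

    Each person is measured once, so person-level variation stays in the estimate.
    """
    half = n_subjects // 2
    control = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
               for _ in range(half)]
    treated = [rng.gauss(0, sd_person) + effect + rng.gauss(0, sd_noise)
               for _ in range(half)]
    return statistics.fmean(treated) - statistics.fmean(control)


def within_subject_estimate(rng, n_subjects, n_repeats, effect, sd_person, sd_noise):
    """One simulated within-subject experiment: mean of per-person differences.

    Each person is measured n_repeats times under each condition, so the
    person's baseline cancels out of their own treated-minus-control difference.
    """
    diffs = []
    for _ in range(n_subjects):
        baseline = rng.gauss(0, sd_person)
        control = [baseline + rng.gauss(0, sd_noise) for _ in range(n_repeats)]
        treated = [baseline + effect + rng.gauss(0, sd_noise)
                   for _ in range(n_repeats)]
        diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    return statistics.fmean(diffs)


def sampling_sd(estimator, n_sims, seed, **kwargs):
    """Standard deviation of the estimator across repeated simulated experiments."""
    rng = random.Random(seed)
    estimates = [estimator(rng, **kwargs) for _ in range(n_sims)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    # Illustrative settings only: people differ a lot (sd_person) relative to
    # the occasion-to-occasion noise in a single measurement (sd_noise).
    common = dict(effect=0.2, sd_person=1.0, sd_noise=0.5)
    sd_between = sampling_sd(between_subject_estimate, n_sims=500, seed=1,
                             n_subjects=200, **common)
    sd_within = sampling_sd(within_subject_estimate, n_sims=500, seed=1,
                            n_subjects=20, n_repeats=10, **common)
    print(f"between-subject, 200 people measured once: sd = {sd_between:.3f}")
    print(f"within-subject, 20 people x 10 repeats per condition: sd = {sd_within:.3f}")
```

Under these particular assumptions, with people differing from each other much more than one person differs from occasion to occasion, the within-subject design should show the smaller sampling standard deviation despite using a tenth as many people. Change the ratio of sd_person to sd_noise, or add carryover, and the comparison can look quite different.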

I summarize:

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement. Real-world experiments are imperfect—they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.

Measurement is central to economics—it’s the link between theory and empirics—and it remains important, whether studies are experimental, observational, or some combination of the two.

I have no idea who reads that blog, but it’s always good to try to reach new audiences.


  1. gwern says:

    The link in your post is broken. (I left a comment there but ’twas moderated.)

  2. OK–I think I get it. Here’s an example that illustrates how I see it.

    Say you want to know which is more effective for treatment of insomnia: cognitive-behavioral therapy or listening to Debussy for an hour every evening. The standard approach would be to gather a large number of subjects who have volunteered to take part in the study, give them a pre-test to assess their insomnia, conduct the intervention for a given length of time, and then do a post-test. The problem is that the results may be noisy; people may drop out of the study, some people may hate Debussy and get all riled up during the treatment, some may have undisclosed internet addictions, some may have uneven insomnia, etc.

    In contrast, if you find several subjects who are willing to try the intervention for a year or more, and who have no undisclosed issues that could interfere with the outcomes, you can measure each individual multiple times. Because of the precision and integrity of the measurements, you could learn more from the few than from the many.

    This makes sense. So then my question is: why not aim for the best of both worlds? Try to measure a large number of individuals–as individuals–repeatedly, and then compare their results? I understand that this is exceedingly difficult (and expensive) to pull off–but even if it were possible, the combination of methodologies would create its own complications and noise. So, as I understand it, if one is to choose between simplicities, the simplicity of individual study will ultimately yield more information.

    • Actually, there is a growing body of literature on both the collection and the analysis of this kind of data (time series for multiple subjects). You can find much of it, along with applications, by google-scholaring “intensive longitudinal data”.
