Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm)

Statistical Challenges of Survey Sampling and Big Data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.
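To make the "adjust to extrapolate from sample to population" step concrete, here is a minimal sketch of plain poststratification using raw cell means. It is not the Bayesian multilevel modeling discussed in the talk, only the simplest version of the reweighting idea. The groups, population shares, and sample counts are all made-up assumptions for illustration.

```python
# Hypothetical illustration: every number below is invented.
# Population shares of two groups, assumed known (e.g. from a census).
population_share = {"young": 0.5, "old": 0.5}

# A convenience sample that over-represents the "young" group.
# Each record is (group, outcome), where outcome is 1 or 0.
sample = (
    [("young", 1)] * 60 + [("young", 0)] * 20   # 80 young, 75% with outcome 1
    + [("old", 1)] * 5 + [("old", 0)] * 15      # 20 old, 25% with outcome 1
)


def raw_mean(records):
    """Unadjusted sample mean, ignoring who ended up in the sample."""
    return sum(y for _, y in records) / len(records)


def poststratified_mean(records, shares):
    """Weight each group's sample mean by its known population share."""
    total = 0.0
    for group, share in shares.items():
        ys = [y for g, y in records if g == group]
        total += share * (sum(ys) / len(ys))
    return total


raw = raw_mean(sample)                               # 65 / 100 = 0.65
adj = poststratified_mean(sample, population_share)  # 0.5 * 0.75 + 0.5 * 0.25 = 0.5
print(f"raw mean: {raw:.2f}, poststratified mean: {adj:.2f}")
```

The raw mean is pulled toward the over-represented group, while the adjusted estimate uses the population shares. With many small cells the raw cell means become noisy, which is one reason to bring in a model.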


  1. Ben Hanowell says:

    Thank you for doing the Lord’s work here. I have become so sick and tired of the assumption that more data is always better no matter how you treat it. That is, of course, only true in the absence of selection biases, etc.

    It reminds me of a project I did with my employer where we estimated the impact of moving to an assisted living community on quality of life by comparing present quality of life between . I was happy to convince my employer of the necessity of carefully matching treated older adults to untreated older adults. Andrew, you wouldn’t be proud of me though: I ended up using nearest-neighbor propensity-score matching, with the propensity score estimated using a gradient-boosting machine model (using RAND’s twang package for R), and then after matching I did all my modeling of the treatment effect. I didn’t model it all simultaneously.

    That said, some argue you shouldn’t model the propensity score and target estimates simultaneously anyway.

    • Ben Hanowell says:

      I sent too soon. We estimated the impact of moving to an assisted living community on quality of life by comparing present quality of life between those who had moved to assisted living and those who were looking for it.

      • Anoneuoid says:

        You have to make a clear distinction between making predictions/classifications and estimating model parameters. The second is much, much more unreliable without a good model. For the first you simply hold out a good amount of data until the end then use that to assess skill. As long as the situation doesn’t greatly change it should be in the ballpark. On the other hand, I usually won’t even take such estimation projects since it keeps ending up as “wanting the impossible”.

        If you read the academic literature do you ever see someone plugging in new data to a previously published regression model (using same coefficients, etc)? I actually have never seen this in biomed.

        • Matthew Zack says:

          In the biomedical literature, look up the Framingham Risk score to predict coronary artery disease occurrence and the Gail model to predict breast cancer occurrence.

          Other models to perform such predictions, using different covariates or different coefficient values, have been published since then.

  2. zbicyclist says:

    Sounds interesting.

    I’ve found Kaiser Fung’s OCCAM mnemonic useful in characterizing big data challenges.

    A couple of times I’ve run into confused researchers who think that taking a random sample of a convenience sample accomplishes something (other than to make the computation faster).
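    A small simulation can illustrate the point about subsampling. This is only a sketch under invented assumptions: the population, the selection rates, and the subsample size are all made up.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical population: 30% of people have outcome 1.
population = [1] * 30_000 + [0] * 70_000

# Convenience sample: people with outcome 1 are three times as likely
# to end up in the data (made-up inclusion rates of 0.3 versus 0.1).
convenience = [
    y for y in population
    if random.random() < (0.3 if y == 1 else 0.1)
]

# A simple random subsample of the convenience sample.
subsample = random.sample(convenience, 1_000)


def mean(xs):
    return sum(xs) / len(xs)


pop_mean = mean(population)
conv_mean = mean(convenience)
sub_mean = mean(subsample)

print(f"population mean:         {pop_mean:.3f}")
print(f"convenience-sample mean: {conv_mean:.3f}")
print(f"random-subsample mean:   {sub_mean:.3f}")
```

    Under these assumptions the subsample mean tracks the convenience-sample mean, not the population mean: the random draw inherits the selection bias and only reduces the amount of data to process.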

  3. Vishv Jeet says:

    Given that your talk is delivered remotely, any possibility that it is open to the public?

  4. Mike Beyer says:

    Thanks Andrew! Big Data = Complex Population (sometimes/often?). One problem, especially with “intro” courses to data science and statistics, is that they do all the hard steps for you (or, probably more often, just gloss over or ignore them):

    1) collecting data
    2) cleaning it
    3) validating it
    4) specifying a model (!!!)
    5) accounting for sampling biases during collection (MAR, MCAR, etc) <– [see Andrew's talk…and his book!!]
    6) replication/cross-validation

    So, all students see is some nice data (perhaps needing some cursory data transformations, trimming, etc), a pre-selected model (e.g., logistic regression), and the output of a standard package. It's a very simplistic view (perhaps necessarily…but bootcamps purport to make you "battle ready" in just a few weeks).

    Fitting a model may be computationally hard and require ingenious solutions, but it is not the hard part of understanding the data.

    In fact, forget fancy models, just correctly interpreting a sample mean from a complex dataset can be hard (i.e., what population does it describe…important for generalizability).

  5. Rajesh says:

    In certain Big Data and digital social science literature, I encounter terms like *found data*. Should I treat this as synonymous with data acquired through convenience sampling? Are there classic references on convenience sampling that I might find useful? I am also curious to know from some of you (working at the interface of Statistics and Machine Learning) if there are papers discussing these issues from an ML point of view.

