Skip to content

Big Data Needs Big Model

Big Data are messy data, available data not random samples, observational data not experiments, available data not measurements of underlying constructs of interest. To make relevant inferences from big data, we need to extrapolate from sample to population, from control to treatment group, and from measurements to latent variables. All these steps require modeling.

At a more technical level, each of these steps should be more effective if we adjust for more factors, but models that adjust for lots of factors are hard to estimate, and we need (a) regularization to get more stable estimates (and in turn to allow us to adjust for more factors, and (b) modeling of latent variables. Bayesian inference in Stan can do these things. And our next step is Bayesian workflow, building up models from simple to complex and tracking how these work on data as we do it.

So somebody please give me 10 million dollars.


  1. Z says:

    “So somebody please give me 10 million dollars.”

    Is that a joke about how much tech people get paid, or does Stan need more funding? (or a little from column A, a little from column B?)

    • Dan Simpson says:

      Stan always needs more funding! It’s like a gas that expands to consume any available funding. (Or, more seriously, the limiting factor with how fast and how broadly Stan can grow is mainly around who is able to work on it. Funding buys people hours and postdocs and PhD students. Not just at Columbia, but also everywhere else that a StanDev is working.)

    • No joke. No, we don’t “need” more funding. At least for the people working at Columbia, our funding’s secure for the usual horizon—two to three years. Longer than that’s pretty much impossible to guarantee on soft money positions, which is why a large endowment to provide security would be so helpful in hiring. No, we don’t get paid a ton—those of us from industry took pay cuts to work at Columbia (but we like the team, flexibility, and university environment).

      If someone donated $10M, we’d pretty much have to start our own foundation rather than going through Columbia or NumFOCUS if we didn’t want a lot shaved off the top and wanted complete control for the project to spend it. And we’d use it to do things like pay for testing infrastructure, more tutorial outreach, and salaries and salary long-term guarantees for developers. And of course, legal costs, because now we’d be running our own foundation.

  2. Dan Simpson says:

    That’s the best program grant application I’ve read in a while.

    • The response we’ve gotten to proposals for big data pitching this have fallen about as flat as you might expect. The referees just tend to say it’s not innovative enough or not aiming at large enough data, whereas in person people tell me we should just use deep belief nets—you can always throw in more features and why not let the net work out the latent structure for you. They’re even less keen on us doing things like posterior predictive checks for calibration of techniques like variational inference, which we’ve also been pitching after we pitch that we’ll need approximate methods to really scale.

  3. Pointeroutguy says:

    Extrapolate from control to treatment? If you can do that reliably, it’s easily worth 100 million!

  4. Yuge Da'a says:

    Grateful to hear this; it seems obvious but never acknowledged (see most big data methods papers and applications).

    Also obvious and unacknowledged: Bayesian analyses suffer huge scalability issues. I wish I would see more focus on fast approximations that could help overcome this huge problem, even perhaps included as a part of Stan…

    • Dan Simpson says:

      Fast approximations that work heavily leverage specific structural assumptions about the model. So the challenge is working out how to build them. We have ADVI in Stan (although we’re still working on diagnostics to tell when it works) and we have a plan to implement some of the functionality from INLA. There’s also some rogue code using EP for parallelization, so there is a lot of work happening in this direction.

      Probably the hardest thing about doing this type of work in Stan (and why it take time/money etc) is that Stan really isn’t a prototyping language for new algorithms. This means that you really want to have something that works before you go to all of the effort of plumbing it through the system.

      • If someone demonstrates that one of these methods works for a general class of model some other way than building it into Stan, we’ll certainly put in the effort to code it up in C++. We can export Stan’s model log density and gradient calcs to algorithms in R and (I think) Python. Now that the services are cleanly factored into independent functions, adding them to Stan isn’t so hard when you have the C++ implementation of the algorithm.

        Things like INLA that Dan works on are much more custom. You can do much better scaling if you devote yourself to a single problem. Vowpal Wabbit is a nice example of how far that can get you with something like logistic regression.

    • Michael Thompson says:

      Well, here’s but one example from Berkeley:

      [Parallel Supercomputing for Astronomy](

      And, the modeling details are here:
      [Learning an Astronomical Catalog of the Visible Universe through Scalable Bayesian Inference](

    • This is exactly what we’ve been proposing. And it’s why we developed ADVI as part of Stan—it’s a black-box scalable approximate Bayesian technique. Right now, we’re trying to figure out for what class of models it will provide decent approximations. It’s easy to build approximations, but much harder to understand them.

      Stan’s going in two orthogonal directions for generic speedup of the log density and gradient calculations (where all the effort is spent): parallel likelihood computations across multiple cores and even over networks and GPU-based matrix operations. That’ll let us scale another order of magnitude or two, which of course, will never be enough.

      Algorithms that combine approximations and parallelism are obviously going to be the near-term future for scaling Bayes.

      • Sebastian says:

        Bob is referring to MPI parallelization. The nice thing about this technique is that if your likelihood per whatever unit of your hierarchical model is very costly to calculate, then the MPI approach will scale really well on good hardware, see here:

        So the harder your problem, the more benefit users will have as long as you can provide the hardware. On the problem I link above we were able to deliver a 62x speedup using 80 cores on an infiniband cluster. That takes 2.6 days down to 1h computation time on a real-world problem. Can’t wait to see this land in Stan.

  5. Erikson says:

    About modeling latent variables:

    I’m quite familiar with the Structural Equations Modeling framework (I’m even (slowly) working on an interface to use R’s lavaan syntax to fit those models easily in Stan, But I’m skeptical about its usefulness in big data scenarios because their restrictive assumption about the measurement model: local independence (unless some modification indices suggest freeing a residual covariance); indicators as linear combinations of latent variables, multivariate normal distribution of observables and latent variables, etc.

    So, I wonder: when Andrew talks about ‘modeling latent variables’, is he thinking about some specific framework like SEM?

    • Dan Simpson says:

      The latent variables he’s talking about aren’t the same as the ones in SEM. He’s usually talking about random effects models (although the term “latent variable” really just means an unobserved parameter, it’s common use is more specific). He writes about it in his book with Jennifer (and I think in BDA).

      A quick literature review of a large class of these types of models (which should give you an idea of the scope) is here (article chosen mostly at random, and entirely for the lit review in Section 2)

      • Erikson says:

        Thanks, Dan!

        I thought Andrew wouldn’t be making a reference to the SEM framework, but it’s good to be sure. His book with Jennifer gave me the impression he doesn’t really like SEM models (a short passage in Chapter 10 criticizing their use to estimate many causal effects under strong assumptions).

    • Cool! I hadn’t heard of that effort before.

      My main objection to something like SEM is that I can’t understand models stated that way. I haver to get out a decoder (I like Sophia Rabe-Hesketh’s in the stata doc). I’m sure if I used SEM all the time, it’d make some models easier to specify.

    • Aki Vehtari says:

      Erikson: I’m interested learning more about lavaanStan. Could you add a bit more explanation (and examples) to in your github repo?

    • Ben Goodrich says:

      You might consider combining your efforts with the blavaan developers

      • Erikson says:

        Ben, I exchanged some e-mails with Prof. Merkle. He has plans to include model estimation with Stan, but so far I haven’t worked on a proper interface to work with his package. We also discussed how to allow for custom priors without recompiling the whole model, but I haven’t devised a proper solution so far (maybe something in the lines of

        • Ben Goodrich says:

          A long time ago, I thought about how to put state-space models into **rstanarm** but never got very far. I think for something similar like lavaan, you can implement a lot of likelihoods just by multiplying some matrices together. The challenge is that the user will want to constrain some elements of those matrices to be zero, so the interface would have to internally pass the pattern of fixed zeros and free elements to Stan and then fill in the free elements of the matrix from an unknown vector in the transformed parameters block. So, it does seem as if you could pre-compile the Stan programs.

          • Erikson says:


            Indeed, I used this idea in the transformed parameters block to make a single compiled program able to run any SEM model (and, man, that code is ugly!). But it works nicely; I have kept separated programs for CFA, path analysis and SEM, for the sake of optimization.

  6. Keith O'Rourke says:

    > Bayesian workflow, building up models from simple to complex and tracking how these work on data as we do it
    A resource like that would accelerate science in many areas – like building infrastructure to support a realistic understanding of uncertainties and how to purposely manage them.

    Unlikely to happen in academia with least publishable units exaggerating individual contributions…

  7. Kaiser says:

    Might be inappropriate to put this here:
    certain big models may have 10 million tunable parameters, so it’s only $1 per parameter!

Leave a Reply