“if you add a few more variables, you can do a better job at predictions”

Ethan Bolker points me to this news article by Neil Irwin:

Robert J. Gordon, an economist at Northwestern University, has his own version that he argues explains inflation levels throughout recent decades. But it is hardly simple. Its prediction for inflation relies not just on joblessness but also on measures of productivity growth, six shifts in food and energy prices and overall inflation over the six preceding years.

In other words, just knowing the unemployment rate may tell you very little about what inflation will be. But if you add a few more variables, you can do a better job at predictions.

Overfitting could be an issue, no?

I think adding a lot of predictors is a great idea; the mistake is to fit the model using least squares! If the predictors have realistic priors, their coefs will be appropriately pulled down (in expectation) and overfitting shouldn’t be such a problem. That said, we don’t give good advice or examples of such analyses in our books. It’s an important research project.
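
To make this concrete, here is a minimal sketch (not the analysis from the article; the data, prior scale, and names are invented) of how independent normal priors on the coefficients act as shrinkage: the posterior mode is a ridge-type estimate that pulls every coefficient toward zero, so adding predictors is less of a liability than it is under plain least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: 20 predictors, n = 100, and a prior that is "realistic"
# in the sense that the true coefficients really are draws from it.
n, p = 100, 20
tau, sigma = 0.5, 1.0                  # prior sd for coefficients, noise sd
X = rng.standard_normal((n, p))
beta_true = rng.normal(0.0, tau, size=p)
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# Least squares: no shrinkage, so the extra predictors mostly add noise.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Posterior mode under independent Normal(0, tau^2) priors with known sigma:
# this is ridge regression with lambda = sigma^2 / tau^2, which pulls every
# coefficient toward zero by an amount dictated by the prior.
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("coefficient RMSE, least squares:", np.sqrt(np.mean((beta_ols - beta_true) ** 2)))
print("coefficient RMSE, normal-prior MAP:", np.sqrt(np.mean((beta_map - beta_true) ** 2)))
```

When the prior really does describe the spread of the coefficients, the shrunken estimates are closer to the truth in expectation, which is the sense in which piling on predictors stops being a recipe for overfitting.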

32 thoughts on ““if you add a few more variables, you can do a better job at predictions””

  1. Agreed on overfitting. However, I doubt moving away from least squares will ameliorate the problem – or perhaps that’s not where the real problem lies.

    Doesn’t the real problem lie in believing you can treat unnatural systems as if they were natural systems? That is to say: isn’t the real problem that nature, over time, has put bounds on its variables, whereas unnatural systems contain variables that are unbounded? When I do a study of maple leaves, I can have confidence that nature has bounded the variables in play. I’ll never find a purple, glow-in-the-dark maple leaf. Conversely, when I study an unnatural system (e.g. an economy), the variables I choose are man-made and man-defined. These variables have not had the rigor of time to eliminate large deviations (e.g. purple, glow-in-the-dark leaves). Moreover, I can’t be sure how all of these variables interact.

    What do you think?

    • I don’t buy your distinction as useful. Nature puts checks on the color, size, etc. of maple leaves. Nature also puts restrictions on economic variables – restrictions that are often overlooked. For example, the old Limits to Growth models failed to include checks that economic systems provide – as resources get scarce, their prices rise, and this causes demand substitutions as well as effects on the supply side of energy sources. These checks may be harder to model than restrictions on maple leaves (I’m not sure I can say this for certain, however), but there are “natural” checks in both systems. The labeling of an economic system as somehow “unnatural” has always baffled me. Yes, economic systems are made by humans but humans are also natural. I believe the difference lies in the degree to which we can predict the behavior of the systems rather than the categorization of one as “natural” and the other as “unnatural.”

      • Our ability to predict natural systems can be pretty darn good if you ask me. A simple example: will you die one day? Or will ice melt in the desert?

        For a more complicated example: consider how we use relativity to predict the motion of the planets. This is how GPS works.

        • And so is our ability to predict systems you call non-natural. Will the Dow close above 78,000 today? How many apples will be sold today? How many Apples?

          In both the natural and the non-natural world problems of prediction come down to three issues:
          (a) Do I have a model which is predictive? (It is generally harder to prove one has a predictive model in the non-natural world, but that doesn’t mean you can’t model.)
          (b) Can I get inputs to that model that are sufficiently precise? (“Sufficiently precise” in this context covers both the unavailability of data and sensitivity to initial conditions à la chaos theory.)
          (c) Is there an irreducibly random element which makes prediction impossible? (Try predicting which slit an electron goes through. Roger Penrose and Quantum Mind theory carries this over to human choice modeling.)

          In principle, the non-natural models have a fourth element, which is whether the existence of the predictions themselves undermines the predictions of the models (in economics, this is called the Lucas Critique). But with a sufficiently nuanced concept of God, this fourth effect could be present in nature as well. Who knows?

        • I read that stuff twenty-five years ago and have never revisited it. For all I know, he has none at all.

  2. But what would be a reasonable solution?

    One option would be to consider all the variables and perform model selection over all their combinations using the BIC criterion, so as to not only reward log-likelihood but also penalize model complexity (a sketch of this procedure follows below).
    Of course, with enough data points an out-of-sample approach would be preferred.

    What do you think?
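
    To make the proposal concrete, here is a minimal sketch of exhaustive BIC-based subset selection (the function names and the Gaussian-error form of the BIC are assumptions of this sketch, and the replies below explain why selecting this way and then treating the winner as known a priori can still overfit):

```python
import itertools
import numpy as np

def gaussian_bic(y, X_sub):
    """BIC, up to an additive constant, for an OLS fit with Gaussian errors."""
    n, k = X_sub.shape
    beta = np.linalg.lstsq(X_sub, y, rcond=None)[0]
    rss = np.sum((y - X_sub @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def best_subset_by_bic(y, X):
    """Score every non-empty subset of the columns of X and return the BIC-minimizing one.

    Exhaustive search over 2^p - 1 subsets, so only feasible for small p.
    """
    p = X.shape[1]
    best_bic, best_cols = np.inf, ()
    for size in range(1, p + 1):
        for cols in itertools.combinations(range(p), size):
            bic = gaussian_bic(y, X[:, cols])
            if bic < best_bic:
                best_bic, best_cols = bic, cols
    return best_bic, best_cols
```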

    • One of the nice things about basic time-series economic models like this is that you’re never working with more than a couple of hundred observations, making true out-of-sample time-series cross-validation entirely possible.
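
      A minimal sketch of that kind of out-of-sample check, assuming time-ordered arrays X and y (hypothetical names) and an expanding training window:

```python
import numpy as np

def expanding_window_mse(X, y, min_train=40):
    """Rolling-origin evaluation: fit OLS on observations [0, t), then predict observation t."""
    errors = []
    for t in range(min_train, len(y)):
        beta = np.linalg.lstsq(X[:t], y[:t], rcond=None)[0]
        errors.append(y[t] - X[t] @ beta)
    return np.mean(np.square(errors))   # one-step-ahead mean squared error
```

      Running this for a sparse specification and for the many-predictor specification gives a direct comparison of which one actually predicts better out of sample.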

    • Please don’t do this (i.e., make up a bunch of artificially simplified versions of the most-complex model with subsets of parameters set to zero and try to select among them)! Selecting on BIC and then making predictions based on the pretense that the BIC-optimal model was known _a priori_ is still a recipe for overfitting. Andrew already gave you one solution – use realistic, informative priors to control model complexity. Another would be (as Jeff Walker and David J. Harris comment below) to use some kind of well-thought-out frequentist or Bayesian shrinkage estimator, which are often the same procedures with different derivations.

      • On the topic, rstanarm’s regularized linear model is very nifty – users supply a prior for the (out of sample?) R^2 of the model, which gives the prior great interpretability. I only had a quick read, but it appears to penalize the (sum of squared) correlations between the outcome and the Q matrix from a QR decomposition of the design matrix, which intuitively should deal better with correlated covariates than some other regularization methods. Pretty cool. https://cran.rstudio.com/web/packages/rstanarm/vignettes/lm.html

      • While it’s often (always?) the case that ML/OLS procedures can be viewed as approximate Bayes, is it really fair to say, for example, that Lasso is the same procedure (albeit differently derived) as MAP with Laplace priors? Lasso lets you shrink coefficients all the way to zero, while no proper Bayesian method that I’m aware of will do that. Every time I’ve put a regularizing prior on coefficients, it’s had fairly modest effects, certainly nothing like shrinking to zero! I’m curious if you’re aware of systematic comparisons between Bayes versus frequentist regularization in terms of both inference and prediction?

        Thanks,
        Chris

        • “is it really fair to say, for example, that Lasso is the same procedure (albeit differently derived) as MAP with Laplace priors?”
          Yes, it’s fair. With Laplace priors, the posterior mode will indeed often be 0.
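
          For the record, the standard correspondence: with likelihood $y \sim \mathrm{N}(X\beta, \sigma^2 I)$ and independent Laplace$(0, b)$ priors on the coefficients, the negative log posterior is, up to an additive constant,

          $$
          -\log p(\beta \mid y) = \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2 + \frac{1}{b}\,\lVert \beta \rVert_1 ,
          $$

          so the posterior mode solves a lasso problem with penalty $\lambda = \sigma^2/b$ (in the $\tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$ parameterization), and it is the non-differentiable $\ell_1$ term that produces exact zeros at the mode.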

        • Ah yes, I was replacing Laplace with Cauchy priors in my mind. I guess what I’m more interested in is whether regularization that allows shrinkage to zero really makes more sense than a more Bayesian approach, using the posterior distribution rather than the mode only. In what circumstances would this address the overall issue of overfitting better?

          Thanks for catching my error!

          Chris

        • So this original question was misguided, as Z pointed out below. What I’m interested in is where it might be better to deploy one of these shrink-to-zero approaches (i.e. Lasso or MAP-Laplace), versus the approach to regularization that Gelman recommends where you put on weakly informative Cauchy or Student-T priors but use the full posterior as usual, which I think of as the Bayesian approach to regularization. I only have experience with the latter, but I think it’s an interesting question. My initial guess is that shrink-to-zero would be helpful where N is very limiting relative to params or something like that. Are there systematic comparisons of out-of-sample predictive power for different types of settings?

          Thanks,
          Chris

        • Hi Andrew,
          Good to know! Is your stance changing because you think there are more problems with weakly informative priors than you initially thought, or just that strongly informative priors are better for the usual reasons? For getting regularization in a setting without a basis for strongly informative priors, do you think Lasso/MAP-Laplace can make more sense than weakly informative Cauchy or Student-t priors? I suppose this is probably too broad a question…

          Chris

  3. “six shifts in food and energy prices” – I’m not sure I understand this correctly: does this refer to abrupt changes at certain time points?
    If so, how would one use such a model to predict what happens after the seventh shift?

  4. My experience/intuitive feeling is that applying Bayesian analysis to complex models almost invariably results in overconfident credible intervals. The problem is that the calibration-target uncertainties have more covariance than specified. This is more problematic for complex models than for simple ones. I often find intervals from simple models more plausible.

    • On the overly narrow credible intervals I completely agree with you, though I’m not sure I agree on the cause. Of course we probably have different domains in mind when thinking about this issue; mine is mechanistic simulation modelling.

      My concerns are:
      – Often with complicated models I will have some routine reporting data or the like as one of my calibration targets, which will be small on the sampling uncertainty but big on the non-sampling uncertainty. Adequately capturing the data-generating mechanism (and thereby capturing the real bias/uncertainty in the calibration target) is really hard, and if you don’t get it right you get results that are biased or overly precise.
      – With a more complicated model you will generally be in a position to consider a wider array of evidence for training the model (that might be why you made the model more complicated in the first place). This can be fine if your model captures all of the mechanisms that might generate a particular pattern of data (where data == all the calibration targets), but again this is really hard to do. If your model doesn’t admit all the possible mechanisms, the estimation procedure will tweak the knobs and levers it has available to improve fit to the calibration targets. In practice I have found this issue is only somewhat dampened by the presence of informative priors.

      These are both really the same problem. The first puts the blame on undiscovered biases in the calibration targets, while the second puts the blame on the model, but same deal, and very much analogous to confounding or omitted variable bias in standard regression modelling.

      These problems could be much worse if a comparatively simpler model were applied to the same data, but with a simple model there may be less hubris about being able to predict (and calibrate to) a bunch of different things, and more intuitive understanding about how the model works and where it might break down.

  5. Andrew – there are lots of non-Bayesian methods for shrinking OLS coefficients, right? I’m thinking of ridge regression, lasso, model averaging (including model averaging using BIC), etc. Are there simulations comparing prediction error based on these methods with Bayesian methods?
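
    For anyone who wants to run such a comparison, here is a minimal sketch with scikit-learn (the simulated data and settings are invented for illustration, not results from any study); a Bayesian fit under informative priors could be added as a fourth entry to address the question directly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Simulated data: 25 predictors, most with small true effects (invented).
n, p = 150, 25
X = rng.standard_normal((n, p))
beta_true = rng.normal(0.0, 0.3, size=p)
y = X @ beta_true + rng.standard_normal(n)

models = {
    "OLS":   LinearRegression(),
    "Ridge": RidgeCV(alphas=np.logspace(-2, 3, 30)),
    "Lasso": LassoCV(cv=5),
}

# Compare out-of-sample prediction error by 5-fold cross-validation.
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated MSE = {mse:.3f}")
```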

  6. I think in a prediction context such as this, the model performance out of sample is arguably more important than getting something like the ‘right’ estimate of variable influence. I would tend to advocate building many models, some with lots of variables and some with few, and constructing an ensemble model of their predictions. In that case the individual estimates on variables in individual models *probably* aren’t as important as the general sense of their influence you get from an ensemble. But I could well be wrong!
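
    A minimal sketch of that kind of prediction averaging, assuming a handful of already-fitted models that expose a common predict() method (a hypothetical setup, not a recommendation of any particular library):

```python
import numpy as np

def ensemble_predict(models, X_new, weights=None):
    """Average the predictions of several fitted models (equal weights by default)."""
    preds = np.column_stack([m.predict(X_new) for m in models])
    if weights is None:
        weights = np.full(preds.shape[1], 1.0 / preds.shape[1])
    # Weights could instead be set from each model's out-of-sample error.
    return preds @ weights
```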

  7. “If the predictors have realistic priors, their coefs will be appropriately pulled down (in expectation) and overfitting shouldn’t be such a problem.”

    But what’s a “realistic prior”? I assume it’s one that’s based on solid background knowledge? But what do you do if you’re not able to come up with priors that are realistic? Or if you’re not sure whether your priors are realistic? Is it then OK to leave out potential predictors? How much background knowledge do you need to have about a potential predictor in order to come up with a prior that is “realistic” enough to justify putting the predictor in the model?
