Controlling for variation in the weather in a regression analysis: Joe and Uri should learn about multilevel models and then they could give even better advice

Joe Simmons and Uri Simonsohn have an interesting post here. Unfortunately their blog doesn’t have a comment section so I’m commenting here.

They write this at the end of their post:

Another is to use daily dummies. This option can easily be worse. It can lower statistical power by throwing away data. First, one can only apply daily fixed effects to data with at least two observations per calendar date. Second, this approach ignores historical weather data that precedes the dependent variable. For example, if using sales data from 2013-2015 in the analyses, the daily fixed effects force us to ignore weather data from any prior year. Lastly, it ‘costs’ 365 degrees-of-freedom (don’t forget leap year), instead of 1.

With multilevel models you can control for variation at different levels without worrying about running out of degrees of freedom. That’s one of the big selling points of multilevel modeling. Give it a shot.

Counting degrees of freedom is so 20th century.
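
For concreteness, here is a minimal sketch of what I mean, using lmer in R. The data frame and variable names are hypothetical, just to show the structure:

library(lme4)

# Varying intercept for each calendar date: the date effects are
# partially pooled toward a common mean rather than estimated as
# 365 free dummy coefficients.
# (Assumes multiple observations per date, e.g. sales across stores.)
fit <- lmer(sales ~ temperature + (1 | date), data = sales_data)
summary(fit)

The date-level intercepts are shrunk toward zero to the extent that the data don't support them, which is why you don't have to worry about "spending" a degree of freedom per day.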

P.S. Regular readers know I have a lot of respect for Simmons and Simonsohn. But, just for those of you who were not aware: these guys have done some great work. That’s one reason for this post: these are two thoughtful, careful researchers who are well versed in quantitative psychology. But they don’t know about multilevel models. They should.

48 thoughts on “Controlling for variation in the weather in a regression analysis: Joe and Uri should learn about multilevel models and then they could give even better advice”

  1. The other flaw in their logic is the assumption that ‘Year Over Year’ is the best way to assess current performance. While it is certainly the most widely used, it is hardly the only available method and certainly not the most insightful information that can be brought to the table.

    • The other comment that should not go unmentioned: if there is in fact enough data for within-day regression analyses, weather will be irrelevant to the relationship between the variables, except in the rare case of a significant weather event that occurs in the morning or afternoon and only affects one portion of the day. Though I have never seen weather data kept and utilized at that level of granularity.

  2. Since seasonality effects are more or less periodic, if you are not going to go Bayesian hierarchical modeling (with Gaussian processes using pseudo-periodic covariance functions, etc.), it makes a lot more sense to control using a Fourier series on (2*pi*day/365) and to limit your “degrees of freedom” to a defined number of terms, like 6 or so (that’s both a cos and sin term for 3 different periodicities). You might increase it to 8 or 10 terms, but that’s still a LOT less than 365 for “daily dummies”.
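
    A minimal sketch of this in R, with hypothetical column names (day_of_year in 1 to 365, y the outcome, temp the weather variable):

    K <- 3  # number of harmonics: 2*K = 6 seasonal terms
    for (k in 1:K) {
      dat[[paste0("s", k)]] <- sin(2 * pi * k * dat$day_of_year / 365)
      dat[[paste0("c", k)]] <- cos(2 * pi * k * dat$day_of_year / 365)
    }
    # seasonality is absorbed by the 2*K smooth periodic terms rather than
    # by 365 daily dummies
    fit <- lm(y ~ temp + s1 + c1 + s2 + c2 + s3 + c3, data = dat)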

    I suspect that the reason this stuff isn’t done is that people learn how to plug stuff into a function like “lm” that spits out model fits, but they don’t learn how to think about and develop models at a “mathematical” level. It’s more about “what can I plug in to this function?” than “how should I structure my explanation of the world?” Who is the master? The computer function, or the model builder?

    • Very well put. One thing that happened to me was that I started thinking about the problem as model building rather than which-computer-function only after I started assembling models in JAGS and then Stan. Until then I had been fitting only canned models.

    • “What can I plug in to this function?” is a few hours of work, maybe. To modify or write a function from scratch is weeks or months of work.

      Naturally, you see what you describe.

      • Don’t misunderstand me, it’s important to have functions, but I think the more general-purpose they are, the more you are able to structure your explanation according to what makes sense instead of what the fixed set of “buttons” will do. Shravan’s experience is the same as mine: it takes actually having a general-purpose inference engine to get you to the point where you think about the thing you’re modeling instead of thinking about the function you’re calling.

        This is why BUGS/JAGS/Stan are so great.

        • Also, with Stan (as with its predecessors) you can have example programs that do different things, and then if you want to do something new you can take some existing program as a starting point.

        • Oh, ok.

          When I commented, I didn’t realize your point was BUGS/JAGS/Stan vs the rest.

          It just sounded like a critique of people wanting to understand what all they could accomplish using pre-defined functions to solve the problem at hand rather than program a custom-made solution for every specific problem.

        • In psycholinguistics at least, and probably psychology as well, this emphasis on canned model-fitting, plug-and-play statistical analysis, has caused a big mess that will be very hard to undo.

        • Yes, but that seems to be a problem more with psych than with the tools. Perhaps the training or incentives. Lots of other users use “lm” in perfectly reasonable ways. E.g., just yesterday I saw a regression used to fit density as a function of concentration and temperature for a reactor, and it works just fine.

          You cannot blame the tool for a particular set of users that want to abuse it.

        • I think it is more a problem with the way that these tools are taught, even in statistics programs. There is no culture of thinking about plausible models of how the data were generated (given the data at hand). Psychologists seem to think that “confirmatory hypothesis testing” is something special, in that canned modeling is the right solution there (at least some think that this avoids garden-of-forking-paths). This goes back to the absurd obsession with the p-value as something that’s somehow informative. If you ask an expert about the literature in his/her field, they will divide up the literature into two categories, results that were significant and those that were not. There is no sense of parameter estimation as being a first, important goal. It’s all about the p. So the problem really boils down to how we were/are taught these things in psych and related areas, and even proper statistics programs. It’s not a problem with the tools per se, I agree.

        • At the same time, I have to concede that with Stan on the market, one can happily fit canned models and not face many problems, because there’s a mildly informative prior for everything, and if there isn’t enough evidence to overturn the prior, you will get posteriors centered around 0. But this still needs a shift in focus away from p-values and hypothesis testing and towards working out the posterior distributions of the parameters. What you do after that is strictly between you and your posterior.

        • Shravan:

          > strictly between you and your posterior
          Exactly, given an adequate sample from the posterior, there are only three possible mistakes –
          1. The prior was a mistake.
          2. The data generating model (likelihood) was a mistake.
          3. The posterior was mistakenly processed/interpreted.

          I believe the third one is often overlooked.

        • lm works well when the problem can be specified by a simple formula of the kind that is used in lm, a ~ b + c or a ~ b + b*c or whatever.

          But when you want to fit a low-order Fourier series to a weather time series and then look at the deviations from that Fourier series and do a regression on how the weather deviations affect the rate of violent crime or something like that…

          even if you can figure out how to do that in lm and get reasonable results, it’s probably a mistake.

        • Rahul, about abusing Stan. Sure, no problem, it’s easy to misuse Bayesian methods too. So it boils down to providing better education all around.

        • Although certainly education is important, I think also the tools sort of circumscribe the level of flexibility you can have, and when those tools are pretty limited, you will not teach people about the things they should be doing, because they can’t actually go out and *do* them.

          Starting with a Turing Complete language for building models should be flexible enough to avoid this problem :-)

        • Very true; the right tools are a great enabler. Nevertheless, even with standard lm functions, today I would start by teaching people to generate simulated data by building up the data bit by bit. I would not start with the function lm, but rather end with it. This way we start thinking about the possible generative process right away even with standard tools. But I would never have appreciated the value of this without having learnt to use JAGS or Stan.

        • I’m reminded of this problem from what is it now… 9 years ago!

          http://statmodeling.stat.columbia.edu/2007/12/05/mixedhierarchic/

          That was the general feeling I had in doing statistics… I had some idea of what I wanted, but if it got at all complicated, it seemed near impossible to specify to the limited tools I had, and the “formula” description was often terse enough that even when I had a very well specified mathematical model, I couldn’t figure out how to map it to a formula.

          Today, given even JAGS, but especially Stan, I’d write this model in an hour or less and be on my way (likely I could do it in 10 mins). And I’d understand exactly what the model was assuming, whereas lmer was always black-boxy; I was never sure what the heck was going on inside it.

          With this problem I’d have strong priors on the size of the coefficients, because they would have a physical meaning, something like how well people could align a measurement tool by eye. I’d know that it could be done to on the order of 1 cm linear deviation over 1m so a coefficient on x with prior of normal(0,.02) in those units would be informative but not too restrictive.

        • You are too hard on lmer, Daniel! When I start analyzing data, I usually, no, always, start with lmer. It’s only when I need to fit a customized model that cannot be fit out of the box that I move to Stan or JAGS. Also, lmer has the great advantage of speed. For many of my data sets, even with the best optimizations one can do, Stan is frustratingly slow. It breaks one’s chain of thought when one is exploring the patterns in the data (and yes, I fit many models before I settle on a final one!). Using Stan or JAGS is usually the final step in a modeling exercise for me, not the first. lmer is solid and beautiful for many, many problems. I think the problem is rather that people treat it as a black-box too much. This semester I will teach LMMs by incrementally generating data, starting with the random variable X ~ N(mu, sigma^2); I think that will help beginning students to think about the underlying generative process that we are trying to approximate.
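
          A minimal sketch of that kind of incremental simulation (all names and values hypothetical):

          set.seed(1)
          n_subj <- 30; n_obs <- 20
          subj <- rep(1:n_subj, each = n_obs)

          # build the data up piece by piece: grand mean, then subject-level
          # deviations, then trial-level noise
          mu <- 400; sigma_subj <- 50; sigma_e <- 100
          u <- rnorm(n_subj, 0, sigma_subj)
          y <- mu + u[subj] + rnorm(n_subj * n_obs, 0, sigma_e)

          library(lme4)
          fit <- lmer(y ~ 1 + (1 | subj))
          summary(fit)  # should roughly recover mu, sigma_subj, and sigma_e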

        • What the heck is gmo? I’m falling behind! I googled it but found genetically modified organism, which is probably not right.

          I want to pre-register my prediction that lmer is never going to be obsolete. Bates has been developing even faster code in Julia, and it’s really shocking how fast it is.

          I think that there is no way around the fact that for lots and lots of data, lmer will be much faster. Also, for standard models, there isn’t much of a difference between Bayesian and frequentist LMMs; the divergence happens when people get fixated on the t- and p-values in the latter. Bates went out of his way to remove p-values from lmer output, leading to widespread and generalized confusion all round; and then in 2007 Andrew went out of his way with arm to remove t-values as well, and so it is possible to focus on the important stuff even in lmer.

          For me the problem with Bayesian modeling is that it takes thought and time; I’m usually too busy to think hard about secondary issues. I just want a street-fighting-math type answer to my questions before I take the time to actually think about the data. It makes no sense to take the time to ponder over priors and do sensitivity analyses at that early stage.

        • Just count how many people still use Notepad. That’s good evidence that performance and obsolescence are not very correlated.

          Legacy advantage means a lot. You need to offer really a lot to convince entrenched users to jump ship. Not appreciating that has been the death of many a tech startup.

        • Ha. I’m pretty sure Daniel Lakeland would think it is insulting to Notepad. :)

          PS. The above argument is one reason I suspect Julia, with all its hype, is never going to really catch on.

        • Shravan:

          Fair enough. For moderate sample sizes and the particular classes of model for which it works, lmer really is fast. One thing we’ve been talking about is hacking Stan gmo to call lmer for such problems.

          Calling lmer just by itself doesn’t quite work in general, even for purely linear models, because lmer doesn’t have a plug-in for priors, which can be a big deal in hierarchical models, especially when the number of levels is low for some of the factors. So it will make sense to get some Stan in there, as adding the prior should not slow the computation—if anything, it will make the computation more stable and thus faster.
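
          In the meantime, if you want lmer-style syntax with priors right now, here is a minimal sketch using rstanarm (the data and variable names are hypothetical; this is not the gmo hack I describe above, just an existing alternative):

          library(rstanarm)

          # same formula syntax as lmer, but with an explicit prior on the
          # regression coefficients; the group-level standard deviations get
          # rstanarm's default weakly informative priors
          fit <- stan_lmer(sales ~ temperature + (1 | date), data = sales_data,
                           prior = normal(0, 1))

          (As I understand it, blme’s blmer is another option that keeps lmer’s speed while adding a prior, mainly to keep variance estimates away from the boundary.)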

        • “One thing we’ve been talking about is hacking Stan gmo to call lmer for such problems.”

          How interesting! The best of both worlds! Currently I have to assemble all my Stan models (after I have decided on them) and then just let them run overnight or over several days as a batch job, otherwise one is just stuck and waiting.

        • Shravan:

          I know some people have really big problems and do require overnight runs from Stan. But I also know that understanding what makes Stan fast can speed it up immensely. Vectorized functions in particular are very efficient because of the gradient calculations. Try writing a Stan program with loops and then the same program with vectorized math, and you’ll find the vectorized version is typically a LOT faster.

          In many of my problems the Stan compilation to C++ code is the slowest part. That can sometimes take 2 to 5 minutes, and then I get samples in 1 minute or something like that.

        • Also, Rahul, I’m an emacs user from way back in the 90’s so I gotta go with lmer > notepad :-)

          I guess these days, in many problems I attack, I already see things that aren’t of the form “just jam it into lmer and see what you come up with,” though I can certainly see the advantage of that if your problem is amenable to it. Doing that typically takes less than a second or whatever.

  3. So how would this look as a hierarchical model? Still modeling each day but as a unique random effect drawn from some overarching distribution?

    • AJ:

      Simplest would be to take whatever model you’re already fitting and add to it a date-level error term. Then look at the estimates of these and use that to suggest a refinement of your original model.
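
      In lmer-style notation, roughly (names hypothetical):

      library(lme4)
      fit <- lmer(y ~ x + (1 | date), data = dat)

      # estimated date-level effects; leftover structure here (e.g. a smooth
      # seasonal pattern) suggests what to add to the model next
      date_effects <- ranef(fit)$date[, "(Intercept)"]
      plot(date_effects, type = "l")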

  4. When someone wants to analyse the effects of day-to-day variation in a weather variable like temperature on a behavioural variable, there are a few ways to do this. Multilevel modelling with a date-level error term could work. Daniel’s suggestion of Fourier series could work too. Another option, where suitable data are available, is to calculate temperature anomalies (the observed temperature on a given date minus the historical average for that date).
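
    For the anomaly option, a minimal sketch (hypothetical column names; doy is day of year):

    # long-run average temperature for each calendar day, then the deviation
    # from it (ideally the average comes from more years than the study period)
    dat$temp_clim <- ave(dat$temp, dat$doy, FUN = mean)
    dat$temp_anom <- dat$temp - dat$temp_clim
    fit <- lm(y ~ temp_anom, data = dat)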

    The thing is, all of these are viable options *if your real interest is in the effects of random day-to-day variation* (which is only one component of variation in temperature). It’s perfectly reasonable that someone might want to actually study the effects of seasonal variation in temperature (or of geographical variation in temperature). Investigating the effects of these components of variation in temperature while trying to rule out confounding is a harder task. Goals like these might be particularly relevant if what you really care about is the effects of exposure to sustained and predictable higher temperature (e.g., because you’re interested in making inferences about the effects of climate change).

    I have written papers trying to estimate the effects of various forms of variation in temperature on behavioural variables, but
    I really don’t know what the ideal solution is here. Ideas welcome. E.g. for temperature and suicide: http://link.springer.com/article/10.1007/s10584-015-1371-9 or open access at http://tiny.cc/fyd7ay

  5. If someone runs the hierarchical model alternative for our Table 2 (http://datacolada.org/46), predicting day-length in Bangkok with temperature in Philadelphia, I will add it to the table in our post. A short explanation of what we learn from that model that the simpler plug-in historical average predictor does not tell us would be great, but is not ‘required.’

    Feel free to email me to coordinate ([email protected])

    • Hi Uri, I just downloaded the data and ran the model I recommended above with a 4th-order Fourier series. A summary of the results is on my blog:

      http://models.street-artists.org/2016/05/02/philly-vs-bangkok-thinking-about-seasonality-in-terms-of-fourier-series/

      Using “lm” to estimate just 9 coefficients (an overall average plus a 4th-order Fourier sin/cos series) absorbs the seasonality and leaves the temperature-in-Philly coefficient at the level of 10^-5 hours per degree C, with no statistical significance.

      • 9 coefficients is more or less equivalent to “monthly” dummy variables in terms of the number of things you’re estimating (9 vs 12… these are similar). But dummy variables are essentially step functions, and this is a continuous and nearly-periodic series. Making your basis functions fit the type of function better means faster convergence and hence fewer terms required and hence fewer things to estimate…. better regularization.

        As you can see on the blog, the cos terms are the ones that really “matter” because of the way the data collection is aligned with the calendar year. But the cos terms decline exponentially fast with higher order:

        the cos coefficients are: -8.4e-1, 1.1e-1, -1.8e-2, 4.2e-3

        higher order coefficients would be expected to be on the order of 1e-4, 1e-5, 1e-6 etc

        • Well, perhaps, but then I have to install arm and so forth :-)

          OK I did it this morning and I don’t like the display output in this case:

          lm(formula = day_duration_bangkok ~ temp_today_philly +
              sin(2 * pi * day/365) + cos(2 * pi * day/365) +
              sin(2 * pi * day * 2/365) + cos(2 * pi * day * 2/365) +
              sin(2 * pi * day * 3/365) + cos(2 * pi * day * 3/365) +
              sin(2 * pi * day * 4/365) + cos(2 * pi * day * 4/365), data = dat)
                                     coef.est coef.se
          (Intercept)                   14.61    0.00
          temp_today_philly              0.00    0.00
          sin(2 * pi * day/365)         -0.01    0.00
          cos(2 * pi * day/365)         -0.84    0.00
          sin(2 * pi * day * 2/365)      0.00    0.00
          cos(2 * pi * day * 2/365)      0.12    0.00
          sin(2 * pi * day * 3/365)      0.00    0.00
          cos(2 * pi * day * 3/365)     -0.02    0.00
          sin(2 * pi * day * 4/365)      0.00    0.00
          cos(2 * pi * day * 4/365)      0.00    0.00

          n = 730, k = 10
          residual sd = 0.01, R-Squared = 1.00

          Sure, for something like a social science calculation, you might never need more than this kind of precision, but this is an astronomical calculation.

  6. A slightly different perspective on this problem (maybe, or maybe just different language):

    This is a problem with using “within time effect” temporal variation to identify the coefficient of interest. To make clear what I mean: suppose you start with daily dummy variables. Now, the model is implicitly comparing measurements on a specific date across years (so, comparing July 23, 1997 with July 23, 1998 or 1999 or 2000…) and then aggregating up across all days.

    Now consider month-of-year dummies – this brings in another “temporal” dimension of identifying variation. We compare measurements across every September (in each year) but also within September (from the 1st to the 30th). This second “temporal” variation is the problem – both length of day and temperature are smoothly changing within any given month (so within September, it is getting colder in Pittsburgh and days are getting shorter in Bangkok). This is repeated “within” each month – this is what causes the problem (the same problem you see in the naive regression, just repeated 12 times in smaller bins).

    This is in general a problem that is under-discussed in the literature. People throw in year dummies and claim to have “non-parametrically controlled for time trends,” even if the variation they merge into their data occurs at the quarterly or monthly level. Thinking about within-time-dummy variation in X is more important than has been given credit in the literature (in my opinion).

    I personally would use the day-of-the-year dummies, but it is interesting that a linear control for “usual temperature” works so well here to soak up the within-period variation. I think it is mis-specified in a sense, since that relationship should not be linear, but since you are dealing with a smaller-scale problem than across a whole year, I think it just works “well enough” in this context.

    So to conclude: the point that the “day of the year” dummies would soak up a lot of variation and throw out a lot of information is true, but I think it is not the best interpretation. A better interpretation, to me, is that “day of the year” dummies focuses on only 1 comparison (across years on a single date) whereas “wider” temporal dummies allow for another comparison (days close to each other within a calendar year). You should choose which controls you want not because one is “right” but because one of them better captures the comparisons in the world that you want to make.

    • I think what “second ‘temporal’ variation” is intended to mean is that the function of interest isn’t constant within a given period, unlike the basis function chosen (the indicator function for a time period, e.g. a week for a “weekly dummy”).

      It’s a bad idea to approximate a smooth function of time by a sequence of step functions.

      It’s a well-understood idea in function approximation that you choose basis functions that match some of the important characteristics of the function you’re approximating. For example, if the frequency spectrum of the function being approximated is limited, you don’t want to use something like a step function, whose sharp edge is highly delocalized in frequency.
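
      A toy illustration of that point in R (the seasonal signal here is made up, and smooth by construction):

      day <- 1:365
      signal <- sin(2 * pi * day / 365) + 0.3 * cos(4 * pi * day / 365)
      month <- factor(rep(1:12, times = c(31,28,31,30,31,30,31,31,30,31,30,31)))

      # 12 step functions vs. 4 smooth periodic terms
      fit_step    <- lm(signal ~ month)
      fit_fourier <- lm(signal ~ sin(2*pi*day/365) + cos(2*pi*day/365) +
                                 sin(4*pi*day/365) + cos(4*pi*day/365))
      c(step = sd(resid(fit_step)), fourier = sd(resid(fit_fourier)))

      The Fourier fit captures the smooth curve with fewer coefficients, while the monthly steps leave a sawtooth of residuals within each month.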

  7. The following statement about the strength of multilevel modelling is very catchy.

    “With multilevel models you can control for variation at different levels without worrying about running out of degrees of freedom. That’s one of the big selling points of multilevel modeling. Give it a shot.

    Counting degrees of freedom is so 20th century.”

    Would you mind elaborating for readers who are not too familiar with these sorts of models? Thanks

    • One way you can think about Bayesian models is as ways of specifying tradeoffs. For example, suppose you have 100 parameters and 100 data points. In a classical model, you have no degrees of freedom left; you should be able to fit the data exactly!

      In a Bayesian model you can put priors on the 100 parameters that say, in essence, “many of these should be small (close to zero).” In order for the Bayesian machinery to “decide” that it should move one of the parameters away from zero, it has to improve the fit (increase the likelihood) by more than it hurts the prior probability. The product prior * likelihood has to increase.

      These tradeoffs are continuous, unlike “adding one more unconstrained parameter,” so you can think of them as getting away from “counting degrees of freedom,” but at the same time you can think of them, in some cases, as trading off better fit against model complexity. In some ways “model complexity” can be thought of as how much extra information you need about the parameters, over what’s already in the prior.
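
      A toy version of this in R using rstanarm (everything here is made up, just to show the shrinkage behavior): 100 observations, 100 predictors, only three of which really matter.

      library(rstanarm)
      set.seed(2)
      n <- 100; p <- 100
      X <- matrix(rnorm(n * p), n, p)
      beta <- c(3, -2, 1.5, rep(0, p - 3))   # only 3 'real' effects
      y <- as.numeric(X %*% beta + rnorm(n))
      dat <- data.frame(y = y, X)

      # lm() would use up essentially all the degrees of freedom and fit these
      # 100 points (almost) exactly; a prior saying "coefficients are probably
      # small" pulls the noise coefficients toward zero, and a coefficient only
      # moves away from zero if the gain in likelihood outweighs the prior penalty
      fit <- stan_glm(y ~ ., data = dat, prior = normal(0, 0.5))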
