Six quick tips to improve your regression modeling

It’s Appendix A of ARM (Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models):

A.1. Fit many models

Think of a series of models, starting with the too-simple and continuing through to the hopelessly messy. Generally it’s a good idea to start simple. Or start complex if you’d like, but prepare to quickly drop things out and move to the simpler model to help understand what’s going on. Working with simple models is not a research goal—in the problems we work on, we usually find complicated models more believable—but rather a technique to help understand the fitting process.

A corollary of this principle is the need to be able to fit models relatively quickly. Realistically, you don’t know what model you want to be fitting, so it’s rarely a good idea to run the computer overnight fitting a single model. At least, wait until you’ve developed some understanding by fitting many models.

A.2. Do a little work to make your computations faster and more reliable

This sounds like computational advice but is really about statistics: if you can fit models faster, you can fit more models and better understand both data and model. But getting the model to run faster often has some startup cost, either in data preparation or in model complexity.

Data subsetting . . .

Fake-data and predictive simulation . . .
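The fake-data idea can be sketched in a few lines: pick “true” parameter values, simulate data from the model, refit, and check that the procedure recovers what you put in. This is a minimal illustration on invented data, not an example from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake-data simulation: choose "true" parameters, simulate data from
# the model, refit, and check that the fit recovers the assumed values.
n = 1000
a_true, b_true, sigma_true = 1.5, 2.0, 0.8
x = rng.uniform(0, 10, size=n)
y = a_true + b_true * x + rng.normal(0, sigma_true, size=n)

# Ordinary least squares on the simulated data.
X = np.column_stack([np.ones(n), x])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)
resid_sd = (y - X @ np.array([a_hat, b_hat])).std()

print(f"a_hat={a_hat:.2f}  b_hat={b_hat:.2f}  resid_sd={resid_sd:.2f}")
```

If the estimates land far from the assumed values, either the code or the understanding of the model is wrong, and it is much cheaper to discover that here than on real data.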

A.3. Graphing the relevant and not the irrelevant

Graphing the fitted model

Graphing the data is fine (see Appendix B) but it is also useful to graph the estimated model itself (see lots of examples of regression lines and curves throughout this book). A table of regression coefficients does not give you the same sense as graphs of the model. This point should seem obvious but can be obscured in statistical textbooks that focus so strongly on plots for raw data and for regression diagnostics, forgetting the simple plots that help us understand a model.

Don’t graph the irrelevant

Are you sure you really want to make those quantile-quantile plots, influence diagrams, and all the other things that spew out of a statistical regression package? What are you going to do with all that? Just forget about it and focus on something more important. A quick rule: any graph you show, be prepared to explain.

A.4. Transformations

Consider transforming every variable in sight:
• Logarithms of all-positive variables (primarily because this leads to multiplicative models on the original scale, which often makes sense)
• Standardizing based on the scale or potential range of the data (so that coefficients can be more directly interpreted and scaled); an alternative is to present coefficients in scaled and unscaled forms
• Transforming before multilevel modeling (thus attempting to make coefficients more comparable, thus allowing more effective second-level regressions, which in turn improve partial pooling).

Plots of raw data and residuals can also be informative when considering transformations (as with the log transformation for arsenic levels in Section 5.6).

In addition to univariate transformations, consider interactions and predictors created by combining inputs (for example, adding several related survey responses to create a “total score”). The goal is to create models that could make sense (and can then be fit and compared to data) and that include all relevant information.
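A minimal sketch of these transformations on invented data (the variable names and scales here are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: an all-positive outcome, a roughly symmetric predictor,
# and four related 1-5 survey responses.
n = 500
earnings = rng.lognormal(mean=10.0, sigma=0.5, size=n)
height = rng.normal(170.0, 10.0, size=n)
survey = rng.integers(1, 6, size=(n, 4))

# Log an all-positive variable: additive effects on the log scale are
# multiplicative effects on the original scale.
log_earnings = np.log(earnings)

# Standardize a predictor so its coefficient is read per standard
# deviation of the input rather than per raw unit.
z_height = (height - height.mean()) / height.std()

# Combine related inputs into a single "total score" predictor.
total_score = survey.sum(axis=1)

print(round(z_height.mean(), 6), round(z_height.std(), 6),
      total_score.min(), total_score.max())
```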

A.5. Consider all coefficients as potentially varying

Don’t get hung up on whether a coefficient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small (as with the varying slopes for the radon model in Section 13.1), maybe you can ignore it if that would be more convenient.

Practical concerns sometimes limit the feasible complexity of a model—for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keep us from adding even more complexity, more varying coefficients, and more interactions.
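The pull toward the grand mean that makes varying coefficients safe to include can be seen in the standard partial-pooling formula: each group’s estimate is a precision-weighted compromise between its own mean and the grand mean. A small sketch with invented group sizes and scales (the behavior of the weights, not the specific numbers, is the point):

```python
import numpy as np

rng = np.random.default_rng(2)

# Partial pooling of group intercepts: small groups are pulled strongly
# toward the grand mean, large groups mostly keep their own mean.
sigma_y, sigma_alpha = 1.0, 0.5          # within- and between-group sds
group_sizes = np.array([3, 10, 100])
grand_mean = 5.0

# Simulated observed group means (true effects drawn around grand_mean).
ybar = np.array([rng.normal(grand_mean + rng.normal(0, sigma_alpha),
                            sigma_y / np.sqrt(n_j)) for n_j in group_sizes])

# Pooling weight: precision of the group mean (n_j / sigma_y^2) against
# the precision of the group-level distribution (1 / sigma_alpha^2).
w = (group_sizes / sigma_y**2) / (group_sizes / sigma_y**2 + 1 / sigma_alpha**2)
alpha_hat = w * ybar + (1 - w) * grand_mean

for n_j, w_j, a_j in zip(group_sizes, w, alpha_hat):
    print(f"n={n_j:3d}  weight on group mean={w_j:.2f}  estimate={a_j:.2f}")
```

When the estimated between-group sd is small, the weights all shrink toward zero and the varying coefficient collapses back toward a constant, which is why letting it vary costs little.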

A.6. Estimate causal inferences in a targeted way, not as a byproduct of a large regression

Don’t assume that a regression coefficient can be interpreted causally. If you are interested in causal inference, consider your treatment variable carefully and use the tools of Chapters 9, 10, and 23 to address the difficulties of comparing comparable units to estimate a treatment effect and its variation across the population. It can be tempting to set up a single large regression to answer several causal questions at once; however, in observational settings (including experiments in which certain conditions of interest are observational), this is not appropriate, as we discuss at the end of Chapter 9.

First day of class update

I got to class on time. The class went ok but I spent too much time talking, which is what happens when I don’t put a lot of effort ahead of time into making sure I don’t spend too much time talking.

My first-day-of-class activity was ok but I think I needed another activity for the students, something more statistical, to better set the tone of the course.

I think I should’ve given them a 10-minute work-in-pairs activity where I’d first give them some real-world problem and then ask them, in pairs, to design a study to address it. The problem could be anything: it could be to assess Ebola risks or answer questions about political ideology and personality, or even to design a plan for assessing the effectiveness of this course that they’re taking. Just something that would get them communicating, but also thinking about statistics in some detail. Not just talking generally about the cool problems they’re working on or are interested in, but some attempt to get into details.

We could do this next class, of course, but I already have things planned. So maybe this will have to wait until the next time I teach the course.

Just in case

Hi, R. Could you please prepare 50 handouts of the attached draft course plan (2-sided printing is fine) to hand out to students? I prefer to do this online but it sounds like there’s some difficulty with that, so we can do handouts on this first day of class.


My Amtrak train has been rescheduled and is now due to arrive in Boston at 4:35. This should give me plenty of time to get to class on time, but Amtrak is sometimes delayed. So if class begins and I am not there yet, please start without me!

If I’m not there, please do the following:

- Get to the room 10 minutes early. Before class begins, chat with the students as they are coming in. You can talk about any topic, as long as it’s statistical: tell them about your qualifying exam, or discuss how to express uncertainty in weather forecasts, or talk about the Celtics (ha ha). No need to be lecturing here, just get them on track, thinking and talking about statistics. Also during that time, please get the projector set up so that, when I do arrive, I can plug in my laptop and be all ready to go.

- Once class begins (I don’t remember the convention at Harvard; will it start exactly at the scheduled time, or 5 minutes later?), start right away with a statistics story. I have stories of my own prepared, but if I’m not there, you can do one yourself. Prepare something; feel free to use the blackboard. It doesn’t have to be a long story; 5 or 10 minutes will be fine.

- Then write the following on the blackboard: “(a) Say something about yourself or your work in relation to statistics, (b) Why are you in this class?”

- Have the students divide into pairs. In pairs, they meet each other:
(3 min) A talks to B
(2 min) B asks a question to A, and A responds
(3 min) B talks to A
(2 min) A asks a question to B, and B responds
They are supposed to be talking to each other about their work in relation to statistics.

- If not all the students fit in the room, that’s not really a problem; you can have the overflow people in the lounge area, doing the same thing.

Once the students have done the intros in pairs, take a few volunteers (or, if there are no volunteers, pick some students and ask them to pick other students) to stand up and answer questions (a) and (b) above. Use these to lead the class into discussions that loop around to consider the relevance and different varieties of statistical communication.

Really, this can take all the class period. But I assume that at some point I’ll arrive—how delayed could Amtrak be, after all?? I just wanted to give you some contingency plan so that nobody has to worry if it’s 6:25 and I’m still not there.


See you

P.S. Here’s what happened.

About a zillion people pointed me to yesterday’s xkcd cartoon

I have the same problem with Bayes factors, for example this:

[screenshot: a table of Bayes factor thresholds with verbal labels]

and this:

[screenshot: another Bayes factor interpretation table]

(which I copied from Wikipedia, except that, unlike you-know-who, I didn’t change the n’s to d’s and remove the superscripting).

Either way, I don’t buy the numbers, and I certainly don’t buy the words that go with them.

I do admit, though, to using the phrase “statistically significant.” It doesn’t mean so much, but, within statistics, everyone knows what it means, so it’s convenient jargon.

P.S. Kruschke had a similar reaction.

Crowdsourcing data analysis: Do soccer referees give more red cards to dark skin toned players?

Raphael Silberzahn Eric Luis Uhlmann Dan Martin Pasquale Anselmi Frederik Aust Eli Christopher Awtrey Štěpán Bahník Feng Bai Colin Bannard Evelina Bonnier Rickard Carlsson Felix Cheung Garret Christensen Russ Clay Maureen A. Craig Anna Dalla Rosa Lammertjan Dam Mathew H. Evans Ismael Flores Cervantes Nathan Fong Monica Gamez-Djokic Andreas Glenz Shauna Gordon-McKeon Tim Heaton Karin Hederos Eriksson Moritz Heene Alicia Hofelich Mohr Fabia Högden Kent Hui Magnus Johannesson Jonathan Kalodimos Erikson Kaszubowski Deanna Kennedy Ryan Lei Thomas Andrew Lindsay Silvia Liverani Christopher Madan Daniel Molden Eric Molleman Richard D. Morey Laetitia Mulder Bernard A. Nijstad Bryson Pope Nolan Pope Jason M. Prenoveau Floor Rink Egidio Robusto Hadiya Roderique Anna Sandberg Elmar Schlueter Felix S Martin Sherman S. Amy Sommer Kristin Lee Sotak Seth Spain Christoph Spörlein Tom Stafford Luca Stefanutti Susanne Täuber Johannes Ullrich Michelangelo Vianello Eric-Jan Wagenmakers Maciej Witkowiak SangSuk Yoon and Brian A. Nosek write:

Twenty-nine teams involving 61 analysts used the same data set to address the same research questions: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players and whether this relation is moderated by measures of explicit and implicit bias in the referees’ country of origin. Analytic approaches varied widely across teams. For the main research question, estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. Twenty teams (69%) found a significant positive effect and nine teams (31%) observed a non-significant relationship. The causal relationship however remains unclear. No team found a significant moderation between measures of bias of referees’ country of origin and red card sanctionings of dark skin toned players. Crowdsourcing data analysis highlights the contingency of results on choices of analytic strategy, and increases identification of bias and error in data and analysis. Crowdsourcing analytics represents a new way of doing science; a data set is made publicly available and scientists at first analyze separately and then work together to reach a conclusion while making subjectivity and ambiguity transparent.

“It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

I recently bumped into this 2013 paper by Christian Robert and myself, “‘Not Only Defended But Also Applied': The Perceived Absurdity of Bayesian Inference,” which begins:

Younger readers of this journal may not be fully aware of the passionate battles over Bayesian inference among statisticians in the last half of the twentieth century. During this period, the missionary zeal of many Bayesians was matched, in the other direction, by a view among some theoreticians that Bayesian methods are absurd—not merely misguided but obviously wrong in principle. Such anti-Bayesianism could hardly be maintained in the present era, given the many recent practical successes of Bayesian methods. But by examining the historical background of these beliefs, we may gain some insight into the statistical debates of today. . . .

The whole article is just great. I love reading my old stuff!

Also we were lucky to get several thoughtful discussions:

“Bayesian Inference: The Rodney Dangerfield of Statistics?” — Steve Stigler

“Bayesian Ideas Reemerged in the 1950s” — Steve Fienberg

“Bayesian Statistics in the Twenty First Century” — Wes Johnson

“Bayesian Methods: Applied? Yes. Philosophical Defense? In Flux” — Deborah Mayo

And our rejoinder, “The Anti-Bayesian Moment and Its Passing.”

Good stuff.

“The Statistical Crisis in Science”: My talk this Thurs at the Harvard psychology department

Noon Thursday, January 29, 2015, in William James Hall, room 765:

The Statistical Crisis in Science

Andrew Gelman, Dept of Statistics and Dept of Political Science, Columbia University

Top journals in psychology routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or embodied cognition, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

Here are the slides from the last time I gave this talk, and here are some relevant articles:

[2014] Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641–651. (Andrew Gelman and John Carlin)

[2014] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management. (Andrew Gelman)

[2013] It’s too hard to publish criticisms and obtain data for replication. Chance 26 (3), 49–52. (Andrew Gelman)

[2012] P-values and statistical practice. Epidemiology. (Andrew Gelman)

The (hypothetical) phase diagram of a statistical or computational method

[sketch: hypothetical phase diagram showing the regions of conditions where each method performs best]

So here’s the deal. You have a new idea, call it method C, and you try it out on problems X, Y, and Z and it works well—it destroys the existing methods A and B. And then you publish a paper with the pithy title, Method C Wins. And, hey, since we’re fantasizing here anyway, let’s say you want to publish the paper in PPNAS.

But reviewers will—and should—have some suspicions. How great can your new method really be? Can it really be that methods A and B, which are so popular, have nothing to offer anymore?

Instead give a sense of the bounds of your method. Under what conditions does it win, and under what conditions does it not work so well?

In the graph above, “Dimension 1” and “Dimension 2” can be anything: they could be sample size and number of parameters, or computing time and storage cost, or bias and sampling error, whatever. The point is that a method can be applied under varying conditions. And, if a method is great, what that really means is that it works well under a wide range of conditions.

So, make that phase diagram. Even if you don’t actually draw the graph or even explicitly construct a definition of “best,” you can keep in mind the idea of exploring the limitations of your method, coming up with places where it doesn’t perform so well.
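A toy version of this exercise, with invented settings: compare two estimators of a distribution’s center (sample mean vs. sample median) under two conditions. Neither dominates, which is exactly the kind of boundary a phase diagram would map:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "phase diagram" probe: the mean wins on clean normal data, the
# median wins on heavy-tailed data. True center is 0 in both cases.
def rmse(estimator, sampler, n=50, reps=2000):
    draws = sampler((reps, n))
    return float(np.sqrt(np.mean(estimator(draws, axis=1) ** 2)))

clean = lambda size: rng.normal(0.0, 1.0, size=size)       # normal data
heavy = lambda size: rng.standard_t(df=2, size=size)       # heavy tails

rmse_mean_clean = rmse(np.mean, clean)
rmse_median_clean = rmse(np.median, clean)
rmse_mean_heavy = rmse(np.mean, heavy)
rmse_median_heavy = rmse(np.median, heavy)

print(f"normal data: mean {rmse_mean_clean:.3f}, median {rmse_median_clean:.3f}")
print(f"t(2) data:   mean {rmse_mean_heavy:.3f}, median {rmse_median_heavy:.3f}")
```

Sweeping such a comparison over a grid of conditions (here, say, sample size against tail heaviness) and recording which method wins in each cell is one concrete way to draw the phase diagram.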

On deck this week

Mon: The (hypothetical) phase diagram of a statistical or computational method

Tues: “It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

Wed: Six quick tips to improve your regression modeling

Thurs: “Another bad chart for you to criticize”

Fri: Cognitive vs. behavioral in psychology, economics, and political science

Sat: Economics/sociology phrase book

Sun: Oh, it’s so frustrating when you’re trying to help someone out, and then you realize you’re dealing with a snake

Tell me what you don’t know

We’ll ask an expert, or even a student, to “tell me what you know” about some topic. But now I’m thinking it makes more sense to ask people to tell us what they don’t know.

Why? Consider your understanding of a particular topic to be divided into three parts:
1. What you know.
2. What you don’t know.
3. What you don’t know you don’t know.

If you ask someone about 1, you get some sense of the boundary between 1 and 2.

But if you ask someone about 2, you implicitly get a lot of 1, you get a sense of the boundary between 1 and 2, and you get a sense of the boundary between 2 and 3.

As my very rational friend Ginger says: More information is good.