This entry was posted by Phil Price.
A colleague is looking at data on car (and SUV and light truck) collisions and casualties. He’s interested in causal relationships. For instance, suppose car manufacturers try to improve gas mileage without decreasing acceleration. The most likely way they will do that is to make cars lighter. But perhaps lighter cars are more dangerous; how many more people will die for each mpg increase in gas mileage?
There are a few different data sources, all of them seriously deficient from the standpoint of answering this question. Deaths are very well reported, so if someone dies in an auto accident you can find out what kind of car they were in, what other kinds of cars (if any) were involved in the accident, whether the person was a driver or passenger, and so on. But it’s hard to normalize: OK, I know that N people who were passengers in a particular model of car died in car accidents last year, but I don’t know how many passenger-miles that kind of car was driven, so how do I convert this to a risk? I can find out how many cars of that type were sold, and maybe even (through registration records) how many are still on the road, but I don’t know the total number of miles. Some types of cars are driven much farther than others, on average.
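To make the normalization problem concrete, here is a minimal sketch in Python. Every number in it is made up, and the two car models and the `passenger_miles_per_vehicle` figures are hypothetical; the mileage is exactly the quantity we usually don't have. The point is only that the ranking of two cars can flip depending on whether you divide deaths by registered vehicles or by passenger-miles.

```python
# All numbers below are invented for illustration; they are not real data.

def deaths_per_vehicle(deaths, registered_vehicles):
    """Risk normalized by the number of cars on the road."""
    return deaths / registered_vehicles


def deaths_per_passenger_mile(deaths, registered_vehicles,
                              passenger_miles_per_vehicle):
    """Risk normalized by exposure: total passenger-miles driven."""
    total_miles = registered_vehicles * passenger_miles_per_vehicle
    return deaths / total_miles


# Hypothetical model A: driven a lot each year.
a = dict(deaths=30, registered_vehicles=100_000,
         passenger_miles_per_vehicle=18_000)
# Hypothetical model B: driven much less.
b = dict(deaths=20, registered_vehicles=100_000,
         passenger_miles_per_vehicle=8_000)

per_vehicle_a = deaths_per_vehicle(a["deaths"], a["registered_vehicles"])
per_vehicle_b = deaths_per_vehicle(b["deaths"], b["registered_vehicles"])
per_mile_a = deaths_per_passenger_mile(**a)
per_mile_b = deaths_per_passenger_mile(**b)

# Per registered vehicle, A looks worse; per passenger-mile, B looks worse.
print("per vehicle:", per_vehicle_a, per_vehicle_b)
print("per passenger-mile:", per_mile_a, per_mile_b)
```

With these invented inputs, model A has more deaths per registered car but fewer deaths per passenger-mile, so the choice of denominator decides which car looks more dangerous.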
Most states also have data on all accidents in which someone was injured badly enough to go to the hospital. This lets you look at things like: given that the car is in an accident, how likely is it that someone in the car will die? This sort of analysis makes heavy cars look good (for the passengers in those vehicles; not so good for passengers in other vehicles, which is also a phenomenon of interest!) but perhaps this is misleading: heavy cars are less maneuverable and have longer stopping distances, so perhaps they’re more likely to be in an accident in the first place. Conceivably, a heavy car might be a lot more likely to be in an accident, but less likely to kill the driver if it’s in one, compared to a lighter car that is better at avoiding accidents but more dangerous if it does get hit.
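The arithmetic behind that worry is just P(death) = P(accident) × P(death | accident). Here is a small Python sketch of it. The probabilities are hypothetical, picked only so that the two factors pull in opposite directions; nothing here says which way real cars actually go.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.

def death_risk_per_mile(p_accident_per_mile, p_death_given_accident):
    """Overall risk = P(accident) * P(death | accident)."""
    return p_accident_per_mile * p_death_given_accident


# Assumed: the heavy car crashes more often but protects its occupants better.
heavy_p_accident, heavy_p_death_given_accident = 4e-6, 0.002
# Assumed: the light car crashes less often but is worse once it is hit.
light_p_accident, light_p_death_given_accident = 2e-6, 0.003

heavy_risk = death_risk_per_mile(heavy_p_accident, heavy_p_death_given_accident)
light_risk = death_risk_per_mile(light_p_accident, light_p_death_given_accident)

# Conditional on an accident the heavy car looks safer, yet its overall
# per-mile risk is higher because it is in more accidents.
print("P(death | accident):", heavy_p_death_given_accident,
      light_p_death_given_accident)
print("P(death) per mile:", heavy_risk, light_risk)
```

Data that only include cars already in accidents can estimate the second factor but say nothing about the first.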
Confounding every question of interest is that different types of driver prefer different cars. Any car that is driven by a disproportionately large fraction of men in their late teens or early twenties is going to have horrible accident statistics, whereas any car that is selected largely by middle-aged women with young kids is going to look pretty good. If 20-year-old men drove Volvo station wagons, the Volvo station wagon would appear to be one of the most dangerous cars on the road, and if 40-year-old women with 5-year-old kids drove Ferraris, the Ferrari would seem to be one of the safest.
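Here is a toy version of the Volvo/Ferrari point in Python. The fleet counts and crash probabilities are invented, and I've assumed on purpose that the car has no effect at all: crash risk depends only on who is driving. The crude comparison still makes one car look about three times as dangerous as the other, while the comparison within each driver type shows no difference.

```python
# Hypothetical fleet: number of drivers by (car, driver type). Not real data.
drivers = {
    ("sporty", "young_male"): 8_000,
    ("sporty", "other"): 2_000,
    ("wagon", "young_male"): 1_000,
    ("wagon", "other"): 9_000,
}
# Assumed annual crash probability depends on the driver only, not the car.
crash_prob = {"young_male": 0.10, "other": 0.02}


def crude_rate(car):
    """Expected crashes per driver for a car, ignoring who drives it."""
    n = sum(count for (c, _), count in drivers.items() if c == car)
    crashes = sum(count * crash_prob[d]
                  for (c, d), count in drivers.items() if c == car)
    return crashes / n


def stratum_rate(car, driver_type):
    """Expected crashes per driver for a car within one driver type."""
    count = drivers[(car, driver_type)]
    return count * crash_prob[driver_type] / count


print("crude:", crude_rate("sporty"), crude_rate("wagon"))
for d in crash_prob:
    print(d, stratum_rate("sporty", d), stratum_rate("wagon", d))
```

Stratifying works here only because the toy world has a single confounder and we happen to observe it; the real problem has many, and some of them are not in the data.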
There are lots of other confounders, too. Big engines and heavy frames cost money to make, so inexpensive cars tend to be light and to have small engines, in addition to being physically small. They also tend to have less in the way of safety features (no side-curtain airbags, for example). If an inexpensive car has a poor safety record, is it because it’s light, because it’s small, or because it’s lacking safety features? And yes, size matters, not just weight: a bigger car can have a bigger “crumple zone” and thus lower average deceleration if it hits a solid object, for example. If large, heavy cars really are safer than small, light cars, how much of the difference is due to size and how much is due to weight? Perhaps a large, light car would be the best, but building a large, light car would require special materials, like titanium or aluminum or carbon fiber, which might make it a lot more expensive. So what, if anything, do we want to hold constant if we increase the fleet gas mileage? Cost? Size?
And of course the parameters I’ve listed above — size, weight, safety features, and driver characteristics — don’t begin to cover all of the relevant factors.
So: is it possible to untangle the causal influence of various factors?
Most people who work on this research topic appear to rely on linear or logistic regression, controlling for various explanatory variables, and to base their interpretations on the regression coefficients, R-squared values, etc. Is this the best that can be done? And if so, how does one figure out the right set of explanatory variables?
This is a “causal inference” question, and according to the title of this blog, this blog should be just the place for this sort of thing. So, bring it on: where do I look to find the right way to answer this kind of question?
(And, by the way, what is the answer to the question I posed at the end of this causal inference discussion?)