Good examples of lurking variables?

Rama Ganesan writes:

I have been using many of your demos from the Teaching Stats book . . . Do you by any chance have a nice easy dataset that I can use to show students how ‘lurking variables’ work using regression? For instance, in your book you talk about the relationship between height and salaries – where gender is the hidden variable.

Any suggestions?

1. Eli says:

I always liked the WWII bombing analysis — I think this was in some old textbook (by Tukey?). After the war they studied the accuracy of strategic bombing with regressions. Some things made sense (different types of bombers had different accuracy levels, higher altitude meant less accuracy). But one variable was whether enemy fighters opposed the bombers, and this had the *opposite* effect from what anyone would expect (fighter opposition meant more accuracy). Want to try to guess the hidden variable?

Cloud cover. If the weather was cloudy the enemy wouldn’t bother to send up fighters, and accuracy was terrible because in that era bombing depended on sighting landmarks on the ground.

• Andrew says:

Eli:

That’s good. What would complete the analysis would be the construction of some fake data to make the point. It would be really cool if the students in the class could themselves learn how to create the fake data.
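Here is one way such fake data might be built, as a minimal sketch. Every number in it (the chance of cloud, the chance that fighters go up, the accuracy scale and its coefficients) is invented for illustration and is not a historical estimate. Fighter opposition is given a small *negative* effect on accuracy, and cloud cover a large one.

```python
import random

random.seed(1)  # fixed seed so the fake data are reproducible


def slope(x, y):
    """Least-squares slope from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# All coefficients below are made up for the classroom demo.
missions = []
for _ in range(5000):
    cloudy = 1 if random.random() < 0.5 else 0
    # Assumption: fighters rarely go up under cloud cover.
    fighters = 1 if random.random() < (0.1 if cloudy else 0.8) else 0
    # Assumption: cloud hurts accuracy a lot, fighter opposition a little.
    accuracy = 60 - 40 * cloudy - 5 * fighters + random.gauss(0, 5)
    missions.append((cloudy, fighters, accuracy))

# Regression that ignores the weather.
naive = slope([f for _, f, _ in missions], [a for _, _, a in missions])

# Same regression, run separately for clear (0) and cloudy (1) missions.
by_weather = {}
for c in (0, 1):
    rows = [(f, a) for cc, f, a in missions if cc == c]
    by_weather[c] = slope([f for f, _ in rows], [a for _, a in rows])

print("ignoring weather:", round(naive, 1))
print("clear days only: ", round(by_weather[0], 1))
print("cloudy days only:", round(by_weather[1], 1))
```

The slope that ignores weather should come out positive, while the two within-weather slopes should land near the -5 that was built in. Students could change the probabilities and see how strong the link between cloud and fighters has to be before the sign flips.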

• Jonathan says:

You could maybe falsify the data yourself and make them try and find the issue? That would be fun too for both you and them!

2. UCB admissions data (UCBAdmissions in the datasets package of R) on the relationships among gender (exposure), admission (outcome) and department (confounding/lurking).
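For anyone without R to hand, the comparison can be sketched in a few lines of Python. The counts below were typed in by hand from the six-department UCBAdmissions table, so treat them as an assumption and check them against R before using them in class. The sketch is a plain comparison of rates, not a regression.

```python
# (admitted, rejected) by department and sex, transcribed by hand
# from R's UCBAdmissions table; verify against R before relying on them.
counts = {
    "A": {"Male": (512, 313), "Female": (89, 19)},
    "B": {"Male": (353, 207), "Female": (17, 8)},
    "C": {"Male": (120, 205), "Female": (202, 391)},
    "D": {"Male": (138, 279), "Female": (131, 244)},
    "E": {"Male": (53, 138), "Female": (94, 299)},
    "F": {"Male": (22, 351), "Female": (24, 317)},
}


def rate(admitted, rejected):
    """Share of applicants admitted."""
    return admitted / (admitted + rejected)


# Pooled admission rate for each sex, ignoring department.
overall = {}
for sex in ("Male", "Female"):
    admitted = sum(counts[d][sex][0] for d in counts)
    rejected = sum(counts[d][sex][1] for d in counts)
    overall[sex] = rate(admitted, rejected)
    print(f"overall {sex}: {overall[sex]:.2f}")

# Admission rate for each sex within each department.
for dept, by_sex in counts.items():
    m = rate(*by_sex["Male"])
    f = rate(*by_sex["Female"])
    print(f"dept {dept}: Male {m:.2f}, Female {f:.2f}")
```

Pooled over departments, men are admitted at a clearly higher rate, yet within most departments the gap shrinks or reverses, because women applied more often to the departments with low admission rates.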

• Andrew says:

I don’t like that example because it’s hard to do it as a regression.

What about trade and conflict, with distance as a lurking variable? The bivariate relationship is positive, but this simply reflects that closer pairs of states tend to trade more and fight more.
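This one is easy to fake as a regression. The sketch below assumes, purely for illustration, that trade has no direct effect on conflict at all and that both depend only on distance plus noise; the coefficients and units are invented. The slope on trade controlling for distance is computed by residualizing both variables on distance (the Frisch-Waugh-Lovell result), which gives the same number as the trade coefficient in a two-predictor regression.

```python
import random

random.seed(2)  # fixed seed so the fake data are reproducible


def fit(x, y):
    """Return (intercept, slope) from a simple least-squares regression."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y):
    """Residuals of y after regressing it on x."""
    a, b = fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Fake pairs of states: closer pairs trade more AND fight more.
# Assumption: trade has no direct effect on conflict.
distance = [random.uniform(0, 10) for _ in range(5000)]
trade = [10 - d + random.gauss(0, 1) for d in distance]
conflict = [5 - 0.5 * d + random.gauss(0, 1) for d in distance]

# Regression of conflict on trade alone.
bivariate = fit(trade, conflict)[1]

# Slope on trade after controlling for distance.
partial = fit(residuals(distance, trade), residuals(distance, conflict))[1]

print("conflict ~ trade:           ", round(bivariate, 2))
print("conflict ~ trade + distance:", round(partial, 2))
```

The first slope should be clearly positive and the second close to zero, which is the pattern described above.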

4. derek says:

To me, height is the hidden variable in discussions of the relationship between sex and salaries, not the other way round. Can both those statements be valid?

• Rama says:

Gender is closer to being the cause of salary discrepancy than height. Now, there might be various lurking variables between gender and salary, and Andrew has another blog post on that (Andrew, could you please link? Thanks!)

• jimmy says:

“closer to being the cause?” what does that mean?

• It means that the “cause” is something like the long-term biases built into societal norms, but those biases are more related to gender than to height. I would guess that within gender, height might be an issue, but that across genders you’ll find that even people of the same height have a gender discrepancy. Combine that with the fact that the two genders tend to have different average heights, and you get a large portion of the bias.

• derek says:

How do you distinguish that statement from “within height, gender might be an issue, but that across heights you’ll find that even people of the same gender have a height discrepancy”?

• Rama says:

In consumer research, an area that I am getting to know, they are always looking for intervening variables or ‘process measures’. A causes C via B. So they set up experiments, where they show that A causes C. Then they measure/manipulate B to show the causality.

• jimmy says:

dan, thanks for the reply. by more related, do you mean more correlated? and when you talk about looking within genders vs between genders, i think that is an example of when andrew talks about the all else equal fallacy.

i preface this with the disclaimer that i don’t really know what i am talking about.
rama, is that consumer research example your definition of a lurking variable? would you define a lurking variable as a confounder too? (again, i have no clue what any of these terms mean.) if yes, then something seems off. definitions of confounding usually say that the confounder cannot be in the causal path.

• Rama says:

Jimmy (There is no reply button on your post so I’m replying to myself) A confound is something that you need to get rid of to get your paper published. An intervening variable/mediator is something you need to have to get your paper published. That just about sums it up for me.

5. Matt says:

I seem to remember one from university about how the number of storks in an area was a fantastic predictor of the number of babies born in areas of Oslo. Turns out the hidden variable was the number of chimneys in the area, as storks like nesting there!

• Rama says:

I love this image of storks and babies and chimneys. However, why would number of chimneys be related to number of babies born in an area? Now that question leads me to other kinds of images: are fireplaces an aphrodisiac?

• Phil says:

Perhaps there are more chimneys in more densely populated areas.

• number of chimneys is closely related to number of people.

• Rama says:

So in this case, number of people is the lurking variable.

• people with chimneys. In a region where chimneys are uncommon the number of people would be less related.

6. Tom says:

Aren’t “lurking” or moderator variables in social research really the same thing as instrumental variables in econometrics?

• I actually have the same question.

• Soren says:

To be clear, I understand a ‘lurking variable’ to be an ‘omitted variable’ in the metrics sense.

The answer, then, is no. Recall that instrumental variables (IV) are used to ‘instrument’ for variables that we believe *are* correlated with the error term. For instance, consider a model that describes earnings as a function of education, age, and age^2. It’s likely that ability affects wages, but ability is unobserved and is correlated with education. [So ability is the omitted variable, and thus is the lurking variable to which you were referring?] Since ability is in the error term, u, we have bias in the parameter on education. To avoid this, we look for a variable Z that

(1) is correlated with education, so that Cov(educ, Z) ≠ 0

(2) is not correlated with the error, so that Cov(Z,u) = 0.

Then by estimating the model:

educ = A + B*Z + u_2

and using its predictions (which are notably *not* correlated with ability) in place of educ in the earnings equation, we are able to estimate the effect of education without the bias from ability, which is what we wanted.

So a ‘lurking variable’ is not the same thing as an ‘instrumental variable’ because they have strictly different definitions, but an IV is sometimes used in response to a well specified lurking variable.
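The two-step procedure above can be sketched on fake data. Everything in the sketch is an assumption made for illustration: a binary instrument z that is independent of ability, a true return to education of 0.10, and invented coefficients elsewhere. Age and age^2 are left out to keep it short.

```python
import random

random.seed(3)  # fixed seed so the fake data are reproducible


def fit(x, y):
    """Return (intercept, slope) from a simple least-squares regression."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b


z, educ, wage = [], [], []
for _ in range(20000):
    ability = random.gauss(0, 1)            # unobserved; ends up in the error u
    zi = 1 if random.random() < 0.5 else 0  # instrument, independent of ability
    e = 12 + 2 * zi + 1.5 * ability + random.gauss(0, 1)
    # Assumed true effect of education on (log) wage is 0.10.
    w = 1 + 0.10 * e + 0.30 * ability + random.gauss(0, 0.5)
    z.append(zi)
    educ.append(e)
    wage.append(w)

# Naive regression of wage on educ: biased because ability is omitted.
ols = fit(educ, wage)[1]

# First stage: educ = A + B*Z + u_2, then keep the predictions.
a_hat, b_hat = fit(z, educ)
educ_hat = [a_hat + b_hat * zi for zi in z]

# Second stage: regress wage on the predicted education.
iv = fit(educ_hat, wage)[1]

print("OLS slope on educ:", round(ols, 3))
print("IV slope on educ: ", round(iv, 3))
```

The OLS slope should sit well above 0.10 while the IV slope lands near it. Doing the second stage by hand like this gives the right point estimate, but the standard errors it would report are not the correct IV standard errors.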

• Thank you. That helped a lot to distinguish the two concepts. I’ll keep in mind that “an IV is sometimes used in response to a well specified lurking variable”.

7. jimmy says:

hi andrew, (eli’s example is pretty good.) do you think it would be worthwhile to define some of the terms? i always find this conversation confusing, because i do not ever really know what people mean when they use these terms above. what is a lurking variable? is it a confounder? if yes, what is confounding? people will sometimes say (as part of their definition) that the confounder is associated with both exposure and outcome. however, others will strengthen this requirement and say that the confounder has to cause exposure and outcome. if so, is the uc berkeley admissions example an example of confounding? but why does it make sense to say that department causes gender? if a lurking variable is not a confounder, what is it? and then, how do other things such as interaction, instrumental variables, etc fit? how do you think about this?

8. Soren says:

What about David Card’s classic dataset on returns to schooling, where he instruments education with a dummy for proximity to a four-year college? The lurking variable here, as usual, is ability, which is correlated with education (meaning that an explanatory variable is correlated with the error term). The dataset is available from Wooldridge’s Intermediate metrics text, and is available for download at the following link:

• Rama says:

• tgs says:

From the fellow’s website:

Note: The .dta file is a Stata dataset that should be downloaded and opened in Stata. The .def file is a text file with definitions of the variables and sample statistics. Open the latter in any text editor.

9. afoss says:

My favorite is the (possibly apocryphal) finding that ice cream consumed and number of drownings are correlated, with the lurking variable being hot weather.

10. Nick Cox says:

Brian L. Joiner. 1981.
Lurking variables: some examples.
The American Statistician 35(4): 227-233.