That modeling feeling

Posted on July 23, 2009 10:31 AM by Andrew

It goes like this: there’s something you want to estimate and you have some data. Maybe, to take my favorite recent example, you want to break down support for school vouchers by religion, ethnicity, income, and state (or maybe you’d like to break it down even further, but you have to start somewhere).

Or maybe you want to estimate the difference between how rich and poor people vote, by state, over several decades—but you’re lazy and all you want to work with are the National Election Studies, which only have a couple thousand respondents, at most, in any year, and don’t even cover all the states.

Or maybe you want to estimate the concentration of cat allergen in a bunch of dust samples, while simultaneously estimating the calibration curve needed to get numerical estimates, all in the presence of contamination that screws up your calibration.

Or maybe you want to identify the places in the United States where it’s cost-effective to test your house for radon gas—and the data you have across the country are 80,000 noisy measurements, 5,000 accurate measurements, and some survey data and geological information.

Or maybe you want to understand how perchloroethylene is absorbed in the body—a process that is active at the time scale of minutes and also weeks—given only a couple dozen measurements on each of a few people.

Or maybe you want to get a picture of brain activity given indirect measurements from a big clanking physical device encircling a person’s head.

Or maybe you want to estimate what might have happened in past elections had the Democrats or Republicans received 1% more, or 2% more, or 3% more, of the vote.

Or maybe . . . or maybe . . .

What all these examples have in common is some data—not enough, never enough!—and a vague sense arising in my mind of what the answer should look like. Not exactly what it would look like—for example, I did not in any way anticipate the now-notorious pattern of vouchers being more popular among rich white Catholics and evangelicals and among poor blacks and Hispanics (maybe I should’ve anticipated it; I’m not proud of the level of ignorance that I had that allowed this finding to surprise me, I’m just stating the facts)—but what it could look like. Or, maybe it would be more accurate to say, various things that wouldn’t look right, if I were to see them.

And the challenge is to get from point A to point B. So, you throw model after model at the problem, method after method, alternating between quick-and-dirty methods that get you nowhere, and elaborate models that give uninterpretable, nonsensical results. Until finally you get close. Actually, what happens is that you suddenly solve the problem! Unexpectedly, you’re done! And boy is the result exciting. And you do some checking, fit to a different dataset maybe, or make some graphs showing raw data and model estimates together, or look carefully at some of the numbers, and you realize you have a problem. And you stare at your code for a long long time and finally bite the bullet, suck it up and do some active debugging, fake-data simulation, and all the rest. You code your quick graphs as diagnostic plots and build them into your procedure. And you go back and do some more modeling, and you get closer, and you never quite return to the triumphant feeling you had earlier—because you know that, at some point, the revolution will come again and with new data or new insights you’ll have to start over on this problem, but, for now, yes, yes, you can stop, you can step back and put in the time—hours, days!—to make pretty graphs, you can bask in the successful solution of a problem. You can send your graphs out there and let people take their best shot. You’ve done it.

But, not so deep inside you, that not-so-still and not-so-small voice reminds you of the compromises you’ve made, the data you’ve ignored, the things you just don’t know if you believe. You want to do more, but that will require more computing, more modeling, more theory. Yes, more theory. More understanding of what these things called models do. Because, just like storybook characters take on a life of their own, just like Gollum wouldn’t die and Frank Bascombe comes up with wisecracks all on his own, and Ramona Quimby won’t stay down even if you try to make her, and so on and so on and so on, just like these characters, each with his or her internal logic, so any statistical model worth fitting also has its internal logic, mathematical properties latent in its form but, Turing-machine-like, impossible to anticipate before applying it to data—not just “real data” (how I hate that phrase), but data from live problems. And then comes Statistical Theory—the good kind, the kind that tells us what our models can and cannot do, when they can bend with the data and when they snap. (Did you know that doubly-integrated white noise can’t really turn corners? I didn’t, until I tried to fit such a model to data that went up, then down.) And you do your best with your Theory, and your simulations, and even your computing (yuck!). But you move on. And you hope that when it’s time to come back to this problem, you’ll have some better models at hand, things like splines and time series cross sectional models, and you’ll have a programming and modeling environment where you can just write down latent factors and have them interact, and you’ll be able to include three-way interactions, and four-way interactions, and . . . and . . . you hope that in ten years you’ll be fitting the models that, ten years ago, you thought you’d be fitting in five years. And you take a rest. You write up what you found and you write up exactly what you did (not always so easy to do).

And a new question comes along. You want a quick answer. You try putting together available data in a simple way. You try some weighting. But you don’t believe your answer. You need more data. You need more model. You get to work.

That’s how it feels, from the inside.

11 thoughts on “That modeling feeling”

anon on July 23, 2009 at 7:33 am said:

Well said!! I'm at the point right now where I'm eagerly waiting to get a data set for a new project. Got kinda burned out on the last one although, like you say, I still have a bunch of questions in my mind about what I could have done or even maybe still should do on the project.
yolio on July 23, 2009 at 10:22 am said:

I really like this post.
Frank D on July 23, 2009 at 11:23 am said:

A better title for this post: 'The Modern Scientific Method'
anon2 on July 23, 2009 at 1:39 pm said:

Frank: So people not doing modeling are either out of date or unscientific? C'mon…

It might be useful to discuss how the conscience-wrestling Andrew describes (very well) gets shuffled around differently under different paradigms – but that most of it's there, in some form, for any flavor statistician.

Also, where's "That sinking feeling"? – the one you get, sometimes, when realizing that the available data just don't give a reasonable way to get from A to B.
Ethan White on July 23, 2009 at 7:26 pm said:

Fantastic
jonathan on July 24, 2009 at 7:36 am said:

There is one way: argue from first principles. As Kurt G said, "Arguing from first principles is very powerful." Of course, you're then limited to the small set of problems which can be addressed directly from first principles, but at least you know where you are!
Keith O'Rourke on July 24, 2009 at 10:19 am said:

I believe David Cox put this slightly more concisely with the comment "when working on an empirical (live) problem you can't _prevent_ someone from making a comment or suggestion that will totally change the way you thought you had resolved things"

A rough quote anyways and not at all meant to distract by authority ;-)

Keith
Josh Reich on July 25, 2009 at 8:41 am said:

You have no idea how good it feels to hear you say this. The continual self-evaluation, awareness of trade-offs and short-cuts, the acute awareness of short-cuts you know that you don't know that you are making, worrying about inductive biases.

I always thought that real statisticians worked exactly like the examples in textbooks. A follows directly from B, while I find myself touring Q, M, Zeta and D# before I settle on Bb and call it close enough.
fia on July 28, 2009 at 11:26 am said:

What a great post!
Brigindo on July 28, 2009 at 7:39 pm said:

You captured it perfectly.
Michael Bishop on August 18, 2009 at 6:13 pm said:

absolutely fantastically great post. i'd love to hear similar and/or critical thoughts from other people.
