
Apology to George A. Romero

This came in the email one day last year:

Good Afternoon Mr. Gelman,

I am reaching out to you on behalf of Pearson Education who would like to license an excerpt of text from How Many Zombies Do You Know? for the following, upcoming textbook program:

Title: Writing Today
Author: Richard Johnson-Sheehan and Charles Paine
Edition: 3
Anticipated Pub Date: 01/2015

For this text, beginning with “The zombie menace has so far,” (page 101) and ending with “Journal of the American Statistical Association,” (409-423), Pearson would like to request US & Canada distribution, English language, a 150,000 print run, and a 7 year term in all print and non-print media versions, including ancillaries, derivatives and versions whole or in part.

The requested material is approximately 550 words and was originally published March 31, 2010 on Scienceblogs.com

If you could please look over the attached license request letter and return it to us, it would be much appreciated. If you need to draw up an invoice, please include all granted rights within the body of your invoice (the above, underlined portion). . . .

I decided to charge them $150 (I had no idea, I just made that number up) and I sent along the following message:

Also, at the bottom of page 2, they have a typo in my name (so please cross that out and replace with my actual last name!) and also please cross out “Author: George A. Romano”. Finally, please cross out the link (http://scienceblogs.com/appliedstatistics/2010/07/01/how-many-zombies-do-you-know-u/) and replace by: http://arxiv.org/pdf/1003.6087.pdf

I got the $150 and they told me they’d send me a copy of the book. And last month it came in the mail. So cool! I’ve always fancied myself a writer so I loved the idea of having an entry in a college writing textbook. (Yeah, yeah, I know some people say that college is a place where kids learn how to write badly. Whatever.)

I quickly performed what Yair calls a “Washington read” and found my article. It’s right there on page 266, one of the readings in the Analytical Reports chapter. B-b-b-ut . . . they altered my deathless prose!

– They removed the article’s abstract. That’s fine, the abstract wasn’t so funny.

– My name in the author list pointed to the following hilarious footnote which they removed: “Department of Statistics, Columbia University, New York. Please do not tell my employer that I spent any time doing this.”

– George A. Romero’s name in the author list pointed to the following essential footnote which they removed: “Not really.”

– They changed “et al.” to “et. al.” That’s just embarrassing for them. Making a mistake is one thing, but changing something correct into a mistake, that’s just sad. It reminds me of when one of my coauthors noticed the word “comprises” in the paper I’d written and scratched it out and replaced it with “is comprised of.” Ugh.

– They removed Section 4 of the paper, which read:

Technical note

We originally wrote this article in Word, but then we converted it to Latex to make it look more like science.

Ouch. That hurts.

But the biggest problem was, by keeping Romero’s name on the article and removing the disclaimer, they made it look like Romero actually was involved in this silly endeavor. Indeed, in their intro they refer to “the authors,” and later they refer to “Gelman and Romero’s article.” That’s better than “Gleman and Romano,” but, still, it doesn’t seem right to assign any of the blame for this to Romero. I’d have no problem sharing the credit but I have no idea how he’d feel about it.

At least they kept in the ZDate joke.

P.S. Overall I’m happy to see my article in this textbook. But it’s funny to see where it got messed up.

I actually think this infographic is ok

Under the heading, “bad charts,” Mark Duckenfield links to this display by Quoctrung Bui and writes:

So much to go with here, but I [Duckenfield] would just highlight the bars as the most egregious problem, as it is implied that the same number of people are in each category. Obviously that is not the case — the top 1% and the 90-99% group, even if the coverage were comprehensive, which it isn’t, would have fewer people in them than the decile groups.

But even more to the point, there is no reason to think that the top 10 jobs in each category all yield the same total number of jobs, since there must be dozens, if not more, broad categories of employment, most of which are left off. They’d have done better to have a big “other jobs” block at the end so the bars balanced out. But I suspect the coverage of the top 10 jobs in each category is under 50%, so you’d see a lot of “other.”

And this leads to the implication that the total number of people making certain incomes might be the same. But all that is the same is their percentage of the employment among the top 10 jobs in each income range.

And I’m not entirely sure that the median salary, which conveniently looks to be $40K, is correct. Household income is more like $50K, and the median wage earner had a *median* net compensation, according to Social Security, of around $27,500 (although I expect that has deductions and other withholdings from SS wages, like health insurance). And the *average* net compensation was about $42,500.

My reply: I agree that these graphs have problems but I kinda like them because they do contain a lot of information, if you don’t over-interpret them.

The connection between varying treatment effects and the well-known optimism of published research findings

Jacob Hartog writes:

I thought this article [by Hunt Allcott and Sendhil Mullainathan], although already a couple of years old, fits very well into the themes of your blog—in particular the idea that the “true” treatment effect is likely to vary a lot depending on all kinds of factors that we can and cannot observe, and that especially large estimated effects are likely telling us as much about the sample as about the Secrets of the Social World.

They find that sites that choose to participate in randomized controlled trials are selected on characteristics correlated with the estimated treatment effect, and they have some ideas about “suggestive tests of external validity.”

I’d be curious about where you agree and disagree with their approach.

I passed this along to Avi, who wrote:

I’m actually a big fan of this paper (and of Hunt and Sendhil). Rather than look at the original NBER paper, however, I’d point you to Hunt’s recent revision, which looks at 111 experiments (!!) rather than the 14 experiments analyzed in the first study.

In particular, Hunt uses the first 10 experiments they conducted to predict the results of the next 101, finding that the predicted effect is significantly larger than the observed effect in those remaining trials.

Good stuff. I haven’t had the time to look at any of this, actually, but it does all seem both relevant to our discussions and important more generally. It’s good to see economists getting into this game. The questions they’re looking at are similar to issues of Type S and Type M error that are being discussed in psychology research, and I feel that, more broadly, we’re seeing a unification of models of the scientific process, going beyond the traditional “p less than .05” model of discovery. I’m feeling really good about all this.
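To give a sense of what Type S and Type M errors look like in this low-power setting, here’s a minimal simulation sketch; the true effect, standard error, and significance cutoff are made-up illustration numbers, not values from any of the studies discussed above:

```python
import numpy as np

# Hypothetical setting: a small true effect measured with a noisy estimate.
# All numbers here are assumptions chosen for illustration.
true_effect = 0.1      # true average treatment effect (assumed)
std_error   = 0.25     # standard error of each study's estimate (assumed)
n_sims      = 1_000_000
rng = np.random.default_rng(0)

# Simulate many replications of the same study.
estimates   = rng.normal(true_effect, std_error, size=n_sims)
significant = np.abs(estimates / std_error) > 1.96   # the "p < .05" filter

# Look only at the studies that reach statistical significance.
power        = significant.mean()
type_s       = np.mean(estimates[significant] < 0)                     # wrong sign
exaggeration = np.mean(np.abs(estimates[significant])) / true_effect   # Type M

print(f"power (share of studies reaching significance): {power:.2f}")
print(f"Type S error rate (wrong sign, given significance): {type_s:.2f}")
print(f"exaggeration ratio (Type M): {exaggeration:.1f}x")
```

Under these made-up numbers only a few percent of the simulated studies reach significance, and the ones that do overestimate the effect several-fold; that’s the same flavor of optimism the early-versus-later comparison of the experiments is picking up.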

“I mean, what exact buttons do I have to hit?”

While looking for something else, I happened to come across this:

Unfortunately there’s the expectation that if you start with a scientific hypothesis and do a randomized experiment, there should be a high probability of learning an enduring truth. And if the subject area is exciting, there should consequently be a high probability of publication in a top journal, along with the career rewards that come with this. I’m not morally outraged by this: it seems fair enough that if you do good work, you get recognition. I certainly don’t complain if, after publishing some influential papers, I get grant funding and a raise in my salary, and so when I say that researchers expect some level of career success and recognition, I don’t mean this in a negative sense at all.

I do think, though, that this attitude is mistaken from a statistical perspective. If you study small effects and use noisy measurements, anything you happen to see is likely to be noise, as is explained in this now-classic article by Katherine Button et al. On statistical grounds, you can, and should, expect lots of strikeouts for every home run—call it the Dave Kingman model of science—or maybe no home runs at all. But the training implies otherwise, and people are just expecting the success rate you might see if Rod Carew were to get up to bat in your kid’s Little League game.

To put it another way, the answer to the question, “I mean, what exact buttons do I have to hit?” is that there is no such button.
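To put some made-up numbers behind the strikeouts-and-home-runs picture, here’s a small simulation sketch; the share of real effects, their size, and the noise level are all hypothetical choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n_studies = 100_000

# A hypothetical research field: most hypotheses are essentially null,
# a minority have small real effects, and measurements are noisy.
is_real     = rng.random(n_studies) < 0.10          # assumed: 10% real effects
true_effect = np.where(is_real, 0.2, 0.0)           # assumed: small when real
std_error   = 0.3                                   # assumed: noisy measurements

estimates   = rng.normal(true_effect, std_error)
significant = np.abs(estimates / std_error) > 1.96  # what gets called a "finding"

false_share = np.mean(~is_real[significant])
print(f"share of studies reaching significance: {significant.mean():.3f}")
print(f"share of significant findings that are pure noise: {false_share:.2f}")
```

With these arbitrary inputs, only about one study in twenty produces a “finding,” and most of those findings are noise; you can fiddle with the proportions, but no realistic choice turns this into a hit-the-button-and-collect routine.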

My talk at MIT this Thursday

When I was a student at MIT, there was no statistics department. I took a statistics course from Stephan Morgenthaler and liked it. (I’d already taken probability and stochastic processes back at the University of Maryland; my instructor in the latter class was Prof. Grace Yang, who was super-nice. I couldn’t follow half of what was going on in that class, but she conveyed a sense of the mathematical excitement of the topic.) Statistics was still a bit mysterious to me. I thought I’d take another class but the math department didn’t offer much statistics and I didn’t want to take something boring. I asked my advisor who recommended I talk with Prof. Chernoff who recommended I take a course at Harvard. Which I did, but that’s another story.

Anyway, it seems that MIT is starting some sort of statistics program, and they invited me to the inaugural symposium! Which is cool.

My talk is this Thurs, 14 May, at 10am in room 46-3002. I’m pretty sure this is a new building.

Here are my title and abstract:

Little Data: How Traditional Statistical Ideas Remain Relevant in a Big-Data World; or, The Statistical Crisis in Science; or, Open Problems in Bayesian Data Analysis

“Big Data” is more than a slogan; it is our modern world in which we learn by combining information from diverse sources of varying quality. But traditional statistical questions—how to generalize from sample to population, how to compare groups that differ, and whether a given data pattern can be explained by noise—continue to arise. Often a big-data study will be summarized by a little p-value. Recent developments in psychology and elsewhere make it clear that our usual statistical prescriptions, adapted as they were to a simpler world of agricultural experiments and random-sample surveys, fail badly and repeatedly in the modern world in which millions of research papers are published each year. Can Bayesian inference help us out of this mess? Maybe, but much research will be needed to get to that point.

(The title is long because I wasn’t sure which of 3 talks to give, so I thought I’d give them all.)

There’s something about humans

An interesting point came up recently. In the abstract to my psychology talk, I’d raised the question:

If we can’t trust p-values, does experimental science involving human variation just have to start over?

In the comments, Rahul wrote:

Isn’t the qualifier about human variation redundant? If we cannot trust p-values, we cannot trust p-values.

My reply:

At a technical level, a lot of the problems arise when signal is low and noise is high. Various classical methods of statistical inference perform a lot better in settings with clean data. Recall that Fisher, Yates, etc., developed their p-value-based methods in the context of controlled experiments in agriculture.

Statistics really is more difficult with humans: it’s harder to do experimentation, outcomes of interest are noisy, there’s noncompliance, missing data, and experimental subjects who can try to figure out what you’re doing and alter their responses correspondingly.
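For one concrete example of how the human element bites, here’s a minimal sketch, with made-up numbers, of what noncompliance alone does to a randomized comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical human experiment: people are randomly assigned to treatment,
# but only some of the treated actually take it up, and outcomes are noisy.
true_effect     = 1.0    # effect of actually receiving the treatment (assumed)
compliance_rate = 0.6    # share of the treated who comply (assumed)
outcome_noise   = 5.0    # outcome noise, much bigger than the effect (assumed)

assigned = rng.random(n) < 0.5
complied = assigned & (rng.random(n) < compliance_rate)
outcome  = true_effect * complied + rng.normal(0, outcome_noise, n)

# Intention-to-treat comparison: assigned vs. not assigned.
itt = outcome[assigned].mean() - outcome[~assigned].mean()
print(f"true effect of treatment received: {true_effect}")
print(f"intention-to-treat estimate:       {itt:.2f}  (roughly effect x compliance)")
```

The assigned-versus-not comparison gets diluted to roughly the true effect times the compliance rate, and the outcome noise means even that diluted signal needs a big sample to detect; a plot of fertilizer doesn’t decide to skip its treatment.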

There’s No Such Thing As Unbiased Estimation. And It’s a Good Thing, Too.


Following our recent post on econometricians’ traditional privileging of unbiased estimates, there were a bunch of comments echoing the challenge of teaching this topic, as students as well as practitioners often seem to want the comfort of an absolute standard such as best linear unbiased estimate or whatever. Commenters also discussed the tradeoff between bias and variance, and the idea that unbiased estimates can overfit the data.
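For readers who’d like to see that bias-variance tradeoff in numbers rather than take it on faith, here’s a minimal simulation sketch; the true value, the noise level, and the crude shrinkage factor are arbitrary illustration choices of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sims = 200_000

true_theta = 0.5    # small true effect (arbitrary)
noise_sd   = 2.0    # lots of noise (arbitrary)

unbiased = rng.normal(true_theta, noise_sd, n_sims)  # the unbiased estimate
shrunk   = 0.2 * unbiased                            # crude shrinkage toward zero: biased

for name, est in [("unbiased", unbiased), ("shrunk (biased)", shrunk)]:
    bias = est.mean() - true_theta
    rmse = np.sqrt(np.mean((est - true_theta) ** 2))
    print(f"{name:16s} bias = {bias:+.2f}, rmse = {rmse:.2f}")
```

The shrunken estimate is badly biased but sits much closer to the truth on average; a real Bayesian analysis would get the amount of shrinkage from a model rather than pulling 0.2 out of the air, but the tradeoff is the same.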

I agree with all these things but I just wanted to raise one more point: In realistic settings, unbiased estimates simply don’t exist. In the real world we have nonrandom samples, measurement error, nonadditivity, nonlinearity, etc etc etc.
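As one small illustration, here’s what plain old measurement error in a predictor does to the usual “unbiased” least-squares slope; the numbers are made up, but the attenuation is generic:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

true_slope = 1.0
x_true = rng.normal(0, 1, n)                    # the predictor we wish we had
y      = true_slope * x_true + rng.normal(0, 1, n)
x_obs  = x_true + rng.normal(0, 1, n)           # the predictor we actually measure

# Ordinary least-squares slope of y on the error-laden x:
slope_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
print(f"true slope: {true_slope}")
print(f"OLS slope with measurement error: {slope_hat:.2f}  (attenuated)")
```

The estimator that is unbiased in the textbook setup is biased toward zero the moment the predictor is measured with error, and no amount of data fixes it.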

So forget about it. We’re living in the real world.

P.S. Perhaps this will help. It’s my impression that many practitioners in applied econometrics and statistics think of their estimation choice kinda like this:

1. The unbiased estimate. It’s the safe choice, maybe a bit boring and maybe not the most efficient use of the data, but you can trust it and it gets the job done.

2. A biased estimate. Something flashy, maybe Bayesian, maybe not, it might do better but it’s risky. In using the biased estimate, you’re stepping off base—the more the bias, the larger your lead—and you might well get picked off.

This is not the only dimension of choice in estimation—there’s also robustness, and other things as well—but here I’m focusing on the above issue.

Anyway, to continue, if you take the choice above and combine it with the unofficial rule that statistical significance is taken as proof of correctness (in econ, this would also require demonstrating that the result holds under some alternative model specifications, but “p less than .05” is still key), then you get the following decision rule:

A. Go with the safe, unbiased estimate. If it’s statistically significant, run some robustness checks and, if the result doesn’t go away, stop.

B. If you don’t succeed with A, you can try something fancier. But . . . if you do that, everyone will know that you tried plan A and it didn’t work, so people won’t trust your finding.

So, in a sort of Gresham’s Law, all that remains is the unbiased estimate. But, hey, it’s safe, conservative, etc, right?

And that’s where the present post comes in. My point is that the unbiased estimate does not exist! There is no safe harbor. Just as we can never get our personal risks in life down to zero (despite what Gary Becker might have thought in his ill-advised comment about deaths and suicides), there is no such thing as unbiasedness. And it’s a good thing, too: recognition of this point frees us to do better things with our data right away.

On deck this week

Mon: There’s No Such Thing As Unbiased Estimation. And It’s a Good Thing, Too.

Tues: There’s something about humans

Wed: Humility needed in decision-making

Thurs: The connection between varying treatment effects and the well-known optimism of published research findings

Fri: I actually think this infographic is ok

Sat: Apology to George A. Romero

Sun: “Do we have any recommendations for priors for student_t’s degrees of freedom parameter?”

Collaborative filtering, hierarchical modeling, and . . . speed dating


Jonah Sinick posted a few things on the famous speed-dating dataset and writes:

The main element that I seem to have been missing is principal component analysis of the different rating types.

The basic situation is that the first PC is something that people are roughly equally responsive to, while people vary a lot with respect to responsiveness to the second PC, and the remaining PCs don’t play much of a role at all, so that you can just allow the coefficient of the second PC to vary.

Despite feeling like I understand the qualitative phenomenon, if I do a train/test split, the multilevel model doesn’t yield better log loss (though there are other respects in which the multilevel model yields clear improvements), and I haven’t isolated the reason. I don’t think that there’s a quick fix – I’ve run into ~5 apparently deep statistical problems in the course of thinking about this. The situation is further complicated by the fact that in this context the issues are intertwined.

And he adds:

Do you know of researchers who work at the intersection of collaborative filtering and hierarchical modeling? Googling yields some papers that seem like they might fall into this category, but in each case it would take me a while to parse what the authors are doing.
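To make the setup concrete, here’s a minimal sketch of the kind of model Sinick is describing, using a fake ratings array and statsmodels rather than the actual speed-dating data; the variable names, the simulated numbers, and the simple linear outcome are my assumptions, not his:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_raters, n_partners, n_traits = 50, 20, 6

# Fake stand-in for the speed-dating attribute ratings (attractiveness,
# fun, sincerity, ...): one row per rater-partner pair.
rater  = np.repeat(np.arange(n_raters), n_partners)
traits = rng.normal(size=(n_raters * n_partners, n_traits))

# Principal components of the trait ratings, via SVD of the centered matrix.
centered = traits - traits.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt.T
pc1, pc2 = scores[:, 0], scores[:, 1]

# Fake outcome with the structure Sinick describes: everyone responds to
# PC1 about equally, while the PC2 slope varies from rater to rater.
rater_icpt = 0.3 * rng.normal(size=n_raters)
pc2_slope  = 0.5 + 0.4 * rng.normal(size=n_raters)
liking = (rater_icpt[rater] + 1.0 * pc1 + pc2_slope[rater] * pc2
          + rng.normal(0, 1, rater.size))

df = pd.DataFrame({"rater": rater, "pc1": pc1, "pc2": pc2, "liking": liking})

# Multilevel model: common PC1 slope, rater-varying intercept and PC2 slope.
model = smf.mixedlm("liking ~ pc1 + pc2", df, groups=df["rater"], re_formula="~pc2")
print(model.fit().summary())
```

The fit should recover a common PC1 slope plus rater-level variation in the PC2 slope, which is the “let the second component’s coefficient vary” structure; whether that structure actually improves held-out log loss on the real data is, as Sinick says, another matter.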

Social networks spread disease—but they also spread practices that reduce disease

I recently posted on the sister blog regarding a paper by Jon Zelner, James Trostle, Jason Goldstick, William Cevallos, James House, and Joseph Eisenberg, “Social Connectedness and Disease Transmission: Social Organization, Cohesion, Village Context, and Infection Risk in Rural Ecuador.”

Zelner follows up:

This made me think of my favorite figure from this paper, which showed the impact of relative network position within villages on risk. Basically, less-connected households in low average-degree villages were at higher risk than more-connected households in those villages, but in high average-degree places there was no meaningful relative degree effect.

Here it is:

[Figure: household infection risk by relative network connectedness, in high vs. low average-degree villages]