
Bad Statistics: Ignore or Call Out?

Evelyn Lamb adds to the conversation that Jeff Leek and I had a few months ago. It’s a topic that’s worth returning to, in light of our continuing discussions regarding the crisis of criticism in science.

On deck this week

Mon: Bad Statistics: Ignore or Call Out?

Tues: Questions about “Too Good to Be True”

Wed: I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Thurs: Why isn’t replication required before publication in top journals?

Fri: Confirmationist and falsificationist paradigms of science

Sat: How does inference for next year’s data differ from inference for unobserved data from the current year?

Sun: Likelihood from quantiles?

We’ve got a full week of statistics for you. Welcome back to work, everyone!

On deck this month

Bad Statistics: Ignore or Call Out?

Questions about “Too Good to Be True”

I disagree with Alan Turing and Daniel Kahneman regarding the strength of statistical evidence

Why isn’t replication required before publication in top journals?

Confirmationist and falsificationist paradigms of science

How does inference for next year’s data differ from inference for unobserved data from the current year?

Likelihood from quantiles?

My talk with David Schiminovich this Wed noon: “The Birth of the Universe and the Fate of the Earth: One Trillion UV Photons Meet Stan”

Suspicious graph purporting to show “percentage of slaves or serfs in the world”

“It’s as if you went into a bathroom in a bar and saw a guy pissing on his shoes, and instead of thinking he has some problem with his aim, you suppose he has a positive utility for getting his shoes wet”

One-tailed or two-tailed

What is the purpose of a poem?

He just ordered a translation from Diederik Stapel

Six quotes from Kaiser Fung

More bad news for the buggy-whip manufacturers

They know my email but they don’t know me

What do you do to visualize uncertainty?

Sokal: “science is not merely a bag of clever tricks . . . Rather, the natural sciences are nothing more or less than one particular application — albeit an unusually successful one — of a more general rationalist worldview”

Question about data mining bias in finance

Estimating discontinuity in slope of a response function

I can’t think of a good title for this one.

Study published in 2011, followed by successful replication in 2003 [sic]

My talk at the University of Michigan on Fri 25 Sept

I’m sure that my anti-Polya attitude is completely unfair

Waic for time series

MA206 Program Director’s Memorandum

“An exact fishy test”

People used to send me ugly graphs, now I get these things

If you do an experiment with 700,000 participants, you’ll (a) have no problem with statistical significance, (b) get to call it “massive-scale,” (c) get a chance to publish it in a tabloid top journal. Cool!

Carrie McLaren was way out in front of the anti-Gladwell bandwagon

Avoiding model selection in Bayesian social research

One of my favorites, from 1995.

Don Rubin and I argue with Adrian Raftery. Here’s how we begin:

Raftery’s paper addresses two important problems in the statistical analysis of social science data: (1) choosing an appropriate model when so much data are available that standard P-values reject all parsimonious models; and (2) making estimates and predictions when there are not enough data available to fit the desired model using standard techniques.

For both problems, we agree with Raftery that classical frequentist methods fail and that Raftery’s suggested methods based on BIC can point in better directions. Nevertheless, we disagree with his solutions because, in principle, they are still directed off-target and only by serendipity manage to hit the target in special circumstances. Our primary criticisms of Raftery’s proposals are that (1) he promises the impossible: the selection of a model that is adequate for specific purposes without consideration of those purposes; and (2) he uses the same limited tool for model averaging as for model selection, thereby depriving himself of the benefits of the broad range of available Bayesian procedures.

Despite our criticisms, we applaud Raftery’s desire to improve practice by providing methods and computer programs for all to use and applying these methods to real problems. We believe that his paper makes a positive contribution to social science, by focusing on hard problems where standard methods can fail and exposing failures of standard methods.

We follow up with sections on:
- “Too much data, model selection, and the example of the 3x3x16 contingency table with 113,556 data points”
- “How can BIC select a model that does not fit the data over one that does”
- “Not enough data, model averaging, and the example of regression with 15 explanatory variables and 47 data points.”
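The first problem in the excerpt, that with enough data standard P-values reject every parsimonious model, can be illustrated with a toy calculation. This sketch is not from Raftery's paper or our discussion. It assumes a binomial proportion whose observed value (a made-up 0.505) sits slightly off a null value of 0.5, and it uses a normal-approximation z-test at several sample sizes:

```python
import math

def two_sided_p_value(observed_prop, null_prop, n):
    """Normal-approximation z-test p-value for a binomial proportion."""
    standard_error = math.sqrt(null_prop * (1 - null_prop) / n)
    z = (observed_prop - null_prop) / standard_error
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny discrepancy (0.505 vs 0.5, hypothetical numbers), growing sample size.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(0.505, 0.5, n)
    print(f"n = {n:>9,}: p-value = {p:.3g}")
```

The same half-percentage-point discrepancy is unremarkable at n = 100 and overwhelmingly "significant" at n = 1,000,000, even though its practical importance has not changed.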

And here’s something we found on the web with Raftery’s original article, our discussion and other discussions, and Raftery’s reply. Enjoy.

When we talk about the “file drawer,” let’s not assume that an experiment can easily be characterized as producing strong, mixed, or weak results

Neil Malhotra:

I thought you might be interested in our paper [the paper is by Annie Franco, Neil Malhotra, and Gabor Simonovits, and the link is to a news article by Jeffrey Mervis], forthcoming in Science, about publication bias in the social sciences given your interest and work on research transparency.

Basic summary: We examined studies conducted as part of the Time-sharing Experiments for the Social Sciences (TESS) program, where: (1) we have a known population of conducted studies (some published, some unpublished); and (2) all studies exceed a quality threshold as they go through peer review. We found that having null results made experiments 40 percentage points less likely to be published and 60 percentage points less likely to even be written up.

My reply:

Here’s a funny bit from the news article: “Stanford political economist Neil Malhotra and two of his graduate students . . .” You know you’ve hit the big time when you’re the only author who gets mentioned in the news story!

More seriously, this is great stuff. I would only suggest that, along with the file drawer, you remember the garden of forking paths. In particular, I’m not so sure about the framing in which an experiment can be characterized as producing “strong results,” “mixed results,” or “null results.” Whether a result is strong or not would seem to depend on how the data are analyzed, and the point of the forking paths is that with a given dataset it is possible for noise to appear as a strong result. I gather from the news article that TESS is different in that any given study is focused on a specific hypothesis, but even so I would think there is a bit of flexibility in how the data are analyzed and a fair number of potentially forking paths. For example, the news article mentions “whether voters tend to favor legislators who boast of bringing federal dollars to their districts over those who tout a focus on policy matters.” But of course this could be studied in many different ways.

In short, I think this is important work you have done, and I just think that we should go beyond the “file drawer” because I fear that this phrase lends too much credence to the idea that a reported p-value is a legitimate summary of a study.
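To see how analysis flexibility can make noise look strong, here is a crude simulation. It is only a stand-in for the forking-paths idea: it assumes the "paths" are independent outcome measures and that the smallest p-value is the one reported, whereas real analysis choices are correlated and often made implicitly. All function names and settings are hypothetical.

```python
import math
import random

def p_value_two_groups(a, b):
    """Two-sided z-test p-value for a difference in means of two
    equal-sized groups, with unit variance assumed known."""
    n = len(a)
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def share_significant(num_paths, n_per_group=20, num_sims=1000, alpha=0.05, seed=0):
    """Fraction of pure-noise experiments in which at least one of
    num_paths analyses reaches p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_sims):
        p_values = []
        for _ in range(num_paths):
            # No true effect: both groups come from the same distribution.
            treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(p_value_two_groups(treated, control))
        if min(p_values) < alpha:
            hits += 1
    return hits / num_sims

for k in (1, 10):
    print(f"{k:>2} available analyses: share with p < 0.05 = {share_significant(k):.2f}")
```

With one pre-specified analysis the share should sit near the nominal 5%; with ten independent analyses to choose from, theory gives 1 - 0.95^10, or about 40%, and the simulation should land near that.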

P.S. There’s also a statistical issue that every study is counted only once, as either a 1 (published) or 0 (unpublished). If Bruno Frey ever gets involved, you’d have to have a system where any result gets a number from 0 to 5, representing the number of different times it’s published.

Pre-election survey methodology: details from nine polling organizations, 1988 and 1992

This one from 1995 (with D. Stephen Voss and Gary King) was fun. For our “Why are American Presidential election campaign polls so variable when votes are so predictable?” project a few years earlier, Gary and I had analyzed individual-level survey responses from 60 pre-election polls that had been conducted by several different polling organizations. We wanted to know exactly what went on in these surveys but it was hard to learn anything much at all from the codebooks or the official descriptions of the polls. So Voss, a student of Gary’s who had experience as a journalist, contacted the polling organizations and got them to cough up lots of information, which we reported in this paper, along with some simple analyses indicating the effects of weighting. All the surveys had serious nonresponse issues, of course, but some dealt with them during the data-collection process while others did more of the adjustment using weights.

By the way, the paper has a (small) error. The two outlying “h” points in Figure 1b are a mistake. I can’t remember what we did wrong, but I do remember catching the mistake; I think it was before publication but too late for the journal to fix it. The actual weighted results for the Harris polls are not noticeably different from those of the other surveys at those dates.

Polling has changed in the past twenty years, but I think this paper is still valuable, partly in giving a sense of the many different ways that polling organizations can attempt to get a representative sample, and partly as a convenient way to shoot down the conventional textbook idea of survey weights as inverse selection probabilities. (Remember, survey weighting is a mess.)
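For reference, the textbook idea being shot down looks like the sketch below: each respondent is weighted by the inverse of a known selection probability. The function name, responses, and probabilities are all made up for illustration. In a real poll the probabilities are not known once nonresponse enters, which is why actual weighting schemes differ from this.

```python
def inverse_probability_weighted_mean(responses, selection_probs):
    """Weighted mean with weights 1/p_i (the textbook idealization)."""
    weights = [1.0 / p for p in selection_probs]
    weighted_total = sum(w * y for w, y in zip(weights, responses))
    return weighted_total / sum(weights)

# Hypothetical sample: two hard-to-reach respondents (p = 0.1) who answer 1,
# four easy-to-reach respondents (p = 0.5) who answer 0.
responses = [1, 1, 0, 0, 0, 0]
selection_probs = [0.1, 0.1, 0.5, 0.5, 0.5, 0.5]

unweighted = sum(responses) / len(responses)
weighted = inverse_probability_weighted_mean(responses, selection_probs)
print(f"unweighted mean = {unweighted:.3f}, weighted mean = {weighted:.3f}")
```

The weighted estimate moves toward the underrepresented group, but only because the selection probabilities were assumed known; the polling organizations in the paper had no such luxury.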

One of the worst infographics ever, but people don’t care?

This post is by Phil Price.

Perhaps prompted by the ALS Ice Bucket Challenge, this infographic has been making the rounds:
[Infographic: disease deaths and dollars spent]

I think this is one of the worst I have ever seen. I don’t know where it came from, so I can’t give credit/blame where it’s due.

Let’s put aside the numbers themselves – I haven’t checked them, for one thing, and I’d also say that for this comparison one would be most interested in (government money plus donations) rather than just donations — and just look at this as an information display. What are some things I don’t like about it? Jeez, I hardly know where to begin.

1. It takes a lot of work to figure it out. (a) You have to realize that each color is associated with a different cause; my initial thought was that the top circles represent deaths and dollars for the first cause, the second circles are for the second cause, etc. (b) Even once you’ve realized what is being displayed, and how, you pretty much have to go disease by disease to see what is going on; there’s no way to grok the whole pattern at once. (c) Other than pink for breast cancer and maybe red for AIDS, none of the color mappings are standardized in any sense, so you have to keep referring back to the legend at the top. (d) It’s not obvious (and I still don’t know) if the amount of “money raised” for a given cause refers only to the specific fundraising vehicle mentioned in the legend for each disease. It’s hard to believe they would do it that way, but maybe they do.
2. Good luck if you’re colorblind.
3. Maybe I buried the lede by putting this last: did you catch the fact that the area of the circle isn’t the relevant parameter? Take a look at the top two circles on the left. The upper one should be less than twice the size of the second one. It looks like they made the diameter of the circle proportional to the quantity, rather than the area; a classic way to mislead with a graphic.
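The circle-size problem in item 3 comes down to a square root. Here is a minimal sketch with hypothetical function names and a made-up pair of values, one twice the other; it is not based on the infographic's actual numbers:

```python
import math

def radius_for_area_encoding(value, reference_value, reference_radius):
    """Radius that makes the circle's AREA proportional to the value."""
    return reference_radius * math.sqrt(value / reference_value)

def radius_for_diameter_encoding(value, reference_value, reference_radius):
    """Radius that makes the circle's DIAMETER proportional to the value
    (the misleading choice)."""
    return reference_radius * value / reference_value

def apparent_area_ratio(radius_a, radius_b):
    """How many times larger circle A looks than circle B, by area."""
    return (radius_a / radius_b) ** 2

# A value twice as large as the reference (reference radius = 1).
good = radius_for_area_encoding(2, 1, 1)
bad = radius_for_diameter_encoding(2, 1, 1)
print(f"area encoding:     looks {apparent_area_ratio(good, 1):.1f}x as big")
print(f"diameter encoding: looks {apparent_area_ratio(bad, 1):.1f}x as big")
```

Scaling the diameter makes a quantity that is twice as large appear four times as large, since area grows with the square of the radius.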

At a bare minimum, this graphic could be improved by (a) fixing the terrible mistake with the sizes of the circles, (b) putting both columns in the same order (that is, first row is one disease, second row is another, etc.), and (c) taking advantage of the new ordering to label each row so you don’t need the legend. This would also make it much easier to see the point the display is supposed to make.

As a professional data analyst I’d rather just see a scatterplot of money vs deaths, but I know a lot of people don’t understand scatterplots. I can see the value of using circle sizes for a general audience. But I can’t see how anyone could like this graphic. Yet three of my friends (so far) have posted it on Facebook, with nary a criticism of the display.

[Note added the next day:
The graphic is even worse than I thought. As several people have pointed out, my suspicion is true: the numbers do not show the total donations to fight the diseases listed, they show only the donations to a single organization. For instance, according to the legend the pink color represents donations to fight breast cancer, but the number is not for breast cancer as a whole, it's only for Komen Race for the Cure.

If they think people are interested in contributions to only a single charity in each category --- which seems strange, but let's assume that's what they want and just look at the display --- then they need a title that is much less ambiguous, and the labels need to emphasize the charity and not the disease.]


Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

From 1994. I don’t have much to say about this one. The paper I was discussing (by Samuel Merrill) had already been accepted by the journal—I might even have been a referee, in which case the associate editor had decided to accept the paper over my objections—and the editor gave me the opportunity to publish this dissent which appeared in the same issue with Merrill’s article.

I like the discussion, and it includes some themes that keep showing up: the idea that modeling is important and you need to understand what your model is doing to the data. It’s not enough to just interpret the fitted parameters as is, you need to get in there, get your hands dirty, and examine all aspects of your fit, not just the parts that relate to your hypotheses of interest.

There is a continuity between the criticisms I made of that paper in 1994 and our recent criticisms of some applied models, for example that regression estimate of the health effects of air pollution in China.

Dave Blei course on Foundations of Graphical Models


Dave Blei writes:

This course is cross listed in Computer Science and Statistics at Columbia University.

It is a PhD level course about applied probabilistic modeling. Loosely, it will be similar to this course.

Students should have some background in probability, college-level mathematics (calculus, linear algebra), and be comfortable with computer programming.

The course is open to PhD students in CS, EE and Statistics. However, it is appropriate for quantitatively-minded PhD students across departments. Please contact me [Blei] if you are a PhD student who is interested, but cannot register.

Research in probabilistic graphical models has forged connections between signal processing, statistics, machine learning, coding theory, computational biology, natural language processing, computer vision, and many other fields. In this course we will study the basics and the state of the art, with an eye on applications. By the end of the course, students will know how to develop their own models, compute with those models on massive data, and interpret and use the results of their computations to solve real-world problems.

Looks good to me!

Review of “Forecasting Elections”

From 1993. The topic of election forecasting sure gets a lot more attention than it used to! Here are some quotes from my review of that book by Michael Lewis-Beck and Tom Rice:

Political scientists are aware that most voters are consistent in their preferences, and one can make a good guess just looking at the vote counts in the previous election.

Objective analysis of a few columns of numbers can regularly outperform pundits who use inside knowledge.

The rationale for forecasting electoral vote directly . . . is mistaken.

The book’s weakness is its unquestioning faith in linear regression . . . We should always be suspicious of any grand claims made about a linear regression with five parameters and only 11 data points. . . .

Funny that I didn’t suggest the use of informative prior distributions. Only recently have I been getting around to this point.
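A back-of-the-envelope calculation shows why a linear regression with five parameters and 11 data points deserves suspicion. This is a standard textbook result, not something from the review. It assumes Gaussian errors, that the five parameters include the intercept (so p = 5, n = 11), and that none of the predictors has any relation to the outcome:

```latex
R^2 \sim \mathrm{Beta}\!\left(\frac{p-1}{2},\; \frac{n-p}{2}\right),
\qquad
\mathrm{E}\!\left[R^2\right] = \frac{p-1}{n-1} = \frac{5-1}{11-1} = 0.4 .
```

Under those assumptions, pure noise would be expected to "explain" about 40% of the variance, so a high R-squared from such a fit says little on its own.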

And more:

The fact that U.S. elections can be successfully forecast with little effort, months ahead of time, has serious implications for our understanding of politics. In the short term, improved predictions will lead to more sophisticated campaigns, focusing more than ever on winnable races and marginal states.