
One of the worst infographics ever, but people don’t care?

This post is by Phil Price.

Perhaps prompted by the ALS Ice Bucket Challenge, this infographic has been making the rounds:
[Infographic: disease deaths and dollars spent]

I think this is one of the worst I have ever seen. I don’t know where it came from, so I can’t give credit/blame where it’s due.

Let’s put aside the numbers themselves — I haven’t checked them, for one thing, and I’d also say that for this comparison one would be most interested in (government money plus donations) rather than just donations — and just look at this as an information display. What are some things I don’t like about it? Jeez, I hardly know where to begin.

1. It takes a lot of work to figure it out. (a) You have to realize that each color is associated with a different cause — my initial thought was that the top circles represent deaths and dollars for the first cause, the second circles are for the second cause, etc. (b) Even once you’ve realized what is being displayed, and how, you pretty much have to go disease by disease to see what is going on; there’s no way to grok the whole pattern at once. (c) Other than pink for breast cancer and maybe red for AIDS, none of the color mappings are standardized in any sense, so you have to keep referring back to the legend at the top. (d) It’s not obvious (and I still don’t know) if the amount of “money raised” for a given cause refers only to the specific fundraising vehicle mentioned in the legend for each disease. It’s hard to believe they would do it that way, but maybe they do.
2. Good luck if you’re colorblind.
3. Maybe I buried the lede by putting this last: did you catch the fact that the area of the circle isn’t the relevant parameter? Take a look at the top two circles on the left. The upper one should be less than twice the size of the second one, but it looks far bigger than that. It looks like they made the diameter of the circle proportional to the quantity, rather than the area; a classic way to mislead with a graphic.
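
To see how much the diameter-vs-area mistake distorts things, here’s a quick arithmetic check in R (the 1.8 ratio is a made-up illustrative number, not taken from the infographic):

# Circle-size distortion check (illustrative numbers only).
# Suppose quantity a is 1.8 times quantity b.
a <- 1.8
b <- 1.0

# Wrong: diameter proportional to quantity, so the ratio of the areas
# gets squared and the difference looks much bigger than it is.
(a/b)^2               # 3.24

# Right: area proportional to quantity, so the radius should scale
# as the square root of the quantity.
(sqrt(a)/sqrt(b))^2   # 1.8, as it should be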

At a bare minimum, this graphic could be improved by (a) fixing the terrible mistake with the sizes of the circles, (b) putting both columns in the same order (that is, first row is one disease, second row is another, etc.), and (c) taking advantage of the new ordering to label each row so you don’t need the legend. This would also make it much easier to see the point the display is supposed to make.

As a professional data analyst I’d rather just see a scatterplot of money vs deaths, but I know a lot of people don’t understand scatterplots. I can see the value of using circle sizes for a general audience. But I can’t see how anyone could like this graphic. Yet three of my friends (so far) have posted it on Facebook, with nary a criticism of the display.
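
Here’s a minimal sketch of the display I have in mind, in R, with made-up placeholder numbers (I have not checked the infographic’s figures):

# Money raised vs. deaths, one labeled point per disease, on log scales.
# All values below are hypothetical placeholders, not the real data.
disease <- c("Heart disease", "Breast cancer", "ALS")
deaths  <- c(600000, 41000, 7000)
money   <- c(54e6, 257e6, 23e6)
plot(deaths, money, log="xy", pch=19,
     xlab="Deaths per year", ylab="Dollars raised")
text(deaths, money, labels=disease, pos=4)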


Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

From 1994. I don’t have much to say about this one. The paper I was discussing (by Samuel Merrill) had already been accepted by the journal—I might even have been a referee, in which case the associate editor had decided to accept the paper over my objections—and the editor gave me the opportunity to publish this dissent which appeared in the same issue with Merrill’s article.

I like the discussion, and it includes some themes that keep showing up: the idea that modeling is important and you need to understand what your model is doing to the data. It’s not enough to just interpret the fitted parameters as is; you need to get in there, get your hands dirty, and examine all aspects of your fit, not just the parts that relate to your hypotheses of interest.
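
Here’s a minimal sketch of what that kind of checking can look like (my illustration, not an example from the 1994 discussion): simulate replicated data from the fitted model and compare a feature the fit doesn’t directly target with the real data.

# Fit a normal model to skewed data, then check whether replicated
# datasets from the fit reproduce the observed maximum, a feature the
# fitted parameters don't directly target.
set.seed(42)
y <- rexp(100)                 # stand-in "real" data (skewed)
mu <- mean(y)
sigma <- sd(y)
max_rep <- replicate(1000, max(rnorm(100, mu, sigma)))
mean(max_rep >= max(y))        # near 0 here: the fit can't produce
                               # maxima as large as the data's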

There is a continuity between the criticisms I made of that paper in 1994 and our recent criticisms of some applied models, for example the regression estimate of the health effects of air pollution in China.

Dave Blei course on Foundations of Graphical Models


Dave Blei writes:

This course is cross-listed in Computer Science and Statistics at Columbia University.

It is a PhD level course about applied probabilistic modeling. Loosely, it will be similar to this course.

Students should have some background in probability, college-level mathematics (calculus, linear algebra), and be comfortable with computer programming.

The course is open to PhD students in CS, EE and Statistics. However, it is appropriate for quantitatively-minded PhD students across departments. Please contact me [Blei] if you are a PhD student who is interested, but cannot register.

Research in probabilistic graphical models has forged connections between signal processing, statistics, machine learning, coding theory, computational biology, natural language processing, computer vision, and many other fields. In this course we will study the basics and the state of the art, with an eye on applications. By the end of the course, students will know how to develop their own models, compute with those models on massive data, and interpret and use the results of their computations to solve real-world problems.

Looks good to me!

Review of “Forecasting Elections”

From 1993. The topic of election forecasting sure gets a lot more attention than it used to! Here are some quotes from my review of that book by Michael Lewis-Beck and Tom Rice:

Political scientists are aware that most voters are consistent in their preferences, and one can make a good guess just looking at the vote counts in the previous election.

Objective analysis of a few columns of numbers can regularly outperform pundits who use inside knowledge.

The rationale for forecasting electoral vote directly . . . is mistaken.

The book’s weakness is its unquestioning faith in linear regression . . . We should always be suspicious of any grand claims made about a linear regression with five parameters and only 11 data points. . . .
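
To make the five-parameters-and-11-data-points worry concrete, here’s a small simulation sketch (mine, not from the review): regress pure noise on four irrelevant predictors with n = 11 and the fit routinely looks impressive.

# With 11 observations and a five-parameter regression (intercept plus
# four predictors), even pure noise appears well explained on average.
set.seed(123)
n <- 11
r2 <- replicate(1000, {
  x <- matrix(rnorm(n*4), n, 4)   # four irrelevant predictors
  y <- rnorm(n)                   # outcome is pure noise
  summary(lm(y ~ x))$r.squared
})
mean(r2)   # around 0.4: 40% of the "variance explained" by nothing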

Funny that I didn’t suggest the use of informative prior distributions. Only recently have I been getting around to this point.

And more:

The fact that U.S. elections can be successfully forecast with little effort, months ahead of time, has serious implications for our understanding of politics. In the short term, improved predictions will lead to more sophisticated campaigns, focusing more than ever on winnable races and marginal states.

Discussion of “Maximum entropy and the nearly black object”

From 1992. It’s a discussion of a paper by Donoho, Johnstone, Hoch, and Stern. As I summarize:

Under the “nearly black” model, the normal prior is terrible, the entropy prior is better and the exponential prior is slightly better still. (An even better prior distribution for the nearly black model would combine the threshold and regularization ideas by mixing a point mass at 0 with a proper distribution on [0, infinity].) Knowledge that an image is nearly black is strong prior information that is not included in the basic maximum entropy estimate.
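
Here’s a minimal sketch of that parenthetical suggestion (the exponential slab and the 0.95 mixing weight are arbitrary choices of mine, just for illustration):

# Spike-and-slab prior for a "nearly black" image: each pixel is exactly
# zero with probability p0, otherwise drawn from a proper distribution
# on the positive reals (an exponential, as one concrete choice).
rnearlyblack <- function(n, p0 = 0.95, rate = 1) {
  ifelse(runif(n) < p0, 0, rexp(n, rate))
}
pixels <- rnearlyblack(1e4)
mean(pixels == 0)   # about 0.95: most pixels are exactly black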

Overall I liked the Donoho et al. paper but I was a bit disappointed in their response to me. To be fair, the paper had lots of comments and I guess the authors didn’t have much time to read each one, but still I didn’t think they got my main point, which was that the Bayesian approach was a pretty direct way to get most of the way to their findings. To put it another way, that paper had a lot to offer (and of course those authors followed it up with lots of other hugely influential work) but I think there was value right away in thinking about the different estimates in terms of prior distributions, rather than treating the Bayesian approach as a sort of sidebar.

On deck this week

Mon: Discussion of “Maximum entropy and the nearly black object”

Tues: Review of “Forecasting Elections”

Wed: Discussion of “A probabilistic model for the spatial distribution of party support in multiparty elections”

Thurs: Pre-election survey methodology: details from nine polling organizations, 1988 and 1992

Fri: Avoiding model selection in Bayesian social research

Sat, Sun: You might not be aware, but the NYC Labor Day parade is not held on Labor Day, as it would interfere with everyone’s holiday plans. Instead it’s held on the following weekend.

Poker math showdown!

[Still image from a video clip of card players]

In comments, Rick Schoenberg wrote:

One thing I tried to say as politely as I could in [the book, "Probability with Texas Holdem Applications"] on p146 is that there’s a huge error in Chen and Ankenman’s “The Mathematics of Poker” which renders all the calculations and formulas in the whole last chapter wrong or meaningless or both. I’ve never received a single ounce of feedback about this though, probably because only like 2 people have ever read my whole book.

Jerrod Ankenman replied:

I haven’t read your book, but I’d be happy to know what you think is a “huge” error that invalidates “the whole last chapter” that no one has uncovered so far. (Also, the last chapter of our book contains no calculations—perhaps you meant the chapter preceding it?). If you contacted one of us about it in the past, it’s possible that we overlooked your communication, although I do try to respond to criticism or possible errors when I can. I’m easy to reach; firstname.lastname@yale.edu will work for a couple more months.

Hmmm, what’s on page 146 of Rick’s book? It comes up if you search inside the book on Amazon:

[Screenshot of the relevant passage from page 146 of Schoenberg’s book, via Amazon’s search-inside feature]

So that’s the disputed point right there. Just go to the example on page 290 where the results are normally distributed with mean and variance 1, check that R(1)=-14%, then run the simulation and check that the probability of the bankroll starting at 1 and reaching 0 or less is approximately 4%.

I went on to Amazon but couldn’t access page 290 of Chen and Ankenman’s book to check this. I did, however, program the simulation in R as I thought Rick was suggesting:

# Simulate nsims sessions of T steps each, where each step's winnings are
# drawn from normal(mu,sigma) and the bankroll starts at 1. Record the
# first time, if any, that a session's bankroll drops below zero
# (NA means the session was never ruined).
waiting <- function(mu,sigma,nsims,T){
  time_to_ruin <- rep(NA,nsims)
  for (i in 1:nsims){
    virtual_bankroll <- 1 + cumsum(rnorm(T,mu,sigma))
    if (any(virtual_bankroll<0)) {
      time_to_ruin[i] <- min((1:T)[virtual_bankroll<0])
    }
  }
  return(time_to_ruin)
}

a <- waiting(mu=1,sigma=1,nsims=10000,T=100)
print(mean(!is.na(a)))
print(table(a))

Which gave the following result:

> print(mean(!is.na(a)))
[1] 0.0409
> print(table(a))
a
  1   2   3   4   5   6   8   9 
218 107  53  13   9   7   1   1 

These results indicate that (i) the probability is indeed about 4%, and (ii) T=100 is easily enough to get the asymptotic value here.
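
As a sanity check on that 4% (my aside; this is a standard result about Brownian motion, not a formula from either book): a continuous-time Brownian motion with positive drift mu, volatility sigma, and starting bankroll b hits zero with probability exp(-2*mu*b/sigma^2), about 13.5% here. The discrete walk can only go broke at whole time steps, so it misses excursions below zero in between, and its lower 4% is at least directionally sensible.

# Continuous-time benchmark: ruin probability for Brownian motion with
# positive drift (a textbook result, not either book's formula).
ruin_bm <- function(mu, sigma, b) exp(-2*mu*b/sigma^2)
ruin_bm(mu=1, sigma=1, b=1)   # 0.135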

Actually, the first time I did this I kept getting a probability of ruin of 2% which didn't seem right--I couldn't believe Rick would've got this simple simulation wrong--but then I found the bug in my code: I'd written "cumsum(1+rnorm(T,mu,sigma))" instead of "1+cumsum(rnorm(T,mu,sigma))".

So maybe Chen and Ankenman really did make a mistake. Or maybe Rick is misinterpreting what they wrote. There's also the question of whether Chen and Ankenman's mathematical error (assuming they did make the mistake identified by Rick) actually renders all the calculations and formulas in their whole last chapter, or their second-to-last chapter, wrong or meaningless or both.

P.S. According to the caption at the YouTube site, they’re playing rummy, not poker, in the above clip. But you get the idea.

P.P.S. I fixed a typo pointed out by Juho Kokkala in an earlier version of my code.

How Many Mic’s Do We Rip

Yakir Reshef writes:

1. Our technical comment on Kinney and Atwal’s paper on MIC and equitability has come out in PNAS along with their response. Similarly to Ben Murrell, who also wrote you a note when he published a technical comment on the same work, we feel that they “somewhat missed the point.” Specifically: one statistic can be more or less equitable than another, and our claim has been that MIC is more equitable than other existing methods in a wide variety of settings. Contrary to what Kinney and Atwal write in their response (“Falsifiability or bust”), this claim is indeed falsifiable — it’s just that they have not falsified it.

2. We’ve just posted a new theoretical paper that defines both equitability and MIC in the language of estimation theory and analyzes them in that paradigm. In brief, the paper contains a proof of a formal relationship between power against independence and equitability that shows that the latter can be seen as a generalization of the former; a closed-form expression for the population value of MIC and an analysis of its properties that lends insight into aspects of the definition of MIC that distinguish it from mutual information; and new estimators for this population MIC that perform better than the original statistic we introduced.

3. In addition to our paper, we’ve also written a short FAQ for those who are interested in a brief summary of where the conversation and the literature on MIC and equitability are at this point, and what is currently known about the properties of these two objects.

PS – at your suggestion, the theory paper now has some pictures!
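
For readers who want to experiment, here is a minimal sketch using the minerva R package (one available implementation of MIC; my choice of example, not the authors’):

# MIC vs. Pearson correlation on a noiseless nonlinear relationship.
# Requires install.packages("minerva").
library(minerva)
set.seed(1)
x <- runif(500, -1, 1)
y <- x^2           # deterministic, but not linear
cor(x, y)          # Pearson correlation: near 0
mine(x, y)$MIC     # MIC: near 1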

We’ve posted on this several times before:

16 December 2011: Mr. Pearson, meet Mr. Mandelbrot: Detecting Novel Associations in Large Data Sets

26 Mar 2012: Further thoughts on nonparametric correlation measures

4 Feb 2013: Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

14 Mar 2014: The maximal information coefficient

1 May 2014: Heller, Heller, and Gorfine on univariate and multivariate information measures

7 May 2014: Once more on nonparametric measures of mutual information

I still haven’t formed a firm opinion on these things. Summarizing pairwise dependence in large datasets is a big elephant, and I guess it makes sense that different researchers who work in different application areas will have different perspectives on the problem.

Recently in the sister blog

Replication Wiki for economics

Jan Hoeffler of the University of Göttingen writes:

I have been working for the last two years on a replication project funded by the Institute for New Economic Thinking, and I have read several of your blog posts that touched on the topic.

We developed a wiki website that serves as a database of empirical studies, of the availability of replication material for them, and of replication studies.

It can help with research as well as with teaching replication to students. We have taught seminars at several faculties in which information from this database was used. In the starting phase the focus was on some leading journals in economics, and we now cover more than 1800 empirical studies and 142 replications. Replication results can be published as replication working papers of the University of Göttingen’s Center for Statistics.

Teaching and providing access to this information will raise awareness of the need for replication, provide a basis for research on why replications so often fail and how this can be changed, and educate future generations of economists about how to make research replicable.