Boxplot challenge

In response to the comments here, I say:

I have never ever seen an example where I’ve felt a boxplot was appropriate. I’m open to being convinced, but I don’t think you’ll be able to convince me. Bring on the examples!

12 thoughts on “Boxplot challenge”

  1. I know Tufte thinks they are a waste of ink and paper, and prefers the quartile plot.

    That looks like this:

    min   25%     50%     75%   max
     .     -------- --------     .

    I have no strong feelings for the method myself. However, I recall the professor using boxplots in my undergraduate ANOVA class to explain how two groups could differ in mean while the between-group variation was small relative to the within-group variation. That was done with two plots on one board, with a small difference in mean relative to the large box widths (the interquartile range, more specifically).

  2. I use boxplots when I want to compare two or more distributions. A recent example is comparison of blood pressures measured under two different conditions. In publications I usually use something else but when exploring data boxplots work fine.

  3. How are we supposed to, for instance, plot the outcome variable across several different levels, say, for each year during 10 years? In this case, histograms make it much more difficult for us to see the trend in the outcome over the levels (time, in the example). Don't you think so?

  4. I agree with the original comment: boxplots are good when n is large, just because it is hard to make sense of too many dots. I would add that boxplots are good when your primary concern is with the trend in the mean, variance and/or range. They are particularly good when you are trying to show the trend in all three at the same time, in order to contrast them, i.e. if the main point of your plot is something like (1) the mean doesn't change but the variance does or (2) the mean declines but at the same time the variance and range increase.

  5. I agree with One Eyed Man (boxplots can be used as teaching tools) and with Seth (I can see that boxplots can be ok as a quick tool, given that the better alternatives are not always available), but I completely disagree with the rest of you! More to follow . . .

  6. I hope the "more to follow" includes a graphical demonstration.

    I'd like to note that it's misleading to use boxplots when the data aren't roughly bell-shaped. Try a boxplot on deviates from a Beta(0.1,0.1) distribution and see if you can correctly interpret what you get.

  7. Cory, I've done that before. You get a long box with short whiskers. I think one problem is that we don't have much experience with non-bell-shaped distributions; if we had, and we used boxplots, it wouldn't be an issue.

    But maybe this isn't quite right. The long box from a Beta(.1,.1) distribution SEEMS to say that most of the data is inside the box, because it takes up most of the plot. In reality, of course, there is just as much data in the whiskers, but it isn't obvious. That may be the biggest problem with the boxplot. Emphasizing the middle with a box only makes sense for certain distributions.

    But still, it is a really fast way to see heterogeneity of variance when you have a lot of datapoints. I use them in data exploration.

  8. In a study I did that related number of sex partners, sex, and hardest drug used, a set of parallel box plots revealed that, while the number of partners increased with hardness of drug use for both men and women, the patterns were different.

    Could another graph have shown this? Yes.
    But a box plot did it in a simple and easily understandable way.

  9. Richard:

    Perhaps think of a good five number summary for the inverse distribution function (which Tukey apparently preferred) …

    In quadrature we know what those five good numbers should do; they depend highly on knowing the class of functions we are working with, and today we would surely go for many more than just 5.

  10. As others have pointed out already, a bimodal U-shaped distribution will produce a box plot with very short tails. Even statistically sophisticated readers often misinterpret such displays unless they can see the raw data too.

    I sometimes use a simple slogan in teaching: "if half the data are in the box, the other half must be outside!"

    Note that John Tukey included in "Exploratory data analysis" (Addison-Wesley, Reading, MA, 1977) such an example where the box plot was definitely a bad idea, Rayleigh's data that lay behind the discovery of argon.

    Howard Wainer has also made the same point with his customary clarity in "Graphical Visions from William Playfair to John Tukey", Statistical Science 5: 340-346 (1990).

    Incidentally, a meme that seems impossible to correct is the idea that Tukey invented box plots. The name, yes, and the precise rules often used for whiskers and which points are shown individually, yes. But the main idea of summarizing the middle half by a box, no. It goes back at least 75 years to work in climatology and geography. But they wouldn't be nearly so much used now without Tukey's energetic promotion.

    More importantly, it is surprising that some people seem to think that somehow data analysts must choose between a few standard graphical forms, e.g. between box plots and histograms. Why so? For example, one easy strategy is to hybridise whatever combination of graphical elements fits the purpose e.g. dot plots for detail plus boxes for summary. It's odd that many statistical researchers regard it as imperative to follow the cutting-edge in (e.g.) modelling, but then turn to utterly conservative graphics.

  11. "More importantly, it is surprising that some people seem to think that somehow data analysts must choose between a few standard graphical forms, e.g. between box plots and histograms. Why so?"

    I think this is because there's a trade-off: when we are familiar with a plot type, in general we can extract information quickly. When there is an unfamiliar plot type, we must spend time deciphering it. The rewards might be great if we spend the time deciphering a good plot, but even the best plots sometimes aren't interpretable right away.
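
The Beta(0.1, 0.1) experiment suggested in comments 6, 7 and 10 can be tried without drawing anything. The following is a minimal sketch in Python, using only the standard library. It computes the five numbers a boxplot would display for a simulated U-shaped sample, then compares how much of the plotted range the box covers with how much of the data actually falls inside it. The helper names (`five_number_summary`, `box_share_of_range`), the sample size and the seed are illustrative choices, not anything from the thread, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the rule a given plotting package applies.

```python
import random
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a basic boxplot draws."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def box_share_of_range(data):
    """Fraction of the min-to-max range that the box (Q1 to Q3) occupies."""
    lo, q1, _, q3, hi = five_number_summary(data)
    return (q3 - q1) / (hi - lo)


# Assumed setup: 10,000 draws from Beta(0.1, 0.1) with a fixed seed.
rng = random.Random(0)
sample = [rng.betavariate(0.1, 0.1) for _ in range(10_000)]

lo, q1, median, q3, hi = five_number_summary(sample)
box_share = box_share_of_range(sample)

# Fraction of observations that actually lie inside the box.
inside_box = sum(q1 <= x <= q3 for x in sample) / len(sample)

print(f"min={lo:.4f} Q1={q1:.4f} median={median:.4f} Q3={q3:.4f} max={hi:.4f}")
print(f"box covers {box_share:.1%} of the plotted range")
print(f"box contains {inside_box:.1%} of the observations")
```

Under these assumptions the box should span nearly the whole range while holding only about half of the observations, which is the mismatch comment 7 describes: the other half of the data is squeezed into two very short whiskers near 0 and 1.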

Comments are closed.