pretty() ain’t always so pretty

As part of his continuing campaign to destroy my productivity, Aleks sporadically sends me emails such as the following: the subject line was “algorithm for graph labels” and the message read, in its entirety:

http://books.google.com/books?id=fvA7zLEFWZgC&pg=PA61&lpg=PA61#v=onepage&q=&f=false

I had to look, and what I found was a three-page article by someone named Paul Heckbert, published in a 1990 book called Graphics Gems. The article was called “Nice numbers for graph labels” and, through the magic of Google Books + Print-Screen + Paint + Movable Type, I’m able to share it with you:

heckbert1.png

heckbert2.png

heckbert3.png

Aleks knows this would interest me because he’s always hearing me complain about the R graphics defaults: The tick marks are too long, the axis labels are too far from the axes, there are typically too many tick marks (for my taste) on the graphs, blah blah blah. A bunch of unfair complaints, given that I get R for free, but complaints nonetheless. I see right away that Heckbert (at least, as of 1990) also had the problem of long tick marks, but of course that’s trivial. What’s more relevant is the rule for setting up where the numbers go on the axis.

R does this using the pretty() function. What exactly does pretty() do, I wondered? Back from the days when R was S, I learned that the way to learn what a function does is to type its name in the console. So I’ll give that a try:


> pretty
function (x, n = 5, min.n = n%/%3, shrink.sml = 0.75, high.u.bias = 1.5,
    u5.bias = 0.5 + 1.5 * high.u.bias, eps.correct = 0)
{
    x <- as.numeric(x)
    if (length(x) == 0L)
        return(x)
    x <- x[is.finite(x)]
    if (is.na(n <- as.integer(n[1L])) || n < 0L)
        stop("invalid 'n' value")
    if (!is.numeric(shrink.sml) || shrink.sml <= 0)
        stop("'shrink.sml' must be numeric > 0")
    if ((min.n <- as.integer(min.n)) < 0 || min.n > n)
        stop("'min.n' must be non-negative integer <= n")
    if (!is.numeric(high.u.bias) || high.u.bias < 0)
        stop("'high.u.bias' must be non-negative numeric")
    if (!is.numeric(u5.bias) || u5.bias < 0)
        stop("'u5.bias' must be non-negative numeric")
    if ((eps.correct <- as.integer(eps.correct)) < 0L || eps.correct > 2L)
        stop("'eps.correct' must be 0, 1, or 2")
    z <- .C("R_pretty", l = as.double(min(x)), u = as.double(max(x)),
        n = n, min.n, shrink = as.double(shrink.sml),
        high.u.fact = as.double(c(high.u.bias, u5.bias)),
        eps.correct, DUP = FALSE, PACKAGE = "base")
    s <- seq.int(z$l, z$u, length.out = z$n + 1)
    if (!eps.correct && z$n) {
        delta <- diff(range(z$l, z$u))/z$n
        if (any(small <- abs(s) < 1e-14 * delta))
            s[small] <- 0
    }
    s
}

Damn. That didn’t work. I’d briefly forgotten that a modern R function looks like this:

1. Lots and lots of exception-handling, handshaking, data-frame-handling, and general paperwork.

2. A call to the C or Fortran program that does the real work.

But I have other recourses. Let’s try Googling “R pretty.” No, that doesn’t work (you can try it yourself and see). Neither does “cran pretty,” “cran pretty(),” or anything else I can think of.

But wait! There’s the online help function! Just type “?pretty” from the console and you get a nice man page (as we used to say). Here it is:

Let d <- max(x) - min(x) >= 0. If d is not (very close) to 0, we let c <- d/n, otherwise more or less c <- max(abs(range(x)))*shrink.sml / min.n. Then, the 10 base b is 10^(floor(log10(c))) such that b <= c < 10b. Now determine the basic unit u as one of {1,2,5,10} b, depending on c/b in [1,10) and the two 'bias' coefficients, h =high.u.bias and f =u5.bias.
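The description above can be turned into a rough sketch. This is not R's actual C routine: the function name pretty_sketch is made up, the way the two bias coefficients enter the comparisons is an assumption, and the small-range branch, min.n, and eps.correct are ignored entirely.

```r
# Simplified sketch of the unit selection described in the help text.
# Assumption: a larger candidate unit is accepted when its distance above c
# is smaller than the bias times the distance of c above the current unit.
pretty_sketch <- function(x, n = 5, high.u.bias = 1.5,
                          u5.bias = 0.5 + 1.5 * high.u.bias) {
  lo <- min(x)
  hi <- max(x)
  d <- hi - lo
  stopifnot(d > 0, n >= 1)
  cell <- d / n                  # "c" in the help text
  b <- 10^floor(log10(cell))     # power of ten with b <= c < 10b
  u <- b
  if (2 * b - cell < high.u.bias * (cell - u)) {
    u <- 2 * b
    if (5 * b - cell < u5.bias * (cell - u)) {
      u <- 5 * b
      if (10 * b - cell < high.u.bias * (cell - u)) u <- 10 * b
    }
  }
  # Ticks at multiples of the unit that enclose the data range.
  seq(floor(lo / u) * u, ceiling(hi / u) * u, by = u)
}

ticks_default <- pretty_sketch(c(105, 543))
ticks_coarse <- pretty_sketch(c(105, 543), n = 3)
print(ticks_default)
print(ticks_coarse)
```

The candidates are tried in increasing order, so a larger unit is only considered once the smaller one has already been accepted.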

I’m too lazy to read this and Heckbert’s pseudocode and compare, but they certainly seem to be doing the same thing. We can try it on Heckbert’s example:


> pretty (c(105,543))
[1] 100 200 300 400 500 600

Hey, it works! And, indeed, pretty() has an argument called “n”–the desired number of intervals–so I could just set n=3 or 4 and probably be happy. I’m not quite sure how to alter R’s plotting functions to work with a modified default parameter for pretty(), but there’s probably a way–maybe in ggplot2?

Let’s try it out:


> pretty (c(105,543), n=3)
[1] 0 200 400 600
>

Much better. I wasn’t happy with that axis starting at 100. If the data range from 105 to 543, I’d rather take that axis all the way down to 0. I’m not a Darrell Huff-style fanatic on taking the axis down to 0, but I’d prefer to include 0 (or other natural boundaries or reference points, for example 100 if you’re plotting numbers on a percentage scale, or 1.0 if you’re plotting odds ratios).
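One low-tech way to get a reference point onto the axis is to widen the range before calling pretty(). A minimal sketch, where pretty_including and its reference argument are hypothetical names rather than anything in base R:

```r
# Hypothetical wrapper: extend the data range to take in a reference value
# (0 by default) and let pretty() do the rest.
pretty_including <- function(x, reference = 0, n = 5) {
  pretty(range(c(x, reference), finite = TRUE), n = n)
}

ticks_zero <- pretty_including(c(105, 543), n = 3)
# A percentage scale, with 100 as the natural boundary.
ticks_pct <- pretty_including(c(62, 88), reference = 100)
print(ticks_zero)
print(ticks_pct)
```

This forces the reference value into the plotted range every time, which is cruder than a real preference: it offers no way to trade the reference point off against wasted space.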

I’m pretty sure that the next generation of pretty() or its equivalent should have a slightly more elaborate objective function to allow a preference for inclusion of special points such as 0.

More generally, I suspect it’s helpful to think of this sort of “AI”-like task as a statistical inference problem, or as a minimization problem, rather than to frame it as a search for an algorithm. I mean, sure, it all comes down to an algorithm at some point, but the inference and minimization frameworks seem better to me–more flexible and more direct–than the approach of going straight for an algorithm.
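To make the minimization framing concrete, here is a toy version. Everything in it is an assumption made for illustration: the function name choose_ticks, the three penalty terms, and the weights. None of it comes from pretty() or from Heckbert's article.

```r
# Illustrative only: score candidate tick sets and keep the cheapest one.
# Penalties (all assumed): distance from the desired number of intervals,
# fraction of the axis not covered by data, and missing a special point.
choose_ticks <- function(x, target_n = 4, special = 0,
                         w_count = 1, w_waste = 2, w_special = 1) {
  lo <- min(x)
  hi <- max(x)
  stopifnot(hi > lo)
  base <- 10^floor(log10(hi - lo))
  # Candidate step sizes: 1, 2, 5 times two neighboring powers of ten.
  steps <- as.vector(outer(c(1, 2, 5), base * c(0.1, 1)))
  best <- NULL
  best_score <- Inf
  for (step in steps) {
    # Two candidate axes per step: one covering the data only,
    # one stretched to cover the special point as well.
    for (target in list(c(lo, hi), range(c(lo, hi, special)))) {
      ticks <- seq(floor(target[1] / step) * step,
                   ceiling(target[2] / step) * step, by = step)
      n_intervals <- length(ticks) - 1
      waste <- 1 - (hi - lo) / (max(ticks) - min(ticks))
      misses_special <- !any(abs(ticks - special) < 1e-9 * step)
      score <- w_count * abs(n_intervals - target_n) +
        w_waste * waste + w_special * misses_special
      if (score < best_score) {
        best_score <- score
        best <- ticks
      }
    }
  }
  best
}

ticks <- choose_ticks(c(105, 543))
print(ticks)
```

With these made-up weights the (105, 543) example comes out as 0, 200, 400, 600. The point is not the particular weights but that a preference such as "include 0 if it is cheap to do so" becomes one more term in the score rather than a special case in the algorithm.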

I suspect the above point is very well understood in computer science, but I also suspect it bears repeating. I say this because statisticians are certainly aware of the benefits of framing decision problems as inference problems, but we still sometimes slip into a lazy algorithmic mode of thinking when we’re not careful. Almost always, it’s better to ask “what are we estimating?” or, at the very least, “what are we trying to minimize?”, rather than jumping to “what do we want our answer to look like.” I think there’s more to be said on this point, but rather than try to come up with it all myself from scratch, I’ll let youall fill me in on the relevant literature.

There’s also some other weird thing that goes on in R, where it will put tick marks at, say, 10,12,14,16,18 rather than 10,15,20. The problem here, I think, is a combination of (a) too many intervals as a default setting, and (b) an idea that numbers divisible by 2 are as clean to interpret as numbers divisible by 5. I don’t think the latter assumption is correct. To me when reading a graph, 10,15,20 is much much easier to scan than 10,12,14,16,18.
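A quick way to see the effect of the default number of intervals is to call pretty() directly on a range like this one. The range c(10, 18) is chosen here only for illustration. Asking for fewer intervals should push the unit from 2 up to 5, though the exact ticks depend on the bias settings.

```r
x <- c(10, 18)
default_ticks <- pretty(x)         # default n = 5
coarse_ticks <- pretty(x, n = 2)   # ask for roughly two intervals
print(default_ticks)
print(coarse_ticks)
```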

P.S. OK, since we’re on the topic of R defaults, howzabout this one:


> a <- 1:5
> hist (a)

Which produces the following:

histo.png

You notice anything funny about this? Oh, yeah:

1. The data are uniformly distributed but the histogram isn’t.
2. The histogram is virtually impossible to read because all of the data fall between the histogram bars.

OK, sure, histograms can’t be perfect. But this isn’t an isolated case. We deal with integer data all the time, and it’s not good to have a default that fails in these settings. There’s always a question of how complicated you want a histogram function to be, but I’d think that with integer data it would be a good idea to use integers as the centers of the bars rather than as their boundaries.
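A minimal sketch of the integer-centered version, using only hist() arguments that already exist. The half-integer breaks are supplied by hand here, since there is no built-in switch for this that I know of. plot = FALSE is used so the bin counts can be inspected directly.

```r
a <- 1:5

# Default breaks: the integers land on bin boundaries.
h_default <- hist(a, plot = FALSE)

# Half-integer breaks: each integer sits at the center of its own bar.
h_centered <- hist(a, breaks = seq(min(a) - 0.5, max(a) + 0.5, by = 1),
                   plot = FALSE)

print(h_default$counts)
print(h_centered$counts)
print(h_centered$mids)
# plot(h_centered) draws the centered histogram.
```

Comparing the two sets of counts shows where the lopsided picture comes from: with the default breaks the first bin is closed on both ends and so picks up two of the five values.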

P.P.S. Usually I’d put most of this sort of long, technical, code-filled entry under the fold, but there was something about the pointlessness of all of this . . . I couldn’t resist splashing it all over the front page.

12 thoughts on “pretty() ain’t always so pretty”

  1. "Slicking up" to less transparency is likely a very common thing.

    In early Mathematica you could see in the code how the problem was being symbolically solved – that soon ended – getting more correct answers trumped the transparency.

    Don Rubin suggested that the earliest papers on a topic be sought out and read to get at the motivations before the writing got slicked up enough only other experts could discern it.

    Reproducible research and compendiums try to unwrap this somewhat (or at least hide the distracting details until they are asked for)

    K

  2. This was helpful. Thanks.

    I'm inclined to see graphics that don't start at zero without clear motivation like the odds-ratio above or where the axis doesn't have meaningful units (like utility values) as a sign that the author wants to exaggerate the size of the observed effect by increasing the apparent slope. To avoid dead space I rescale the axis instead.

    I'd love to hear more of your thoughts on this.

  3. Just tried your histogram example in Stata. The default isn't any better than in R, but Stata's -histogram- command has a -discrete- option which specifies that the variable is discrete and that you want each unique value to have its own bin:

    . histogram a, discrete

    Can't spot an equivalent in R. Is there a simpler way than something like this?

    > hist(a, breaks=0.5:(max(a)+.5))

  4. For discrete data in R, IIRC the following works:

    hist(x, breaks = min(x):(max(x) + 1), right = FALSE)

    MATLAB's behavior is also terrible. It uses 10 bins by default, so the bin centers go from 1.2 to 4.8 in intervals of 0.4. This puts one count in each of the first, third, fifth, eighth, and tenth bins. (At least they all have the same height.)

  5. Well, then, maybe I should tell you about a book I saw advertised in the New York Review of Books. The blurb suggests that reading the book can help you win lotteries. (As if.) Anyway, the author's website:
    http://www.ionsaliutheory.com/
    The title is Probability Theory, Live!
    (Yes, with an exclamation point.) (And, no, I don't expect you or anyone else to be sucker enough to buy the thing.) (But if you do win the lottery, remember where you heard about it.)

  6. The R authors insist that a histogram is a density estimator, so not suitable for discrete data.
    (no refs, but B Ripley has written somewhere to that effect). They will say you want a barplot:

    a

  7. Kjetil: Your tip is much appreciated! Shows how little of R I really know. As I played with it, I found that you could use tabulate instead of table to get zero values for unrepresented numbers.

    a <- c (1, 2, 4, 8, 2, 3)
    barplot (tabulate (a))
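    The difference between the two can be seen by printing the counts before plotting them. A small sketch using the same vector; naming the tabulate() slots is an addition here so that barplot() would label the bars. Note that tabulate() only counts positive integers.

```r
a <- c(1, 2, 4, 8, 2, 3)

counts_table <- table(a)         # only the values that actually occur
counts_tabulate <- tabulate(a)   # one slot per integer from 1 to max(a)

# Name the slots so that barplot() labels each bar with its value.
names(counts_tabulate) <- seq_along(counts_tabulate)

print(counts_table)
print(counts_tabulate)
# barplot(counts_table) or barplot(counts_tabulate) draws either version.
```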

  8. In R base graphics, you can draw a plot without axes, then add your axes with custom tickmarks afterwards. The axis() function has "at" as its second argument, which you can use to pass the value of pretty():

    x
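    The approach described in this comment might look something like the sketch below. It is not the commenter's code, which did not survive. The helper name plot_sparse_axes and its defaults are hypothetical; plot(), axis(), box(), and pretty() are the usual base R functions.

```r
# Hypothetical helper: scatterplot with fewer, pretty()-chosen tick marks.
plot_sparse_axes <- function(x, y, n = 3, ...) {
  x_at <- pretty(x, n = n)
  y_at <- pretty(y, n = n)
  # Draw the points without axes, stretching the plotting region so that
  # every tick position is visible.
  plot(x, y, axes = FALSE, xlim = range(x_at), ylim = range(y_at), ...)
  axis(1, at = x_at)   # bottom axis
  axis(2, at = y_at)   # left axis
  box()
  invisible(list(x = x_at, y = y_at))
}
```

    A call such as plot_sparse_axes(c(105, 300, 543), c(1, 2, 3)) then draws the plot and invisibly returns the tick positions it used.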
