An embarrassing question

Someone asks:

Do you have, and would you be willing to share, the code used to generate figure 1 of your paper “Let’s practice what we preach: Turning Tables Into Graphs”?

My response: I’d be happy to share the code, if I had it! The program was written by my collaborator, and I doubt he’s saved it. We do have coefplot() in the arm package, though. Beyond this, I’ve heard people say good things about ggplot2.

In any case, I think the hardest part is figuring out what graph you want to make, and the second hardest part is making all the little choices to make the display work well. Once you have the basic idea, you can sketch it out and then just draw in all the lines and points one at a time pretty easily.
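
As a rough illustration of drawing the display one element at a time, here is a minimal base R sketch. It is not the code behind the paper's figure. The percentages and bounds are the ones used in the comments below, and the function name draw_dot_display and its arguments are made up for this example.

```r
# Hypothetical helper: draws a labeled dot-and-line display piece by piece
# and returns the plotted positions invisibly.
draw_dot_display <- function(labels, pct, bound,
                             xlim = c(-3, 63), ticks = seq(0, 60, by = 20)) {
  n <- length(pct)
  ypos <- rev(seq_len(n))                  # first label goes at the top
  op <- par(mar = c(3, 6, 1, 1))           # wide left margin for the labels
  on.exit(par(op))
  plot.new()                               # empty canvas: no box, no axes
  plot.window(xlim = xlim, ylim = c(0.5, n + 0.5))
  segments(0, ypos, pct, ypos, lwd = 3)    # line from zero to each estimate
  points(pct, ypos, pch = 19)              # dot at each estimate
  text(pct - bound, ypos, "(")             # lower bound
  text(pct + bound, ypos, ")")             # upper bound
  axis(1, at = ticks, labels = paste0(ticks, "%"))          # bottom axis only
  axis(2, at = ypos, labels = labels, las = 1, tick = FALSE)
  invisible(data.frame(label = labels, y = ypos,
                       lower = pct - bound, upper = pct + bound))
}

draw_dot_display(c("Negative", "Neutral", "Positive"),
                 pct = c(23.9, 52.2, 23.9), bound = c(4, 8, 4))
```

Each call adds one kind of mark, so any single choice (line width, bound symbol, tick positions) can be changed without touching the rest.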

10 thoughts on “An embarrassing question”

  1. If you're willing to learn lattice graphics, the following might be a start. I don't know lattice well enough to eliminate the box, top tick marks, and the gray lines, but perhaps someone can quickly expand:

    library (lattice)

    a <- data.frame (name=factor (c("Negative", "Neutral", "Positive"), ordered=TRUE),
                     pct=c(23.9, 52.2, 23.9),
                     bound=c(4, 8, 4))

    # Dot at each percentage, a thick line from zero, and parentheses at the bounds
    dotplot (name ~ pct, data=a, xlim=c(-3, 63), main="Characterization", bound=a$bound,
             panel=function (x, y, bound, ...)
             {
               panel.dotplot (x, y, ..., col="black")
               panel.segments (0, y, x, y, lwd=3)
               panel.text (x - bound, y, "(")
               panel.text (x + bound, y, ")")
             })

  2. A little bit of followup that manages to eliminate the axis lines/box and the gray lines:

    library (lattice)

    a <- data.frame (name=factor (c("Negative", "Neutral", "Positive"), ordered=TRUE),
                     pct=c(23.9, 52.2, 23.9),
                     bound=c(4, 8, 4))

    # axis.line col=NA removes the box; col.line="transparent" hides the gray reference lines
    dotplot (name ~ pct, data=a, xlim=c(-3, 63), main="Characterization", bound=a$bound,
             par.settings=list(axis.line=list(col=NA)),
             panel=function (x, y, bound, ...)
             {
               panel.dotplot (x, y, ..., col="black", col.line="transparent")
               panel.segments (0, y, x, y, lwd=3)
               panel.text (x - bound, y, "(")
               panel.text (x + bound, y, ")")
             })

    If I could figure out how to do the axis.line setting within the panel function, I could make a nice panel.gelman function that you simply supply to dotplot to get the final result. I could still do this without the axis.line, but it feels a bit painful to have to add extra arguments to the dotplot in addition to a replacement panel function.
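
    A minimal sketch of what such a panel.gelman function might look like, assuming the axis.line setting is still passed to dotplot() separately. The function body only packages the panel code shown above; nothing here is an existing lattice function beyond the panel.* calls already used.

    ```r
    library(lattice)

    # Hypothetical panel function: the name panel.gelman comes from the comment
    # above, and the body reuses the panel code already shown.
    panel.gelman <- function(x, y, bound, ...) {
      panel.dotplot(x, y, ..., col = "black", col.line = "transparent")
      panel.segments(0, y, x, y, lwd = 3)   # thick line from zero to the estimate
      panel.text(x - bound, y, "(")         # lower bound
      panel.text(x + bound, y, ")")         # upper bound
    }

    a <- data.frame(name = factor(c("Negative", "Neutral", "Positive"), ordered = TRUE),
                    pct = c(23.9, 52.2, 23.9),
                    bound = c(4, 8, 4))

    # The axis.line setting still has to be given to dotplot() itself.
    p <- dotplot(name ~ pct, data = a, xlim = c(-3, 63), bound = a$bound,
                 panel = panel.gelman,
                 par.settings = list(axis.line = list(col = NA)))
    print(p)
    ```

    The extra par.settings argument remains the awkward part; the panel function on its own covers the dots, lines, and parentheses.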

  3. Here is a ggplot2 implementation. I haven't been able to get tick marks on the x-axis alone or get the x-axis segment without the y-axis segment.

    name = factor(c("Negative", "Neutral", "Positive"), levels = c("Positive", "Neutral", "Negative"))
    pct = c(.239, .522, .239)
    bound = sqrt(pct*(1-pct)/46)
    a = data.frame(name, pct, bound)

    library(ggplot2)

    ggplot(a, aes(name, pct)) +
      geom_pointrange(aes(ymin = 0, ymax = pct)) +       # dot plus line from zero
      geom_text(aes(y = pct - bound, label = "(")) +     # lower bound
      geom_text(aes(y = pct + bound, label = ")")) +     # upper bound
      scale_y_continuous(breaks = c(0, .2, .4, .6),
                         limits = c(0, .60),
                         expand = c(0, 0),
                         labels = scales::percent) +
      coord_flip() + theme_bw() +
      theme(axis.line = element_line(),
            axis.title.x = element_blank(),
            axis.title.y = element_blank(),
            # axis.ticks = element_blank(),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            panel.border = element_blank())
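
    The bound in the code above is one binomial standard error, sqrt(p(1 - p)/n), with n = 46. A short sketch of the same calculation on the percentage scale (the helper name binom_se is made up for illustration):

    ```r
    # Hypothetical helper: binomial standard error of a proportion p from n cases.
    binom_se <- function(p, n) sqrt(p * (1 - p) / n)

    pct <- c(Negative = .239, Neutral = .522, Positive = .239)
    se <- binom_se(pct, n = 46)

    # Estimate with lower and upper bounds, in percent, rounded to one decimal.
    bounds <- round(100 * cbind(lower = pct - se, estimate = pct, upper = pct + se), 1)
    print(bounds)
    ```

    With these inputs the bounds come out to roughly 17.6 to 30.2 for the two 23.9% categories and 44.8 to 59.6 for Neutral.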

  4. Maybe you should take a look at J. Scott Long's book "The Workflow of Data Analysis Using Stata". This book is about organizing data analysis in a way that lets you reproduce every tiny number (and graph!), even years later. This topic is rarely taught but nevertheless very important. In this regard the book is a must-read, even for R users.

  5. This is not likely to be very appealing, but lately I've been making simple graphs like this in LaTeX, in the tikzpicture environment. It's very tedious, but sometimes I don't mind the tedium if in return I can have complete control over the details. The other benefit is I can use LaTeX directly in the graph without worrying about how it's rendered.

    Here's the entire document to remake that graph (I had indentations to make this easier to read, but they were eaten):

    \documentclass{article}
    \usepackage{tikz}

    \begin{document}
    \begin{tikzpicture}[scale=0.05]
    \draw (-37,30) node[right]{\large Negative};
    \draw[line width = 2pt] (0,30)--(23.9,30);
    \filldraw (23.9,30) circle (35pt);
    \draw (17.6,30) node{(};
    \draw (30.5,30) node{)};

    \draw (-37,20) node[right]{\large Neutral};
    \draw[line width = 2pt] (0,20)--(52.2,20);
    \filldraw (52.2,20) circle (35pt);
    \draw (44.8,20) node{(};
    \draw (59.6,20) node{)};

    \draw (-37,10) node[right]{\large Positive};
    \draw[line width = 2pt] (0,10)--(23.9,10);
    \filldraw (23.9,10) circle (35pt);
    \draw (17.6,10) node{(};
    \draw (30.5,10) node{)};

    \draw (0,3)--(60,3);
    \draw (0,3)--(0,1);
    \draw (20,3)--(20,1);
    \draw (40,3)--(40,1);
    \draw (60,3)--(60,1);

    \draw (0,-4) node{0\%};
    \draw (20,-4) node{20\%};
    \draw (40,-4) node{40\%};
    \draw (60,-4) node{60\%};
    \end{tikzpicture}
    \end{document}

  6. I have a question about that graph. Do you prefer parentheses to straight vertical lines for marking the bounds? I've tried it both ways. I like the aesthetic of the parentheses better, and I like that using parentheses mirrors notation that would be used in a table.

    My only concern is that if we put a parenthesis on the line, we might not be sure of what point on the line is being marked. For instance, in my tikzpicture above, I think the parenthesis as a whole is centered at, say, the value 17.6, but since the parenthesis is curved, this means that it does not cross the line at 17.6. Using "node[right]" or "node[left]" instead of "node" would probably fix this, but I don't think it's that easy with other programs. Besides this, I wonder if the position of a vertical line isn't more clearly perceived by the user than the position of a curved line.

    Perhaps I'm being overly concerned with exactitude here, and that's what a table is for? (Nobody complains that the dot isn't zero-dimensional.)

  7. The way I use R — which I use for almost all of my data analysis and graphics — I can find the code for whatever plot or analysis I've published, and indeed most of the ones I haven't published. But I'd be really embarrassed to give most of that code to anyone. Not because I'm afraid that it's wrong, although of course there is always that possibility at some level, but because my programming style is so shoddy, and because so often I put in ugly little kludges to handle special things that come up. I always start with the best intentions and aim for a high degree of generality, but end up mucking around in the details, hard-coding the angle at which text gets written on figures and so on. And sometimes, when I need to add some functionality to some kind of wrapper function, I'll do it in an ugly way.

    I suppose if I had a policy of always making my code public when I publish something — probably this would be a good policy, right? — it might help the quality of my work quite a bit. Almost certainly I would be a lot more careful when I write stuff, which would help when I go back 6 months or 3 years later and wish to reuse some of the functions or reanalyze with new data. Sounds like a pain, though.

  8. I suppose if I had a policy of always making my code public when I publish something … Sounds like a pain, though.

    Sounds like a New Year's resolution.

  9. Part 1: (figure and ggplot2)

    Thanks for the comments. With slight modification, I have a very beautiful and informative draft of a plot [1] and I appreciate the assistance of your readers. I'm still working out the uncertainty estimates and explanation.

    ggplot2 looks like a great package, and this is my first use. I used gimp for the label and increasing the line thickness, and still need to work out some details of the plotting. ggplot2 seems like an excellent package for creating Tufte-friendly plots.

    [1] https://netfiles.uiuc.edu/dlebauer/VarianceDecomp

    Part 2 (publishing code)
    Andrew – your answer and the ensuing discussion were much more useful than the original code would have been – so thanks!

    In the general context of code / data, I do have some thoughts on future directions and I appreciate the opportunity to share my perspective.

    I hope that sharing data and computer code will become a necessary component of doing science. I would like to see scientists and statisticians documenting any and all computational steps and data used to support the conclusions of a publication.

    At the present time, about half of researchers in my field freely share data and fewer share code. In many cases, the published details of a statistical test are insufficient to recreate the analysis, even if the data were made available. It would be relatively easy to attach the code and data to a paper, especially as an online supplement, and to deposit them in one or more accessible repositories. Sharing the data behind publicly funded research should be the default practice.

    In this context, it is not necessary to make code general, or pretty or efficient. Scientific measurement involves tinkering and development; the computation should be judged on functionality rather than efficiency, beauty, or generality. In many contexts, it is not worthwhile to seek anything but functionality (and functioning as intended). Good meta-data practices are important but rarely found in unshared data.

    Not only can code and data sharing achieve the objective of reproducible science, it should also facilitate scientific progress that makes use of our expanding data and computational resources.
