A statistical graphics course and statistical graphics advice

Dean Eckles writes:

Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA.

The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout).

I wrote some more comments about the course here, including highlighting the interviews with my great coworkers.

I didn’t have a chance to look at the course, so instead I responded with some generic comments about EDA and visualization (in no particular order):

– Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what comparisons you’d like to make.

– For example, Tukey described EDA as the search for the unexpected (or something like that, I don’t remember the exact quote). But, if you think about it, the unexpected is necessarily defined relative to what is expected, thus the (possibly implicit) model that the graph is being compared to.

– Consider two extreme views: (a) a graph as pure exploration, where you bring no expectations whatsoever to the data; (b) a graph as pure execution, where you know what you want to show and then you show it. The truth should always be in between. Exploration is always relative to expectations, but on the other hand you always want the capacity to be surprised.

– No need to cram all information onto a single graph. Make multiple graphs, each clear for its own purpose.

– A related point: Make each graph small, then you can put lots of graphs on a page (or screen).
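A minimal base-R sketch of the small-multiples idea (the panel layout comes from par(mfrow = ...); mtcars here is just a built-in example dataset):

```r
# Six small histograms on one page: 2 rows x 3 columns of panels.
# Tight margins keep each panel compact.
par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
for (v in vars) {
  hist(mtcars[[v]], main = v, xlab = v, col = "grey80")
}
```

(ggplot2 users get the same effect with facet_wrap() or facet_grid().)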

– A tradeoff: clarity, recognized graphical forms (time series, scatterplot, etc), and spare (Tufte or Cleveland-like) design make a graph easier to read. But too many similar-looking spare graphs can blur in the mind and then you’re not fully engaging the reader’s visual brain. I’m not pro-chartjunk but a bit of color and glitz, even if not strictly necessary, can help.

– I like line plots. A graph with two or three lines (labeled directly, not with a legend, please!) allows comparisons across each line and between lines. I’ve found this to be amazingly effective.
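To illustrate direct labeling in base R (the data here is invented purely for illustration):

```r
# Two lines labeled directly at their right-hand ends instead of in a legend.
years <- 2000:2010
a <- 50 + 2.0 * (years - 2000)   # hypothetical series A
b <- 55 + 0.8 * (years - 2000)   # hypothetical series B
matplot(years, cbind(a, b), type = "l", lty = 1, col = c("blue", "red"),
        xlab = "Year", ylab = "Outcome",
        xlim = c(2000, 2012))    # extra room on the right for the labels
text(2010.2, a[length(a)], "Group A", adj = 0, col = "blue")
text(2010.2, b[length(b)], "Group B", adj = 0, col = "red")
```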

– You are the first audience for your graphs. I don’t see a big difference between exploratory graphics and presentation graphics. When I make graphics for myself, I make them (roughly) presentation quality: I make them in pdf, give titles and axis labels, grids of graphs, etc.

– Statistical graphics is commonly presented as being exploratory and about plotting the raw data. I think that’s important, no doubt about it, and I don’t do enough of it. In particular, I need better tools for data cleaning. But, in addition, beyond data exploration, graphics is important for understanding the models we have fit, so I also like the term “exploratory model analysis.”

– I’ve only very rarely made dynamic graphs (oddly enough, the only one I can think of offhand, I made for a research project in 1989 and never published or even showed it to anyone else, I think). This is not a brag. I think dynamic graphics have great potential. I just want to be honest about my limitations. When teaching, it’s good to give the students a good sense of the areas I don’t know about, to help them better map the territory relative to my course material.

1. Kaiser says:

Andrew: this is really great. I have two comments. I do think there is a difference between exploration and presentation. Certain tasks are onerous and do not add much value to the explorer: for example, text labels that have highly varying lengths require work and I typically don’t want to waste time editing those unless I know I will be using that version of the chart for presentation.
On dynamic/interactive charts: they are vastly over-valued. One of the questions that I have students in my dataviz workshop discuss is when interactivity adds value to the chart and when it destroys value. Surprisingly, the latter is common, and here I’m referring back to your initial comment that comparison is an essential task for graphics.

• Rahul says:

I think interactivity / dynamic charts are fantastic for exploratory work on complex / big data-sets but mostly only a coolness honey trap in presentation mode.

• Phil says:

I agree with Kaiser about exploration vs presentation. I make lots and lots of exploratory graphs, often without axis labels or legends, usually using R defaults for margins and such. I might make 6 plots in a minute. A presentation-quality plot takes at least 5-10 minutes, and a publication-quality graph can take anywhere from 10 minutes to an hour or even two as I tinker with aspect ratio, axis label sizes, line types, colors, and so on. One of the things I love about R is that I can make plots so quickly, even if they’re kind of crummy by default.

I would love to make more interactive plots and I wish I knew how to do it better in R. I do use identify() sometimes, but there’s a ton more I would do if I knew how and if it were easy enough.
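The basic identify() pattern, for what it’s worth, is just a few lines (mtcars as a stand-in dataset; the interactive() guard is there because clicking needs a live graphics device):

```r
# Click points on a scatterplot to label them with their row names;
# right-click or Esc ends the session. identify() returns the indices
# of the clicked points, which can then be inspected directly.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "MPG")
if (interactive()) {
  picked <- identify(mtcars$wt, mtcars$mpg, labels = rownames(mtcars))
  print(mtcars[picked, c("wt", "mpg")])
}
```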

2. John Mashey says:

“One of the questions that I have students in my dataviz workshop discuss is when interactivity adds value to the chart and when it destroys value. Surprisingly, the latter is common”

Can you say some more on this? It’s an important topic.
Sometimes interactivity is chartjunk in Tufte’s sense, but other times it is incredibly valuable: without it one could simply not get the data analysis done, or one interactive display usefully replaces a large group of graphs.

• Kaiser says:

The simplest example is the mouse-over data labels instead of printing data labels on the page. If I care about the data behind the chart, I’d want to compare several numbers, not see them one at a time.
Another example is the use of a dot moving sideways instead of a line chart for the time series. If I care about the trend, I want to see all the data on the same screen and not have to rely on my spatial memory.
The key is to prioritize the list of comparisons and make sure that the top tasks are the easiest to get to.

• Daniel Wright says:

I don’t use dynamic and interactive plots much, but in my organization (ACT Inc.) we use a lot of interactive plots where, for example, the user can choose his/her state to highlight on a time series, or a student can adjust some of his/her values and see how that affects predictions. So I think the point of interactive plots is that they are for a user, and you don’t know what the user wants. In research you know the story that you want to convey to the reader, so they are not used much (if a reader wants to look at something different, they usually have to ask the author).

The only dynamic plots I make (I use “dynamic” to mean plots that are already in a “video” form [usually a bunch of animated plots], where the user can only pause and stop the sequence like a video, so not otherwise interactive) are for showing associations among lots of variables (and for teaching things like the bootstrap). The video steps through removing edges (e.g., via the graphical lasso, or glasso). It looks cool, but publications showing several plots at different thresholds for removing lines are also useful.

Any thoughts on the packages for rotating scatter plots of different dimensions?

• Rahul says:

@Kaiser

You are focusing on the junk interactivity. There’s useful interactivity too. e.g. 50 time series, one for each state, are a mess to see. Interactivity lets the user focus on bite-sized comparisons of interest.

Or on a dense scatter plot, say countries of the world, health vs. GDP, a label on mouseover makes it far easier to identify outliers. Printing all labels would be a mess. Or a scatterplot with labels where mouseover gives precise x,y coordinates.

Filters in general are useful interactivity.

• Kaiser says:

Yes, I was giving examples of what doesn’t work because that’s what John seemed to be asking. But there are definitely examples in which interactivity adds value.

• John Mashey says:

Thanks.
1) Indeed, use of mouseovers that way is silly. But I have seen them used well, as in this example, where the curves are a set of reconstructions of past temperatures (using various statistical methods), often called a spaghetti chart. For some uses I prefer other representations, like (c), but here the mouseover highlights the specific series of interest out of the clutter. I’ve never seen it done, but I’d like to see the (c) cloud representation with mouseovers to select one or more lines to overlay, to get a feel for outliers or differences.

2) Dot moving sideways: yes, ugh. Although a really nice example, for dealing with time-series with a huge range of time-scales, is an animation from NOAA.
On recent time-scales, it shows the geographic variability, year-by-year, then keeps expanding the period. I could wish for a slider bar.

3) Back at Silicon Graphics in the 1990s, many of our 3D graphics workstations were used in a hierarchy going from visualization of real objects to visualizations of different sorts of data with less obvious representations, the latter being more relevant to statistics, whether EDA or presentation.

a) Real things: automobile styling (where visual fidelity really, really matters, and one needs to get lighting, shadows, etc. right).

b) Real things, but with visual elements added, such as airflows around planes or cars, heat flows in plastic injection molding for Barbies or chocolate ice cream bars, color-intensity displays of stresses on car-crash dummy simulations, or visual flythroughs of seismic data for oil exploration.

c) Then there were things that could be real, but weren’t, i.e., dinosaurs in Jurassic Park.

d) But then there were interesting applications where people on Wall Street were using big displays with multivariate representations of stock activity, to be able to see patterns quickly. Jurassic Park actually showed an example of such a tool, a sort of “data helicopter” over the UNIX file system – I know this scene.
People built some interesting apps on top of the software base, such as displaying voting patterns in the US, but a really good example had:
a) A map of the US by state, with a vertical bar showing per capita income in a specified year.
b) You could tilt/rotate/zoom so you could see states hidden behind nearby big bars.
c) The year was controlled by a slider. If you started at the beginning and moved forward, the bars would go up and down, but two odd events leaped out (since human eyes are good at detecting relative motion).
At one point Kansas leapt up for a few years compared to surrounding states, then subsided, and South Dakota did the same thing at a different time. Here one didn’t want to see a line graph of 50 states, but the change in the context of nearby state economies.
What was going on? Minuteman missiles.

• Nick Stokes says:

John,
“On recent time-scales, it shows the geographic variability, year-by-year, then keeps expanding the period. I could wish for a slider bar.”
I’ve been experimenting with paleo time scales here. There are controls like a slider bar (you can drag). I’ve incorporated the mechanics in a more general plotter here.

I spent a lot of time with IRIS GL in the ’90s too. So I was pretty excited when WebGL became a practical tool. I tried to work out mechanics for using it routinely here.

3. Dean Eckles says:

Thanks for posting, Andrew. I’m looking forward to folks here (or your students) checking out the course.

A lot of this advice really resonates with what we have in the course. For example, we emphasize making multiple graphs, where each might be better for seeing certain features. We have students work through “refining” plots a number of times, but often the first plots are not strictly dominated by the later ones.

As for when to add labels, etc.: I find it is good not to do this too early. When I’m doing an exploratory analysis, generally the raw axis labels are at least as informative to me as any simple label I’d want to give them for a different audience.

• Rahul says:

I’m confused. Is it a free course or $150/month?

• Dean Eckles says:

All of the course materials (including the videos, quizzes, etc) are free.

The paid version gets you tutoring, grading (esp. useful for the final project), and certification if you complete/pass the course. This makes a lot of sense for some people, maybe not as much for others.

4. derek says:

The trouble with dynamic graphs is you have to put a ton of effort into them if they’re not to just break, and then only a few people use them. Or, if you’re making the graph for EDA, you have to put a ton of effort into making them, and then you’ve used up time you could have spent exploring. There are two ways round this in my experience:

1) easy interfaces designed by a software vendor; Microsoft have made me a great dynamic tables app, a.k.a. Excel Pivot Tables, and I send them out all the time. They’re so robust and intuitive, even managers can and do use them instead of bugging me for followup details. (Pivot charts, not so much, sorry)

2) make once, use many times, for many people at a time. This is not my experience, because I don’t have that sort of job, but the New York Times has some fantastic dynamic maps of the 2,000 or so counties of the USA, and they use the code for it over and over again. Making that dynamic graph was a good investment.

• John Mashey says:

Software progresses by doing individual cases, and after getting experience, generalizing them enough to put them in libraries of some sort, or inventing higher-level languages that offer more leverage in some specific domain.
The re-use theme was already decades old before the 1977 talk. Doug McIlroy was talking about it in 1968, but it actually goes back 20 years before that to some extent.

I often ask people at talks how many are programmers.
Then I ask how many use Excel, observing that people routinely do things with it that would have taken hundreds of lines of FORTRAN in the 1960s, ignoring the graphics.

I don’t know offhand what got John Chambers doing S, but the mid-1970s at Bell Labs witnessed a big uptick in computing from minicomputers/UNIX, and various people were building languages/tools to avoid having lots of people spend so much time cranking out low-level code, hence:
– awk
– S
– shell programming
– etc

Anyway, interactive graphing can be a terrific help, or the moral equivalent of chartjunk; better tools help as do explorations of different presentations, as per Solomon Hsiang.

5. Robert Grant says:

What did you mean by dynamic graphs? Everybody here is talking about interactive online graphs but if you made it in 1989 then I’m guessing that’s not what you meant. Unless it was done by flicking rapidly through the pages of your notepad.

• Andrew says:

Robert:

1989 wasn’t so long ago! My dynamic graph was a movie that I programmed in S on a Sun workstation. It was a simulation of a random walk through a space of models of a network.

6. question says:

What do you all think about dynamite plots (means with error bars)? These account for the vast majority of charts I come across:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiRcode/Poster3.pdf

Does anyone here know who first invented these?

7. Antony says:

It’s nice to see interactive graphics being discussed though it’s surprising how many do not seem to have seen real interactive graphics in action.

Interactive graphics is much more than dynamic graphics and much more than querying and selecting in single plots. IG is about fast and flexible tools for selecting and linking between multiple plots, zooming, reformatting and analysing on the fly. A possible analogy might be walking and cycling: you can walk anywhere, but if you are going where you can use a bike it is much faster and much more enjoyable to cycle. IG requires effective software. The unpolished R package iplots has plenty of IG features, but Martin Theus’s Mondrian is better for showing the true potential of interactive graphics for data analysis.

@Robert Grant There was good IG software available in 1989. Early versions of Data Desk and JMP were really impressive.

• Nick Cox says:

Antony is naturally correct (except that walking is pleasant and I have less risk of being knocked down when I walk than when I cycle).

Interactive and dynamic graphics (*) was very alive and well 25 or so years ago. See the 1988 collection http://dl.acm.org/citation.cfm?id=576082

(* I will leave definitions and distinctions to others.)

15 years before that was PRIM-9: http://stat-graphics.org/movies/prim9.html

The neglect of dynamic graphics, I guess, owes most to

1. Focus on the graphics that could be included in a journal or book: despite e-versions, most such publication outlets retain interest mostly or exclusively in what you could print on paper. (Even with presentations, a finite time allowance cuts into demonstrations that are dynamic.)

2. We are mostly less willing to spend time interacting with other people’s data than we would be on discovering something about our own.

3. Reporting of interactive graphics is trickier to write and to understand than is comment on a static graph.

4. A perception that dynamic graphics help less than is often claimed.

5. Academic snobbery: dynamic graphics seen as a step towards computer games and so forth.

6. The neglect of dynamic graphics. If people around you just use static graphics, you do too.

Antony (and others) would have good comments on all of these; none is intended as a “knockdown argument,” and this is clearly in no sense a complete list.

• Kaiser says:

That’s a good list. The most important reason is that interactive charts require audience participation. I wrote about this some time ago at the ASA forum: there is a return-on-time-investment calculation from the reader’s perspective. If the designer has discovered some interesting feature in the data, usually there is a clean way to highlight those parts of the data, instead of burying the message in layers of other data. A simple summary: the more work the designer does, the less work the reader has to do, and the more effective the presentation is.

The other wrong-headed premise of many interactive charts is that every part of a dataset is equally important. Are there examples of interactive charts in which the units/variables/cuts, etc. are not interchangeable, i.e. where there is a sign of statistical intelligence?

That said, there are definitely specific instances in which interactive charts are useful. It is also more likely to be useful in science than in everyday reading. It is much more useful in exploratory charting than in presentation graphics.

• Nick Cox says:

I mostly agree. Sometimes it seems as if you’re being invited to go on a treasure hunt when you’re happy that someone just shows you the treasure they found. That would be quicker and more direct. Of course, there might be other treasure, or the treasure is really trash, and so forth.

But the premise of an interactive graphic can be precisely that different parts are not equally important to all readers. People may not care about New Jersey or North Dakota or North-East England, but if those that do have a way to select it, they win and no-one else loses.

• Rahul says:

Sometimes you just want to expose your dataset though. To let people discover whatever nuggets they can from it.

• John Mashey says:

“It is much more useful in exploratory charting than in presentation graphics.”

Yes, and I think that hints at a useful approach:
A. Start with something interactive, to explore the data and gain insight.
B. Then, if the end goal is static images, hopefully one can select some that are especially good.

My main concern is that sometimes people seem to start with the restrictions of static graphs in mind (not all bad, since sometimes a familiar, if limited, format communicates OK).

“It is also more likely to be useful in science” – yep, a lot of the beautiful images one sees published are from 3D interactive models, from which a few especially good frames are then selected. I mentioned:
‘b) Real things, but with visual elements added, such as … color-intensity displays of stresses on car-crash dummy simulations’

I’ve never seen that image published, but long ago Ford engineers were running (many) simulated car crashes and were *very surprised* to find that, in one narrow range of not-quite-head-on collision angles:
a) The simulated dummy would pitch forward.
b) The airbag would knock it back.
c) Just in time, the left roof strut, having collapsed inward, would smash into the simulated skull, which in the simulation produced a bright yellow flash. … And if you were publishing a frame, that would be it, from the right visual angle.

The only way to find this sort of thing at the time was to run the crash sims overnight, then use a graphics supercomputer to visualize the results interactively, from different angles.

This is an example of creating masses of data, then using interactive graphics to help humans notice surprises.

They fixed the problem, I think by actually weakening the strut (as was going on with many redesigns, where they were building cars with crumple zones).

• Rahul says:

GGobi was a similar system I had toyed with.

8. ezra abrams says:

As a scientist, I think the *only* rigorous measure of how good a graph is, is to show the graph to a target audience and ask them what they think.
I haven’t looked into this for a year or so, but scattered through a very diffuse literature is a lot of info on this – it goes back to the AT&T guy who showed that 45° angles are best in scattergrams.
For instance, there is a researcher in Michigan who studies how different graphics are interpreted by breast cancer patients.
Anyway, the point is, no one has made the effort to put this very, very diffuse group of people in touch with each other.

I would also say we are incredibly hobbled by ideas that were driven by slide-rule-era technology; when I was a grad student in the late ’70s, making a simple scattergram was *painful* – you had to draw each point by hand.

So, one, a lot of what people like Tukey did was in response to this and, sadly, is now as outdated as FORTRAN punch cards.
Second, we still see in journals, all the time, really bad charts that derive from this historic era.

E.g., suppose you have a simple scattergram with 4 or 5 variables – say time on the x axis and % of US population on the y, and you have lines for white, Asian, Hispanic, Jewish, whatever.
The old thing – which was purely an artifact of hand-drawing charts – was to label the lines with either different symbols or maybe letters, and then below the chart have a block of text: “…(a) white; (b) Asian…”
The trouble with this block of text is that it is really hard to go back and forth between the text and the graph.
Nowadays we can add a marker at the end of the line…

The point is, before you get involved in all sorts of fancy stuff like dynamic charting, make sure you clear your brain of all the cobwebs left over from as recently as the ’80s, when making a publication scattergram was an hour’s task with Rapidographs.

The dead weight of technology is huge,
as is the failure to do empirical studies on graph effectiveness; that is why I dislike Tufte so much and prefer N. Robbins. Tufte offers opinions, but that is all they are: opinions.