Dean Eckles writes:
Some of my coworkers at Facebook and I have worked with Udacity to create an online course on exploratory data analysis, including using data visualizations in R as part of EDA.
The course has now launched at https://www.udacity.com/course/ud651 so anyone can take it for free. And Kaiser Fung has reviewed it. So definitely feel free to promote it! Criticism is also welcome (we are still fine-tuning things and adding more notes throughout).
I wrote some more comments about the course here, including highlighting the interviews with my great coworkers.
I didn’t have a chance to look at the course, so instead I responded with some generic comments about EDA and visualization (in no particular order):
- Think of a graph as a comparison. All graphs are comparisons (indeed, all statistical analyses are comparisons). If you already have the graph in mind, think of what comparisons it’s enabling. Or if you haven’t settled on the graph yet, think of what comparisons you’d like to make.
- For example, Tukey described EDA as the search for the unexpected (or something like that, I don’t remember the exact quote). But, if you think about it, the unexpected is necessarily defined relative to what is expected, thus the (possibly implicit) model that the graph is being compared to.
- Consider two extreme views: (a) a graph as pure exploration, where you bring no expectations whatsoever to the data, and (b) a graph as pure execution, where you know what you want to show and then you show it. The truth should always be in between. Exploration is always relative to expectations, but on the other hand you always want the capacity for being surprised.
- No need to cram all information onto a single graph. Make multiple graphs, each clear for its own purpose.
- A related point: Make each graph small; then you can put lots of graphs on a page (or screen).
- A tradeoff: clarity, recognized graphical forms (time series, scatterplot, etc), and spare (Tufte or Cleveland-like) design make a graph easier to read. But too many similar-looking spare graphs can blur in the mind and then you’re not fully engaging the reader’s visual brain. I’m not pro-chartjunk but a bit of color and glitz, even if not strictly necessary, can help.
- I like line plots. A graph with two or three lines (labeled directly, not with a legend, please!) allows comparisons across each line and between lines. I’ve found this to be amazingly effective.
- You are the first audience for your graphs. I don’t see a big difference between exploratory graphics and presentation graphics. When I make graphics for myself, I make them (roughly) presentation quality: I make them in pdf, give titles and axis labels, grids of graphs, etc.
- Statistical graphics is commonly presented as being exploratory and about plotting the raw data. I think that’s important, no doubt about it, and I don’t do enough of it. In particular, I need better tools for data cleaning. But, in addition, beyond data exploration, graphics is important for understanding the models we have fit, so I also like the term “exploratory model analysis.”
- I’ve only very rarely made dynamic graphs (oddly enough, the only one I can think of offhand, I made for a research project in 1989 and never published or even showed it to anyone else, I think). This is not a brag. I think dynamic graphics have great potential. I just want to be honest about my limitations. When teaching, it’s good to give the students a good sense of the areas I don’t know about, to help them better map the territory relative to my course material.
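
A few of the points above (small graphs in a grid, two or three lines labeled directly rather than with a legend, titles and axis labels, pdf output) can be sketched in code. What follows is only a minimal illustration, written in Python with matplotlib rather than the R used in the course. The data are made up, and the function name `small_multiples`, the panel names, and the output file name mentioned afterward are all hypothetical.

```python
import math

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt


def small_multiples(panels, x, ncols=2, path=None):
    """Draw one small line plot per panel, labeling each line at its right end."""
    nrows = math.ceil(len(panels) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(3 * ncols, 2.2 * nrows),
                             sharex=True, sharey=True, squeeze=False)
    flat = axes.ravel()
    for ax, (title, lines) in zip(flat, panels.items()):
        for label, y in lines.items():
            (line,) = ax.plot(x, y)
            # Direct label beside the last point, in the line's own color,
            # instead of a separate legend.
            ax.annotate(label, xy=(x[-1], y[-1]), xytext=(3, 0),
                        textcoords="offset points", va="center",
                        fontsize=8, color=line.get_color())
        ax.set_title(title, fontsize=9)
        ax.margins(x=0.25)  # leave room on the right for the labels
    for ax in flat[len(panels):]:
        ax.set_visible(False)  # hide unused cells in the grid
    for ax in axes[-1, :]:
        ax.set_xlabel("Year")
    for ax in axes[:, 0]:
        ax.set_ylabel("Outcome")
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)  # a ".pdf" extension gives vector output
    return fig, axes


# Made-up data: four panels, two lines each, sharing the same axes so that
# comparisons can be made both within and across panels.
x = list(range(2000, 2010))
panels = {
    f"Region {name}": {
        "treated": [1.0 + slope * (t - 2000) for t in x],
        "control": [1.0 + 0.5 * slope * (t - 2000) for t in x],
    }
    for name, slope in [("A", 0.1), ("B", 0.2), ("C", 0.3), ("D", 0.4)]
}
fig, axes = small_multiples(panels, x, ncols=2)
```

Passing something like `path="eda_grid.pdf"` would write the grid to a pdf. The shared axes are a design choice: they keep the panels directly comparable, at the cost of compressing any panel whose values cover a much smaller range than the others.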