## Visualizing correlation matrices

See here. It’s an important issue, but their plot has two huge problems:

1. The big fat circles on the diagonal convey no information and are, to my eye, a distraction.

2. They forgot to order the variables, which creates a confusing pattern. Try reordering to put the highly correlated variables together (as Tian did for Figure 8 in our article).

They also gave the variables unreadable abbreviations. This is not specifically an error with the correlation plot but it’s a common mistake that can easily be avoided.

P.S. More here from Eduardo and John.

1. lb says:

I actually disagree with point 1. They serve to set the scale for the correlations we care about. Otherwise your mind has to produce the inscribed circle by itself. Also, if you forget whether black means positive or negative, they provide a quick reminder.

I definitely agree with point 2, however, and I would prefer to have only the lower triangular part, instead of repeating each of the correlations.

2. Sanjay says:

I agree with not repeating the correlations above and below the diagonal. One of the commenters in the linked blog points to this page, which has some better uses for the space above the diagonal. Not crazy about the pie charts, but the smoothed fits and scatterplots could be informative.

Also, a minor variation on #2… instead of ordering the variables based on how highly they're actually correlated in the data, you can order them based on some a priori expectation (if you have one) and use the graphic to check whether the data fit your expectation. For example, if you are correlating items on a multidimensional questionnaire, and you group items together by subscale, you should expect to see a pattern of dark triangles just below the diagonal (corresponding to the within-subscale intercorrelations). Or if you have longitudinal panel data and you order the variables sequentially by time, you would see a pattern of diagonal stripes of decreasing darkness if the data are autocorrelated. Etc.
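
To make the longitudinal case concrete, here is a minimal sketch in Python. It assumes the simplest kind of autocorrelation, an AR(1) process, in which the correlation between waves i and j is rho^|i-j|; that model, the value rho = 0.7, the six waves, and the function names are all assumptions for illustration, not anything from the linked plots. It prints only the lower triangle, as a crude text shading.

```python
def ar1_corr(n_waves, rho):
    """Correlation matrix of an AR(1) process: corr(i, j) = rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_waves)] for i in range(n_waves)]


def shade(r):
    """Map |r| in [0, 1] to a character; denser characters mean stronger correlation."""
    ramp = " .:-=+*#%@"
    return ramp[min(int(abs(r) * len(ramp)), len(ramp) - 1)]


# Hypothetical example: six waves ordered by time, rho = 0.7.
corr = ar1_corr(6, 0.7)

# Print the lower triangle only, so no correlation is shown twice.
for i, row in enumerate(corr):
    print("".join(shade(r) for r in row[: i + 1]))
```

Every entry on a given off-diagonal band has the same value, and the bands fade as they move away from the diagonal, which is the stripe pattern described above. Real panel data that depart from those stripes would be departing from the time-ordering expectation.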

3. I use R's heatmap function, which uses hierarchical clustering to automatically re-order the variables so that correlated ones are close together.

Also see this as a variant on the link: http://addictedtor.free.fr/graphiques/RGraphGalle
Nice if you have a good ordering.
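
For readers who want to see what clustering-based reordering does, here is a rough sketch in plain Python. It is not R's `heatmap` algorithm: it assumes average linkage on the distance 1 - r, and the variable names and correlation values are made up for illustration.

```python
def cluster_order(corr):
    """Order variables by average-linkage agglomerative clustering on 1 - r.

    Each merge concatenates the two clusters' members, so the members of
    every cluster end up next to each other in the final ordering.
    """
    clusters = [[i] for i in range(len(corr))]

    def dist(a, b):
        # Average distance (1 - r) over all cross-cluster pairs.
        return sum(1 - corr[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical data: a and c go together, and so do b and d.
names = ["a", "b", "c", "d"]
corr = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.0, 0.8],
    [0.9, 0.0, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]

order = cluster_order(corr)
reordered = [[corr[i][j] for j in order] for i in order]
print([names[i] for i in order])
```

After reordering, the strong correlations sit next to the diagonal instead of being scattered across the matrix. Using 1 - |r| as the distance instead would also group strongly negative pairs together.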

4. Aleks Jakulin says:

My own approach also uses hierarchical agglomerative clustering, but I also reorder the result that comes out of clustering to minimize boundaries. I use only grayscale instead of the terrible Christmas tree colorified look. As an example, see this picture of the similarities between US Senators.

5. David Smith says:

That's a really neat chart, Aleks. It would be fascinating to see it updated for the 2008 Senate.

6. Dreas Nielsen says:

Another approach to visualizing correlations is to discard the matrix format and use link-node diagrams to represent the relationships. These are illustrated by the figures here: http://www.integral-corp.com/files/1055/FiguresNi… . These figures use line color and line weight to represent the differing sign and magnitude of the correlation coefficients (or similarity measures, in the second figure). They are subject to criticism from a cognitive-perceptual standpoint, and so are not suitable replacements for a correlation matrix in published work, but they do provide a rapid (and effective, we think) means of distinguishing strong and weak relationships, and thus of focusing further analyses. As the first figure shows, contrasts between data sets are easily visualized. The link-node-diagram approach also allows incorporation of spatial relationships for environmental data, as shown by the second figure.

The first of the figures was created using R to perform the calculations and to write a script for the GraphViz 'circo' tool. The spatial figure can also be produced using GraphViz, but we've scripted it using R and a GIS system for more flexibility in setting up the basemap.
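
The general idea of generating a GraphViz script from a correlation matrix can be sketched as follows. This is not the script described above (which was written in R); it is a small Python illustration in which the function name, the 0.3 cutoff for dropping weak links, the colors, and the line-width scaling are all arbitrary assumptions.

```python
def correlations_to_dot(names, corr, threshold=0.3):
    """Build a GraphViz DOT script: one node per variable, one edge per
    correlation with |r| >= threshold. Color encodes the sign and line
    width encodes the magnitude (both choices are illustrative)."""
    lines = ["graph correlations {", "  layout=circo;"]
    for name in names:
        lines.append(f'  "{name}";')
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = corr[i][j]
            if abs(r) < threshold:
                continue  # leave weak relationships out of the diagram
            color = "black" if r > 0 else "red"
            width = 1 + 4 * abs(r)
            lines.append(
                f'  "{names[i]}" -- "{names[j]}" '
                f"[color={color}, penwidth={width:.1f}];"
            )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical three-variable example.
names = ["a", "b", "c"]
corr = [
    [1.0, 0.9, -0.5],
    [0.9, 1.0, 0.1],
    [-0.5, 0.1, 1.0],
]
dot = correlations_to_dot(names, corr)
print(dot)
```

Saving the printed text to a file such as `correlations.dot` and running `circo -Tpng correlations.dot -o correlations.png` should render it, assuming GraphViz is installed.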