Should the points in this scatterplot be binned?

Someone writes:

Care to comment on this paper’s Figure 4?

I found it a bit misleading to do scatter plots after averaging over multiple individuals. Most scatter plots could be “improved” this way to make things look much cleaner than they are.

People are already advertising the paper using this figure.

The article, Genetic analysis of social-class mobility in five longitudinal studies, by Daniel Belsky et al., is all about socioeconomic status based on some genetics-based score, and here’s the figure in question:

These graphs, representing data from four different surveys and plotting SES vs. gene score (with separate panels for family SES during childhood), show impressive correlations, but they’re actually graphs of binned averages. I agree with my correspondent that it would be better to show one dot per person here rather than each dot representing an average of 10 or 50 people. Binned residual plots can be useful in revealing systematic departures from a fitted model, but if you’re plotting the data it’s cleaner to plot the individual points, not averages. Plotting averages makes the correlation appear visually to be larger than it is.
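Here’s a quick simulation, just to illustrate how much difference the binning can make (made-up numbers, nothing to do with the data in the paper): the correlation of the raw points is about 0.3, but the correlation of 50 bin averages comes out well above 0.9.

set.seed(123)
n <- 5000
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)                              # individual-level data
cor(x, y)                                            # roughly 0.3
bins <- cut(x, quantile(x, seq(0, 1, length.out = 51)), include.lowest = TRUE)
cor(tapply(x, bins, mean), tapply(y, bins, mean))    # correlation of the 50 bin means: above 0.9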

My only concern is that the socioeconomic index (the y-axis on these graphs) might be discrete, in which case if you plot the raw data they will fall along horizontal bands, making the graph harder to interpret. You could then add jitter to the points, but then that’s introducing a new level of confusion.

So no easy answer, perhaps? Binned averages misleadingly make the pattern look too clean, but raw data might be too discrete to plot.

Other than the concerns about binning, I think this graph is just great. Excellent use of small multiples, informative, clean, etc. A wonderful example of scientific communication.

36 thoughts on “Should the points in this scatterplot be binned?”

  1. Aren’t correlations between averages potentially subject to the ecological fallacy – where the correlations may not hold at the individual level? In fact, the individual correlations are almost always much weaker than the averages and sometimes even go the opposite direction. Should that be a more serious concern than is being stated here?

    • The regression lines and results reported elsewhere in the paper are based on the individual data points; the binning was done only for visual representation.

      • “only for visual representation” is a pretty big “only”! The point of a paper is to convey information to the readers, and thus the point of a visual representation is to convey useful information to help readers understand the data behind the paper’s conclusions and how well they support those conclusions. A graphical choice that artificially increases the apparent strength of correlations is a big deal. It makes a skeptical reader wonder whether the choice of graphs was made based on “which most strongly supports my argument” rather than “which best demonstrates the data.”

        • M:

          There is no ecological fallacy in the models being fit in that paper, but arguably the graph encourages an ecological fallacy in that the most natural interpretation of the graph is to see those big correlations among the aggregates.

  2. For sample sizes of a few hundred to several thousand, hexagonal binning (see this R package, or others) is a good way to show the major patterns in the data while making it only very slightly coarser. The idea is not new; it was in JASA in 1987.
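    Something along these lines, as a quick sketch with made-up data (using the hexbin R package; darker hexagons mean more points in that cell):

    library(hexbin)
    set.seed(1)
    x <- rnorm(5000)
    y <- x + rnorm(5000)
    plot(hexbin(x, y, xbins = 40))    # each hexagon is shaded by the number of points it contains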

  3. Bins! Oh, glorious bins! This is of course a bit unrelated to the specific question at hand–but!–there are bins and Gelman.

    In some book (I’ve forgotten which one; I think it was Data Analysis Using Regression and Multilevel/Hierarchical Models), Gelman and Hill were talking about posterior predictive checks for Bernoulli GLMs. At this point I have to say that it has been a while since reading that book, so I might remember EVERYTHING incorrectly; I’m flying with just the instruments. Whatever. The suggestion in the book was to bin the data using some reasonable bin size. This isn’t a horrible idea; it was literally among the first ideas I came up with when thinking about the issue.

    But butt butterstein. There’s the awkward issue of choosing the bin size (or is it awkward? Am I the awkward one in this relationship). A convenient (I think) way of getting around this could be sorting the data and then looking at the cumulative sum of responses; there’s a rough sketch at the end of this comment. I ran a few simulations in R and this procedure seemed to work fairly well, at least in the few examples I tried.

    Cumulative distribution functions are regularly used in the analysis of reaction time data. Is there some reasonable reason why they are not used more often, am I going to hell for this, is the demiurge of this world dead.

    *Reference to an old Laestadian hymn.
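    Roughly what I mean, as a sketch with simulated data (this is my reading of the sorted-cumulative-sum idea, not necessarily what the book recommends):

    set.seed(1)
    n <- 2000
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))          # Bernoulli outcomes
    fit <- glm(y ~ x, family = binomial)
    ord <- order(fitted(fit))                          # sort observations by fitted probability
    plot(cumsum(y[ord]), type = "l",
         xlab = "observations, sorted by fitted probability",
         ylab = "cumulative number of successes")
    lines(cumsum(fitted(fit)[ord]), lty = 2)           # expected cumulative count under the model

    If the model is well calibrated, the two curves should track each other, with no bin size to choose.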

  4. A very famous binned scatterplot, the so-called “Phillips Curve”, misled (and continues to mislead) generations of economists by showing an artificially created relationship between inflation and unemployment. In the unbinned data, the relationship does not exist. All this is well-documented by Nancy Wulwick. Economics textbooks still present the Phillips Curve as if it is something real that has not been debunked.

    Wulwick, N.J. (1989) ‘Phillips’s approximate regression,’ Oxford Economic Papers 41, 170–88.
    Wulwick, N.J. (1996) ‘Two econometric replications,’ History of Political Economy 28, 391–439.

  5. It’s a nice graph, and I think that while the binning is maybe not optimal, it would not normally be a big issue.

    The problem in this particular case is that this field of study is an ideological minefield (which does not mean that it should not be studied, obviously). There will be people who believe in genetic determinism (especially those who also like to link that to race) who will jump on this and use the figure, without the text below it that explains the binning. This is probably why it reached the blog so quickly; I first stumbled across it two days ago, in a place where it was being used for just that purpose.

    • Erik:

      Yes. I purposely gave the above post a bland title and didn’t get into those issues because I have nothing much to add in that area, and I wanted this discussion to be more generally useful to people.

      If I’d really wanted clicks, I would’ve given the post a title such as “Do your genes really determine your success in life?” Or, to really be grabby, “Racists vs. Ostriches: Who’s Gonna Win?”

  6. It’s often the case that plotting the individual points is a useless mess when the number of points is high, especially if they take on discrete values.

    My usual technique there is to add random normally distributed noise with carefully chosen scale to eliminate discreteness, to use alpha settings to make regions with many points darker than sparse regions, and to subsample randomly for plotting purposes if needed. Any of those techniques would be better than binning.
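    Roughly like this, as a sketch with made-up discrete data (tune the jitter sd, the alpha, and the subsample size to the data at hand):

    library(ggplot2)
    set.seed(1)
    n <- 100000
    dat <- data.frame(x = rnorm(n))
    dat$y <- round(dat$x + rnorm(n, 0, 0.5))              # discrete response: horizontal bands if plotted raw
    sub <- dat[sample(n, 10000), ]                        # random subsample just for plotting
    ggplot(sub, aes(x, y + rnorm(nrow(sub), 0, 0.2))) +   # small jitter breaks up the discreteness
      geom_point(alpha = 0.05, size = 1)                  # transparency makes dense regions darker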

    • Alpha scaling (transparency) is a good idea but doesn’t work well with many thousands of points to plot: to avoid saturation you need to make the plotting color very pale indeed.

      But hexbinning works in those settings.

      • I have plotted 10k points on a single graph with small alpha; it actually works a lot better than you might think, provided you want to express what it does express, which is essentially a density plot. I’ve also found that it can work well to put an alpha background with all the points, and then plot in full saturation a small subset of the points over the top. For example, if you had 10k points, you’d plot them with maybe alpha=0.04 and then subset out 200 points and plot them with alpha=1 (rough ggplot2 sketch at the end of this comment).

        I just checked out some hexbinning plots, and in general I didn’t like them as much as the results of this kind of alpha, or alpha with subset fully opaque.

        https://www.meccanismocomplesso.org/en/hexagonal-binning/

        For example, in Fig. 3 at the above link, if you add normal noise to x and y with scale sd=2 to 4 somewhere in that range, make each point about 2 units wide, and plot the whole thing with alpha=0.05 or so, you’ll get a nice smooth density plot with no hexagonal structure. It’s basically a kernel density estimate, since the alpha means that the darkness adds together and the circular dot of diameter 2 or 3 is the spatial kernel.
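        The alpha-plus-opaque-subset idea, roughly, in ggplot2 (made-up data):

        library(ggplot2)
        set.seed(1)
        dat <- data.frame(x = rnorm(10000), y = rnorm(10000))
        sub <- dat[sample(nrow(dat), 200), ]           # small subset to show at full saturation
        ggplot(dat, aes(x, y)) +
          geom_point(alpha = 0.04) +                   # the full cloud reads as a density
          geom_point(data = sub, alpha = 1, size = 1)  # individual points plotted on top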

        • With all due respect, what I think might work and what you like the look of is not really important.

          Transparency, for all its good points, has weaknesses. One is that devices can only provide a limited range of color intensities, usually 256. If you want more than that you have to approximate, or truncate at least one end of the range of values you want to show.

          Another is that human visual perception of colors is not as accurate as that of areas. (See Bill Cleveland’s empirical work, also the JASA paper I mentioned before)

          Hexbinning, while also not perfect, can do better in both these regards. (And it doesn’t need your jittering step)

        • Hexbinning is still going to use a display limited in its color resolution.

          Alpha can be implemented by generating random numbers for the jitter and plotting as usual, literally one line. It doesn’t require a complex hex-bin membership calculation; that’s an advantage.

        • Hexbinning is still going to use a display limited in its color resolution.

          So what? With the lattice- or centroid-based versions (i.e., using area, not color, to indicate cell counts), the plots are in black and white, and that’s not surprising given that the method dates to the 80s.

          It doesn’t require a complex hex-bin membership calculation; that’s an advantage.

          The membership calculations are actually not hard – and more importantly are already coded in the R package I mentioned before. Implementing a hexbin is trivial.

        • The biggest issue with the hexbin package is that it doesn’t produce ggplot2-based plots (i.e., the system that the vast majority of people I know do all their plotting in), which is why I mentioned the trivial-implementation issue. But I see that ggplot2 has a hexbin function. I personally still don’t find it compelling as a visual display versus a kernel density, but for those who do, at least there is a modern version, geom_hex.

          The general form of the issue is 2D kernel density estimation vs. 2D histogram. I prefer the KDE for its smoothness and low noise; you prefer the hexagonal histogram. Both have advantages and disadvantages that should be considered.
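          For anyone who wants to compare the two side by side, roughly (made-up data; geom_hex uses the hexbin package under the hood, and geom_density_2d_filled needs a recent ggplot2):

          library(ggplot2)
          set.seed(1)
          dat <- data.frame(x = rnorm(50000), y = rnorm(50000))
          ggplot(dat, aes(x, y)) + geom_hex(bins = 40)        # hexagonal 2D histogram, fill mapped to count
          ggplot(dat, aes(x, y)) + geom_density_2d_filled()   # smoothed 2D kernel density, for comparison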

        • Daniel: all in favor of considering pros and cons, but your list of hexbinning’s disadvantages has been that:

          You didn’t like it as much as transparency.
          It relies on the device’s color resolution (which it doesn’t).
          It requires complex calculations (which it doesn’t, and the ones it does require are packaged up).
          It’s not available in ggplot2, which is the “biggest issue,” yet you mention it last. Except it is available: code for the version in which hexagon areas are proportional to cell count is in a gist on GitHub, with an example here.

        • Whoops, hit return accidentally.

          Yes, you were the one who mentioned that alpha blending is limited by the color resolution of 256 levels, which of course hex binning is also limited by because it’s a display issue. In either case you can assign multiple colors to points with different properties. The graphical color/saturation properties of the graph are completely symmetrical between the two, so that isn’t really an issue in favor of one vs the other either way.

          I don’t have any problem with someone else using hex-binning. I’m encouraged by the fact that it *is* available in ggplot2, although not through the hexbin package.

          You seem extremely promotional of hexbinning. I’m simply offering to the other readers some alternative thoughts. I’m not trying to argue in favor of alpha blending and against hex binning, though you seem to be defending hex binning as if I were. I’m offering opinions on aesthetic, practical, and data analytical issues.

          Hex-binning, like all binning, is subject to the noise issues that histograms have. If a point strays by epsilon across the boundary of a hex (or a box or a bin or whatever), the count in that hex changes by a finite quantity, ±1, so it’s not a continuous transformation. Alpha blending is continuous: if a point moves by epsilon, its “shadow” moves by epsilon, so it is a continuous transformation of the image. With discrete data, many points can sit at one spot epsilon away from a boundary, so large changes in brightness can occur due to small changes in point position or hex size. The same is true for plain old hist() in 1D.

          So, point by point, for the record, mainly for anyone else who might be following this, because obviously you have your own opinions already:

          1) For most people these days, any graphics need to be doable in ggplot2 and work with its modifications, environment, etc. This is neither for nor against the hex-binning concept, but it argues against the hexbin package (based on lattice/grid graphics) and in favor of geom_hex from ggplot2. Thanks to geom_hex there’s no need for people to figure out how to implement hex binning themselves. My concern about computational complexity was about manual implementation in ggplot2, which I think most people won’t do. So all is good whatever the case: if you want a hexbin in ggplot2, use geom_hex.

          2) Histogramming, whether hexagonal or square or octagonal or whatever, is subject to the noisiness of hard boundaries, across which an epsilon of point movement can induce a step-wise change in counts. This can result in artifacts if there are interactions between the structure of the data and the size of the hexes and/or the discrete resolution of the data levels. You can imagine a whole pile of data sitting on one discrete level just to one side of a hex boundary: that hex will be bright or big or whatever, and its neighbor small. But an epsilon change in the size of the hexes will make the “next” hex the bright/big one. The same thing is true for histograms.

          3) Alpha blending (with jitter for discrete cases, without for non-discrete) implements a kind of kernel density estimate, where epsilon changes in point position result in epsilon changes in shadow position. Even better would be if you could use a “point shape” that was itself a gradient; this may be possible, I’m not sure. The resulting alpha-blended heatmap has properties similar to geom_density or plot(density(…)) in 1D, but for the 2D case. It’s a useful tool to have, and aesthetically I prefer it; others may too, which is why I mention it. When it comes to graphics, aesthetics are important, but individual.

          4) Hexagonal binning may be confusing for people who don’t understand it, whereas a pile of points plotted over the top of each other is perhaps less out of the ordinary. The amount of explanation required differs between the two cases, depending on the audience. I can count the number of times I’ve seen a hexbin plot in a publication or online on the fingers of zero hands. Familiarity is a nontrivial issue, particularly when the audience isn’t from a stats background.

          All this being said, I can see why people choose hexbinning, and it seems worth trying out to see whether you like it.

          That’s pretty much all I’ve got on the topic. Thanks for pointing out hexbin, and for getting me to look up the ggplot2 equivalent.

        • Daniel — do you not find it a problem to have a vector graphics file (eg PDF) with so many points? I often use a similar approach, but find that if I plot a large number of points or lines, the resulting file is large and can have trouble rendering.

        • Also, because we’re talking about graphical display, it’s pretty rare that more than 10k points are really needed. If I’m generating a graph and I have 100k or 1M points, I’d probably just randomly subset down to 5k or 10k and plot those, and I haven’t had issues with vector files of 5k or 10k points. If you figure maybe a couple hundred bytes per point, a 10k-point file is around 2MB of raw pdf; compress the pdf and the file size is likely to be more like 200kB.

          I just tested it using ggsave on a 50000-point alpha-blended x,y scatterplot; the pdf is 3MB and opens with about a 3-second delay to render in evince on Linux.

          Code to reproduce:

          library(ggplot2)
          dat = data.frame(x = runif(50000, 0, 1))
          dat$y = round(10 * (dat$x + rnorm(nrow(dat), 0, .2)))    # discretized response
          plt = ggplot(dat, aes(x, y + rnorm(nrow(dat), 0, 1))) +  # jitter y to break up the discrete bands
            geom_point(alpha = .01, size = 4)                      # heavy transparency, reads like a density
          ggsave("/tmp/test.pdf", plt)

        • @Nick

          I have faced this same problem. The file size can be small but the rendering times in Adobe etc. somehow get slow. Not sure why.

  7. I have a related question: what would you say if the binning were in a multilevel model used for a repeated-measures design? Say I have some psychophysics study with 20 people and maybe 500 trials each, with a continuous predictor and a reaction-time response. You could make 20 small multiples, but that’s quite a few and gets unwieldy with more participants, and 500 points per panel would be pretty messy anyway. The total amount of data can’t fit on one graph or even on small multiples. After a check for consistency at the subject level, I should think that plotting predicted values from the continuous predictor as a line, with predicted values from a version where the predictor is made categorical as points, would result in a reasonable plot (a rough sketch of what I mean is below). Actually, I think those points would be much the same as binned residuals, but with a slightly different meaning, showing the cost-benefit relationship of fit vs. simplicity.
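    Something like this, as a sketch with simulated data (lme4 for the multilevel fit; the points come from refitting with the predictor cut into categories):

    library(lme4)
    library(ggplot2)
    set.seed(1)

    # simulated repeated-measures data: 20 subjects x 500 trials
    dat <- expand.grid(subj = factor(1:20), trial = 1:500)
    dat$x <- runif(nrow(dat), -1, 1)                                  # continuous predictor
    subj_eff <- rnorm(20, 0, 30)                                      # subject intercepts
    dat$rt <- 500 + 80 * dat$x + subj_eff[as.integer(dat$subj)] + rnorm(nrow(dat), 0, 60)

    # continuous-predictor model: population-level prediction drawn as a line
    fit_cont <- lmer(rt ~ x + (1 | subj), data = dat)
    newd <- data.frame(x = seq(-1, 1, length.out = 100))
    newd$pred <- predict(fit_cont, newdata = newd, re.form = NA)

    # same data with the predictor made categorical: predictions drawn as points
    dat$xcat <- cut(dat$x, breaks = 10)
    fit_cat <- lmer(rt ~ xcat + (1 | subj), data = dat)
    pts <- data.frame(xcat = factor(levels(dat$xcat), levels = levels(dat$xcat)))
    pts$x <- tapply(dat$x, dat$xcat, mean)                            # bin centers for plotting
    pts$pred <- predict(fit_cat, newdata = pts, re.form = NA)

    ggplot() +
      geom_line(data = newd, aes(x, pred)) +
      geom_point(data = pts, aes(x, pred))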

  8. The central problem I see with the graphs is that once you’ve drawn regression lines over the binned data, people are going to assume that those regression relationships somehow reflect the analysis that was actually done on the individual values. So if you are going to show the binned scatterplots, don’t plot the regression lines. But preferably, show the scatterplots for all the individual data and do not shrink from the fact that there really is a whole lot more variability in your data. Quit trying to depict your modeled relationships as if they are more precise than they really are. Graphs should be used to illuminate the real variation not to try and grab attention by presenting the modeled relationships as more precise than they really are.

    • Graphs should be used to illuminate the real variation not to try and grab attention by presenting the modeled relationships as more precise than they really are.

      This sounds a bit too strong.

      I agree that binning can mask underlying variation, but binning is also very useful for showing average relationships. For instance, average temperatures over the course of a year are commonly graphed and are very informative about a city’s climate. https://usclimatedata.com/

      • In general, I think giving averages without also giving some indication of variation is a bad idea. For example, average temperatures do give some information about a city’s climate, but for many practical purposes (e.g., heating and cooling costs), knowing the variation is also important.
