Lorraine Denby and Colin Mallows write:
It is usual to choose to make the bins in a histogram all have the same width. One could also choose to make them all have the same area. These two options have complementary strengths and weaknesses: the equal-width histogram oversmooths in regions of high density and is poor at identifying sharp peaks; the equal-area histogram oversmooths in regions of low density and so does not identify outliers. We describe a compromise approach which avoids both of these defects. We argue that relying on asymptotics of the Integrated Mean Square Error leads to inappropriate recommendations.
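To make the two options in the abstract concrete, here is a minimal Python sketch of equal-width and equal-area binning. It is only an illustration, not Denby and Mallows's compromise algorithm, which is not described here. The function names and the toy data (a sharp peak plus one far outlier) are made up for this example, and "equal area" is approximated by putting roughly the same number of points in each bin.

```python
import random

def equal_width_edges(data, k):
    """k bins of identical width spanning the data range."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k)] + [hi]

def equal_area_edges(data, k):
    """k bins holding roughly the same number of points, so that bars
    drawn on a density scale all have about the same area."""
    xs = sorted(data)
    n = len(xs)
    # Edge i sits at the i/k empirical quantile (an order statistic).
    return [xs[min(n - 1, (i * n) // k)] for i in range(k)] + [xs[-1]]

def counts(data, edges):
    """Count points per bin; the last bin includes its right edge."""
    k = len(edges) - 1
    out = [0] * k
    for x in data:
        j = 0
        while j < k - 1 and x >= edges[j + 1]:
            j += 1
        out[j] += 1
    return out

random.seed(1)
# Hypothetical data: a sharp peak near 0 plus a single outlier at 10.
data = [random.gauss(0, 0.1) for _ in range(200)] + [10.0]

ew = equal_width_edges(data, 10)
ea = equal_area_edges(data, 10)
print("equal-width counts:", counts(data, ew))
print("equal-area counts: ", counts(data, ea))
print("equal-area widths: ", [round(b - a, 3) for a, b in zip(ea, ea[1:])])
```

With these toy data, the equal-width bins put the entire peak into a single bar, while the equal-area bins resolve the peak but absorb the outlier into one very wide final bin. Those are the two failure modes the abstract describes.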
I’m so glad they wrote this article (it appeared recently in the Journal of Computational and Graphical Statistics)! I’ve thought for a long time that (a) histogram bars are typically too wide (for example, as set by default in software packages such as S and R), and (b) the underlying problem is that people think the goal of a histogram is to closely approximate the density function.
A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That’s why, if you look at the histograms in my books and published articles, I just about always use lots of bins. I also almost never like those kernel density estimates that people sometimes use to display one-dimensional distributions. I’d rather see the histogram and know where the data are.
Denby and Mallows go far beyond my vague thoughts by considering histograms with varying widths and coming up with a particular algorithm. I’d like to try out their method on my own problems. Is there an R package out there?