## Simple graph WIN: the example of birthday frequencies

From Chris Mulligan:

The data come from the Centers for Disease Control and cover the years 1969-1988. Chris also gives instructions for how to download the data and plot them in R from scratch (in 30 lines of R code)!

### And now, the background

A few months ago I heard about a study reporting that, during a recent eleven-year period, more babies were born on Valentine’s Day and fewer on Halloween compared to neighboring days:

I wrote,

What I’d really like to see is a graph with all 366 days of the year. It would be easy enough to make. That way we could put the Valentine’s and Halloween data in the context of other possible patterns. While they’re at it, they could also graph births by day of the week and show Thanksgiving, Easter, and other holidays that don’t have fixed dates. It’s so frustrating when people only show part of the story.

I was pointed to some tables:

and a graph from Matt Stiles:

The heatmap is cute but I wanted to see the whole year’s pattern, not broken down by month, and I wanted a graph that showed quantitative patterns. Chris Mulligan’s graph (see top of this post) was much better from my perspective.

– As Chris noted, Valentine’s Day and Halloween do show up but just barely.

– You can also see dips around the Labor Day and Thanksgiving weekends, which are spread out a bit because the dates vary for these day-of-week holidays.

– I’d consider rescaling the y-axis so that the red line = 100; that would make it easier for me to get a grip on the scale of the variation.

– I don’t get anything out of the lowess line but it was a clever way for Chris to pull out some extreme dates automatically. (It was my idea to multiply the 29 Feb counts by 4.)

– The data could be cleaned even further. Here’s how I’d start: go back to the data for all the years and fit a regression with day-of-week indicators (Monday, Tuesday, etc.), then take the residuals from that regression and pipe them back into Chris’s program to make a cleaned-up graph. It’s well known that births are less frequent on the weekends, and unless your data happen to be an exact 28-year period (the span over which each date lands on each weekday equally often), you’ll get imbalance, which I’m guessing is driving a lot of the zigzagging in the graph above.

– The next step would be to go back to some of the questions raised in recent years by economists who have noticed different patterns of birthdays (and thus of conceptions) as a function of age and education levels of parents. I don’t know what’s been done on this since 2009.

– It might be cute to display the graph in a circle, to connect 31 Dec – 1 Jan, but I don’t recommend it, as this would come close to destroying our ability to see the annual pattern in the data.
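Two of the suggestions above, the day-of-week regression and the rescaling to a mean of 100, can be sketched in a few lines. This is a minimal sketch in Python rather than Chris’s R. The record layout (a list of `(date, births)` pairs) and the function names are assumptions made for illustration; they are not the layout of the CDC files or anything from Chris’s program. With only day-of-week indicators in the regression, the least-squares fitted value for each day is that weekday’s mean, so the residual is the count minus its weekday mean.

```python
from collections import defaultdict
from datetime import date


def day_of_week_residuals(records):
    """Residuals from a regression on day-of-week indicators only.

    records: iterable of (datetime.date, births) pairs, one per calendar day
    (a hypothetical layout). Returns a list of (date, residual) pairs.
    """
    records = list(records)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, births in records:
        totals[day.weekday()] += births
        counts[day.weekday()] += 1
    # Fitted value for each weekday is simply that weekday's mean count.
    weekday_mean = {w: totals[w] / counts[w] for w in totals}
    return [(day, births - weekday_mean[day.weekday()]) for day, births in records]


def index_to_100(values):
    """Rescale a series so that its mean equals 100."""
    values = list(values)
    mean = sum(values) / len(values)
    return [100.0 * v / mean for v in values]


# Toy data (made up): two weeks in which two days of each week run lower.
toy = [(date(1988, 2, 1 + i), 100 if i % 7 < 5 else 80) for i in range(14)]
residuals = day_of_week_residuals(toy)
```

The residuals average zero, so to put them on the mean = 100 scale one would add the overall mean back before calling `index_to_100`.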

### The moral of the story

– The direct time-series graph showed patterns clearly, allowing us to make qualitative and quantitative comparisons much better than were possible using the cute heat map or the tables.

– High-resolution graphics can make a difference, even for a problem as simple as displaying a sequence of 366 numbers.

P.S. More here and here.

1. Thanks for the feedback!

>- I’d consider rescaling the y-axis so the red line=100, then it would be easier for me to get a grip on the scale of the variation.

I agree, that would be better.

>- The data could be cleaned even further. Here’s how I’d start: go back to the data for all the years and fit a regression with day-of-week indicators (Monday, Tuesday, etc), then take the residuals from that regression and pipe them back into Chris’s program to make a cleaned-up graph. It’s well known that births are less frequent on the weekends, and unless your data happen to be an exact 28-year period, you’ll get imbalance, which I’m guessing is driving a lot of the zigzagging in the graph above.

This is really helpful. I came to the conclusion that day of week effects were probably driving a lot of the noise, but didn’t come up with a quick way to deal with that given the nature of the data I had.

>- It might be cute to display the graph in a circle, to connect 31 Dec – 1 Jan, but I don’t recommend it, as this would come close to destroying our ability to see the annual pattern in the data.

I thought about doing exactly that at one point, and came to the same conclusion. Cute, but totally useless. Perhaps adding December to the beginning of the graph, and January to the end, would give you the benefits of the circle graph?

• Andrew says:

Chris:

Yes, maybe a 14-month graph would be a good idea (perhaps using a gray background for the overhanging months to make visually clear what is happening).

• I added a new graph to the bottom of that post that incorporates some of the cosmetic changes.

• D.O. says:

> I came to the conclusion that day of week effects were probably driving a lot of the noise, but didn’t come up with a quick way to deal with that given the nature of the data I had.

What’s the problem? You can just add up all births for each day of the week and weight the data to erase differences. The public dataset even includes day-of-the-week as one of the fields. I do not know R, but it would be just a couple of lines of MATLAB code.
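One reading of this reweighting, sketched in Python rather than MATLAB or R: give each weekday a weight equal to the overall mean divided by that weekday’s mean, and multiply every count by its weekday’s weight. The `(date, births)` layout and the function names are assumptions for illustration, not the format of the public dataset.

```python
from collections import defaultdict
from datetime import date


def weekday_weights(records):
    """Weight per weekday: overall mean count divided by that weekday's mean."""
    records = list(records)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, births in records:
        totals[day.weekday()] += births
        counts[day.weekday()] += 1
    overall_mean = sum(totals.values()) / sum(counts.values())
    return {w: overall_mean / (totals[w] / counts[w]) for w in totals}


def reweight(records):
    """Multiply each day's count by its weekday weight."""
    records = list(records)
    weights = weekday_weights(records)
    return [(day, births * weights[day.weekday()]) for day, births in records]


# Toy data (made up): two weeks in which two days of each week run lower.
toy = [(date(1988, 2, 1 + i), 100 if i % 7 < 5 else 80) for i in range(14)]
adjusted = reweight(toy)
```

After reweighting, every weekday has the same mean, so whatever variation remains is not a pure day-of-week effect. In the toy data the only variation is by weekday, so every adjusted value equals the overall mean.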

2. Christian Hennig says:

To me the smoothed line looks surprisingly bad, being, for example, quite consistently above the data between April and July, a period that doesn’t look particularly difficult to smooth. I won’t meditate long on its definition to see whether I understand what is going on here, but to me it is certainly more of a misleading distraction than a help.

• Andrew says:

Christian:

I agree that the smoothed line isn’t so good. It served the convenient purpose of allowing Chris’s program to automatically identify some outliers but I don’t think it helps to include the line on the graph.

3. Bill Harris says:

You suggest doing a by-day-of-the-week regression to reduce some of the noise in the data. In the signal processing world, where you don’t necessarily have such an obvious covariate but you know you have a non-integral number of periods of data (in this case, a non-integral number of weeks in a year), the common approach is windowing: multiplying the entire data record by some function of time that reduces the effect of the end data points as compared to those in the middle. Search on Blackman, Hanning, and Hamming (the Hamming distance between the last two names is uncomfortably small!) for names of common windows and thus references to the larger field.

I don’t have time to try it now, but I wonder how windowing would stack up against your by-day regression. I suspect the regression would work better for smaller size samples, while the windowing might work better for cases where the number of unmodeled random effects were large or their covariates largely unknown.

Any thoughts on this? Any experience trying both on the same data?
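For readers unfamiliar with windowing, here is a minimal sketch in Python of the symmetric Hann (Hanning) window applied to a data record. The function names are made up for illustration, and subtracting the mean before windowing is a common convention assumed here, not something specified above. Whether this would beat the by-day regression on the births data is not tested.

```python
import math


def hann_window(n):
    """Symmetric Hann window: w[k] = 0.5 - 0.5 * cos(2*pi*k / (n - 1))."""
    if n < 2:
        return [1.0] * n
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(series):
    """Subtract the mean, then taper the record toward zero at both ends."""
    series = list(series)
    mean = sum(series) / len(series)
    window = hann_window(len(series))
    return [w * (x - mean) for w, x in zip(window, series)]


window = hann_window(5)          # zero at the ends, one in the middle
tapered = apply_window([3.0, 5.0, 4.0, 6.0, 2.0])
```

The taper shrinks the contribution of the first and last points, which is what reduces the effect of a record that does not span a whole number of periods.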

• Bill Harris says:

By “smaller size samples,” I meant samples that contained fewer full periods of the lowest frequency component in the data.

4. It may be worth mentioning that the heatmap by Matt Stiles does not represent frequencies: the shades correspond to the ranks of the days (which I thought was a bit of an odd thing to show in a heatmap).

I’ve just seen that he added a clarification post: http://thedailyviz.com/2012/05/18/how-common-is-your-birthday-pt-2/

5. Anom says:

Great post! Can I ask you to post the code for the heatmap graph?

• Andrew says:

No—it’s the time series plot I like, not the heatmap!

6. […] commented on the graph and had some constructive feedback. I made a few cosmetic changes in response: […]

7. Fun data! The lowess smoothed curve should probably be smoothed w.r.t. the periodic data, so that the smoothed value for January includes values from December and vice versa.

8. Tom says:

I’d be interested to see this broken down by region (or state) – are the same trends true across different parts of the US? – is the increase in births in September quite as pronounced in Southern states where I presume there is less chance of being snowed in…?

Are the general trends the same in the Southern hemisphere? (is the data available for Australia?)

9. K? O'Rourke says:

Very nice to be able to copy, paste and play with this!

I believe there should always be graphs very close to the raw data with next to no processing.

Here, if only for private view by the analyst, add on the individual year counts (maybe times the number of years).

The meta-analysis group I worked with suggested a plot for that purpose, which has become known as the L’Abbe plot, but many groups avoid using it for reasons I find hard to grasp – why not look at and perhaps show the _raw_ data?

By the way, the plot was suggested by DF Andrews (of Andrews plot) and revised by AS Detsky (who now produces Broadway musicals, no doubt similar to managing clinical research groups) but the plot was named after the first author of the meta-analysis paper, who has since changed her name.

10. Bill Harris says:

Didn’t Tukey write somewhere about fitted lines added to scatterplots as being propaganda?

• Nick Cox says:

Tukey did talk about propaganda graphs as one kind of graph in the Snedecor Festschrift. That is accessible at

http://www.edwardtufte.com/tufte/tukey

But I don’t think his term would include fitting lines to scatterplots so long as the data remained visible. Mind you, although showing the fitted line only would widely be regarded as dubious in a regression context, many ANOVA people routinely plot only fitted values.

11. Jon Peltier says:

The lowess curve should be continuous from December to January. How I’ve done this in the past is to plot the data twice, duplicating July to December at the beginning and duplicating January to June at the end. Run the lowess using all this data, then crop out the extra six months at beginning and end.
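The pad-smooth-crop procedure can be sketched as follows. This is an illustration in Python under stated assumptions: a plain centered moving average stands in for lowess so that the example has no dependencies, and the function names are hypothetical. Any smoother that takes a list and returns a list of the same length could be passed in instead.

```python
def moving_average(values, half_width):
    """Centered moving average; the window shrinks at the ends of the list."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def periodic_smooth(values, half_width, smoother=moving_average):
    """Pad with the last and first halves of the series, smooth, then crop."""
    values = list(values)
    n = len(values)
    pad = n // 2
    # Second half of the year goes in front, first half goes at the end.
    padded = values[n - pad:] + values + values[:pad]
    smoothed = smoother(padded, half_width)
    return smoothed[pad:pad + n]


# Toy series (made up): a spike on the last day of the "year".
toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 12]
wrapped = periodic_smooth(toy, half_width=1)
unwrapped = moving_average(toy, half_width=1)
```

In the wrapped version the first day’s smoothed value picks up the spike from the last day, so the curve joins up across the year boundary. In the unwrapped version it does not. The smoothing span should be no wider than the padding for the crop to remove all edge effects.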

I understand how caesarean and induced births would vary on certain days – Halloween and Valentine’s Day, but more notably weekends and holidays like Thanksgiving, July 4, Christmas, New Year’s, and ironically, Labor Day – because patients and doctors would rather be home on the holidays. I can’t think of a mechanism for the bump in natural births, unless the reporting is subjective.

• Jon Peltier says:

And I found that heatmap next to useless compared to the line chart.

12. […] writes: Here’s my version of the birthday frequency graph. I used Gaussian process with two slowly varying components and periodic component with decay, so […]

13. Rick Wicklin says:

I’ve blogged about the “holiday effect” of births in the US. I used a box plot to show births by day of the week, and the outliers are all holidays. The discussion of holidays is here: http://blogs.sas.com/content/iml/2011/09/16/the-effect-of-holidays-on-us-births/
You can find a discussion of the yearly cycle, along with references and a scatter plot with a (periodic) smoother at http://blogs.sas.com/content/iml/2011/09/09/the-most-likely-birthday-in-the-us/

14. […] Andrew Gelman took ‘Births by day of year’ and analyzed few ways of presenting the data – how to present data understandably […]

15. […] Some graphs of number of births by day of year. […]