## Slick time series decomposition of the birthdays data

Here is my plot using the full time series data to make the model.
Data analysis could be made in many different ways, but my hammer is Gaussian process, and so I modeled the data with a Gaussian process with six components
1) slowly changing trend
2) 7 day periodical component capturing day of week effect
3) 365.25 day periodical component capturing day of year effect
4) component to take into account the special days and interaction with weekends
5) small time scale correlating noise
6) independent Gaussian noise

– Day of the week effect has been increasing in 80’s
– Day of year effect has changed only a little during years
– 22nd to 31st December is strange time

I [Aki] will make the code available this week, but we have to first make new release of our GPstuff toolbox, as I used our development code to do this.

I have no idea what’s going on with 29 Feb; I wouldn’t see why births would be less likely on that day. Also, the above graphs are great, but I think the ideal model would have some automatic “ringing” to balance out the highs with the lows. For example, if there are fewer births on 4 Jul, you’d expect to see more on 2-3 Jul and 5-6 Jul.

1. Stephen McKay says:

Feb 29th – I think parents are keen to avoid it given the apparent lack of 3/4 of birthdays for their children to celebrate! Personally, with a child born on Feb 25th in a leap year, I thought it might be a nice & distinctive feature – but others around me were “concerned”. In any case the effect is small-ish? Maybe the number of elective Caesareans could be a factor for the low numbers on these dates?

2. Peter says:

“I have no idea what’s going on with 29 Feb; I wouldn’t see why births would be less likely on that day”

Because having only one birthday party in four years sucks!

• My wife’s brother-in-law was born on Feb. 29; the upside is that at his age he can claim to have had only 18 (I think) birthdays. He’s just entering his best years :-)

3. […] We’re Gonna Have A Good Time Posted on June 19, 2012 by reflectionephemeral var addthis_product = 'wpp-264'; var addthis_config = {"data_track_clickback":true,"data_track_addressbar":false};if (typeof(addthis_share) == "undefined"){ addthis_share = [];}Super sweet graphs at Andrew Gellman’s site on the incidence of birthdays: […]

4. Dan says:

I was born on February 28 and when people find out they often say to me, “Oh wow, you were almost a leap-year baby!” And then I have to reply, “Actually, I was almost a March 1st baby.” They always find that to be disappointing.

5. John Christie says:

Cool, how about a plot indicating conception special dates. It seems clear that the late Sept. early Oct. peak is New Year’s babies.

6. Guy says:

On the increasing effect for day of the week, might this be related to increasing use of c-sections?

7. zbicyclist says:

Re leap day: Note in the data set provided in the post last week there were some errors, e.g. June 31 or Sept 31. These were small relative to the total amount of data but did indicate the presence of a certain amount of sloppiness (in the original data, not in the blog posters). So are there leap day babies who are accidentally entered into this data source as Feb 28 or March 1?

8. CM says:

Although I mostly buy the argument that parents want more birthdays for their kid, another reason for a blip on Feb 29 might be that it has fewer observations and therefore a higher sampling error.

9. Descriptions I’ve read of Gaussian Processes always seems very abstract. I have yet to see anyone explain clearly how to represent them in a concrete manner. Obviously the software involved here does use these representations (as some kind of series or something). anyone have a good reference on standard methods for *representing* gaussian processes in terms of functions that one can easily evaluate numerically?

• Corey says:

The definition of a Gaussian process is a stochastic process such that every finite collection of elements in the index set (e.g., if the index set is time, for every finite collection of time points) the random variables corresponding to those indices have a joint multivariate Gaussian distribution. Every Gaussian process comes equipped with a mean function and a covariance function; these functions are evaluated at the time points to give the mean vector and covariance matrix of the distribution.

My go-to source for this stuff is Gaussian Processes for Machine Learning.

• Awesome, thanks for the link. As I said, the definition of the gaussian process is all right as these things go but the definition tells me nothing much about how to represent draws from this abstract distribution. For example suppose you gave me measurements at 4 time points, and a known covariance function, how could I draw 300 sample *functions* so that the sample functions were IID draws from the gaussian process defined by the measurements… I assume the book you reference will answer that sort of question. Thanks again.

• NU says:

Daniel,

You sample from a multivariate normal distribution conditional on the 4 measured points. See Wikipedia for the formula for the conditional normal distribution (expressed in term of the mean and covariance of the distribution to sample from). In this case Sigma_22 will be the covariance between the 4 training points, Sigma_11 will be the covariance between the prediction points (however many you want), and Sigma_12 will be the covariance between the 4 measurements and the all prediction points. mu_1 and mu_2 will be the prior means (here assumed known and fixed) of the prediction and measured points.

I could post a few lines of R code to do this if you have trouble implementing it.

• I think I could probably implement that with a little fooling around, but it does emphasize my point that although people typically talk about these processes as abstract distributions on function spaces, the truth is most people work with evaluations at some pre-defined finite collection of points. That isn’t necessarily the only way I could think of wanting to deal with these objects.

Off the top of my head, suppose you are solving a differential equation via an adaptive step-size method, and you want to represent one of the forcing terms as a gaussian process in time subject to some measurements, some measurement error, and some idea of the smoothness of the function. The time points you need are potentially not known in advance as they depend on the choices that the adaptive stepsize method makes. You’d like to draw a few hundred *functions* which approximate the gaussian process and can be evaluated at *any* time points. I’ll take a look at the referenced book. I’m sure there are series representations, but I’m not so sure how tractable they are to work with.

• NU says:

The point of Gaussian processes is that you use the covariance function to determine the covariance matrix at whatever finite set of points you care about, and then do the conditional inference described above. If you pick a new set of points, plug them into your covariance function and get a new matrix.

10. Jonathan says:

Linked to by the Freakonomics guys:
“Study: You’re Most Likely To Die On Your Own Birthday”
http://newsfeed.time.com/2012/06/12/study-youre-most-likely-to-die-on-your-own-birthday/

11. Greg says:

This series of posts has been my favorite since I started reading this blog. Seeing the graphs develop has been fascinating. More like this please!

12. Aki Vehtari says:

When experimenting with the model and looking the residuals I did
not see much “ringing” except at the end of December. Changes in the birth
rate due to the special days are very likely due to c-sections. I
think it is likely that due to surgery capacity of hospitals, the
c-sections which are not made at special days usually even out to
many days. Exception is the end of December as there is wider
upward bump before Christmas (in the Day of year effect graph) and
sharp upward bump between Christmas and New Year.

The Matlab code for my figures is now available at
http://becs.aalto.fi/en/research/bayes/gpstuff/demo_births.m
You will also need 1) the cvs files provided by Chris Mulligan and
relase v3.3 http://becs.aalto.fi/en/research/bayes/gpstuff/ and 3)
tight_subplot from the Mathworks file exchange.

I’m not claiming that Gaussian processes would be the best tool for
this analysis (e.g. dynamic models with Kalman filter type
computation would be better), but since I am familiar with GP, it
was more about having fun and to test whether I could see something