Aki writes:
Here’s my version of the birthday frequency graph. I used a Gaussian process with two slowly varying components and a periodic component with decay, so that the periodic form can change over time. I used Student’s t distribution as the observation model to allow exceptional dates to be outliers. I guess the periodic component due to the week effect is still in the data because there are only twenty years of data. Naturally it would be better to model the whole time series, but it was easier to just use the csv by Mulligan.
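For readers curious about the “periodic component with decay”: one common construction multiplies a periodic kernel by a slowly decaying squared-exponential envelope, so the periodic shape can drift over the years. A minimal numpy sketch of that idea (a generic illustration, not the GPstuff code; the function name and all hyperparameter values below are made up):

```python
import numpy as np

def quasi_periodic_kernel(t1, t2, period=7.0, ell_per=1.0,
                          ell_decay=365.0, sigma2=1.0):
    """Periodic kernel times a squared-exponential 'decay' envelope.

    The envelope lets the periodic pattern change slowly over time.
    Hyperparameter values are illustrative only.
    """
    d = t1[:, None] - t2[None, :]
    periodic = np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell_per ** 2)
    envelope = np.exp(-0.5 * d ** 2 / ell_decay ** 2)
    return sigma2 * periodic * envelope

t = np.arange(30.0)                 # 30 days
K = quasi_periodic_kernel(t, t)
# points 7 days apart (same weekday) correlate far more strongly
# than points 3 days apart:
print(K[0, 7] > K[0, 3])            # True
```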
All I can say is . . . wow. Bayes wins again. Maybe Aki can supply the R or Matlab code?
P.S. And let’s not forget how great the simple and clear time series plots are, compared to various fancy visualizations that people might try.
P.P.S. More here.
please do share code; my students are trying to do similar things these days with variable stars…
ps Can we all please start using ISO 8601 http://en.wikipedia.org/wiki/ISO_8601 for dates? Standards are good.
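(A quick illustration of how little the standard costs: Python’s standard library, for one, emits ISO 8601 directly.)

```python
import datetime

d = datetime.date(1978, 8, 6)
print(d.isoformat())        # 1978-08-06  (ISO 8601)
print(d.strftime("%m-%d"))  # 08-06       (month-day only, as in these plots)
```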
Thanks for compliments!
Our Gaussian process code for Matlab and R-interface is available at
http://www.becs.aalto.fi/en/research/bayes/gpstuff/
Since you are interested, in the next few days we will also add to that web page links to the specific code that produces these estimates and figures with Matlab, and just the estimates with R (because we are not good with R graphics)
I also prefer ISO 8601 dates and will use the ISO 8601-like format MM-DD in the next version (MM-DD itself does not seem to be part of the standard)
It’s nice but… errr… how exactly does “Bayes win” by this? A frequentist can do pretty much the same thing, can’t he?
By the way, for lovers of the t-distribution: What are the degrees of freedom and how chosen? (Bayes may be somewhere in here but one could do without, I guess…)
Christian:
Just about anything that can be done using statistical method X can be done, with enough effort, using statistical method Y. Nonetheless, the actual computation was performed using a particular method, and I think that’s where the credit should go.
I don’t see where there is any “value added” by Bayes here at all, though. Isn’t this just plain good probability modelling with no use for priors and posteriors (OK, I could imagine where they are but what is lost if they weren’t there)? Am I missing something?
Christian:
The Gaussian process model is a prior distribution for the underlying series. In strict classical inference, you’re not allowed to assign a probability distribution for the parameters of interest. Again, you can do whatever you want and you don’t have to call it Bayesian—you can call it “regularization” or whatever—but this particular calculation happens to have been done using Bayesian methods.
I followed Chris Mulligan’s example and extracted the full time series from BigQuery.
http://www.mechanicalkern.com/static/birthdates-1968-1988.csv
The day_of_year column is coded 1-366 as if every year had a leap day: Feb 28 is always 59, Mar 1 is always 61, and Dec 31 is always 366. The day_of_week column is coded 1-7, Mon-Sun.
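Given that coding, the 1-366 index can be decoded against any real leap year. A small sketch (Python; the leap year 2000 is an arbitrary reference and the function name is mine):

```python
import datetime

def decode_day_of_year(day_of_year):
    """Map the 1-366 'every year has a leap day' index to (month, day).

    Uses 2000, an arbitrary leap year, as the reference calendar.
    """
    d = datetime.date(2000, 1, 1) + datetime.timedelta(days=day_of_year - 1)
    return d.month, d.day

print(decode_day_of_year(59))   # (2, 28)  Feb 28
print(decode_day_of_year(61))   # (3, 1)   Mar 1
print(decode_day_of_year(366))  # (12, 31) Dec 31
```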
Have fun!
Thanks for assembling this, but the site seems to be inaccessible at the moment.
This is fantastic, Aki. Most interesting is that the residual suggests the ~12/30 spike isn’t as severe as I thought.
Robert: thanks a lot for publishing the data… it had me entertained for a couple of hours!
In case someone wants to play with R/ggplot2, i found the following charts interesting:
data = read.csv("birthdates-1968-1988.csv")
data$weekend = data$day_of_week > 5
weekdays = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
data$day_of_week = factor(weekdays[data$day_of_week], ordered=TRUE, levels=weekdays)
data$serial_date = 1:nrow(data)
# centered 7-day moving average
for (i in (1+3):(nrow(data)-3))
  data[i,"weekly"] = sum(data[(i-3):(i+3),"births"])/7
# centered 364-day average (52 whole weeks), excluding the day itself
for (i in (1+182):(nrow(data)-182))
  data[i,"annual"] = (sum(data[(i-182):(i-1),"births"])+sum(data[(i+1):(i+182),"births"]))/364
# full series colored by weekday, with the weekly and annual moving averages overlaid
ggplot(data=data, aes(y=births, x=serial_date, color=day_of_week, alpha=0.9)) + geom_point() +
  geom_line(aes(y=annual, size="a", color="Mon")) + geom_line(aes(y=weekly, size="b", color="Tue")) +
  scale_color_brewer(palette="Set2") + guides(alpha="none", size="none") + scale_size_manual(values=c(1.5,0.75))
# day-of-week effect: births relative to the local weekly average
ggplot(data=data, aes(y=births/weekly, x=serial_date)) + geom_line() +
  stat_smooth(method="lm", se=FALSE) + facet_wrap(~day_of_week, nrow=1)
# seasonal effect: births relative to the local annual average, weekdays vs weekends
ggplot(data=data, aes(y=births/annual, x=day_of_year, color=day_of_week)) + geom_point(alpha=0.8) +
  geom_vline(xintercept=c(unique(data[data$day==1,"day_of_year"])-0.5, 366.5)) + facet_wrap(~weekend, ncol=1) +
  scale_color_brewer(palette="Set2") + guides(alpha="none") + theme(strip.text.x=element_blank())
Shouldn’t the smoothed graph be constrained to be smooth across the year boundary? It seems like that would make the residual at the very end of the year bigger, which might gratify Chris :-).
Integration over the latent values was done using robust expectation propagation as described in http://www.jmlr.org/papers/volume12/jylanki11a/jylanki11a.pdf. The covariance function and likelihood parameters were estimated by optimizing the marginal posterior. The figure shown here was made using degrees of freedom nu=2, while the optimized nu was about 1.6 and produced a similar figure.
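The robustness of the Student-t observation model is easy to see from its log density, which penalizes outliers logarithmically rather than quadratically. A self-contained sketch (plain Python, nothing GPstuff-specific; standardized densities, with nu=2 as in the figure):

```python
import math

def student_t_logpdf(x, nu=2.0, mu=0.0, sigma=1.0):
    """Log density of a Student-t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def normal_logpdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z

# A 6-sigma "exceptional date" is penalized far less under the t model,
# so it barely drags the fitted curve toward itself:
print(normal_logpdf(6.0))        # about -18.92
print(student_t_logpdf(6.0))     # about -5.46
```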
Smoothing should not be done across the year boundary, as there seems to be an increasing trend in the number of births. This is more obvious in the full time series. Using the full time series will help get better estimates for the dates near the year boundary.
I’ll check the full time series data next week and will provide then code for both the above figure and full time series.
The Matlab code for my figures is now available at http://becs.aalto.fi/en/research/bayes/gpstuff/demo_births.m You will also need 1) the csv files provided by Chris Mulligan and Robert Kern (links in the code comments), 2) the latest GPstuff release v3.3 http://becs.aalto.fi/en/research/bayes/gpstuff/ and 3) tight_subplot from the Mathworks file exchange.
I’ve just spent today revising a paper I’m writing where we look at a joint annual and daily trend for some air quality data. I reckon you could use a tensor product of a cyclic B-spline for the annual trend and a cyclic B-spline or cyclic random walk model for the weekly trend. I’m not totally comfortable with the periodic term presented above. Any excess temporal variation could be captured with an AR(1) error model. If I remember I’ll try to have a look at this come next week.
I would imagine that nearly the same decomposition (which is very informative) could be obtained from the stl code (seasonal-trend decomposition using loess) of Cleveland et al.; I’ve only used the Fortran version, but it is in R:
http://stat.ethz.ch/R-manual/R-devel/library/stats/html/stl.html
Yes, absolutely; it is very informative, especially considering the minute effort involved. Using Mulligan’s code it is simply:
# weekly decomposition of Mulligan's smoothed series
t <- ts(100*bdata$smoothbirths/mean(bdata$smoothbirths), frequency=7)
plot(stl(t, s.window=7))