Cool-ass signal processing using Gaussian processes (birthdays again)

Aki writes:

Here’s my version of the birthday frequency graph. I used a Gaussian process with two slowly varying components and a periodic component with decay, so that the periodic form can change over time. I used Student’s t-distribution as the observation model to allow exceptional dates to be outliers. I guess the periodic component due to the week effect is still in the data because there are only twenty years of data. Naturally it would be better to model the whole time series, but it was easier to just use the csv by Mulligan.

All I can say is . . . wow. Bayes wins again. Maybe Aki can supply the R or Matlab code?
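
In the meantime, here’s a minimal sketch (in R, not Aki’s actual GPstuff code) of the covariance structure he describes: two squared-exponential terms on different length-scales, plus a yearly periodic term damped by a long squared-exponential so the periodic form can drift over time. Every hyperparameter value below is made up for illustration:

    sq_exp <- function(d, magn, ell) magn^2 * exp(-d^2 / (2 * ell^2))
    quasi_periodic <- function(d, magn, ell, period, decay) {
      magn^2 * exp(-2 * sin(pi * d / period)^2 / ell^2) * exp(-d^2 / (2 * decay^2))
    }
    k_birthday <- function(t1, t2) {
      d <- abs(t1 - t2)                                   # distance in days
      sq_exp(d, magn = 1,   ell = 3650) +                 # slow trend
      sq_exp(d, magn = 0.5, ell = 100)  +                 # faster smooth variation
      quasi_periodic(d, magn = 0.3, ell = 1, period = 365.25, decay = 3650)
    }
    K <- outer(1:100, 1:100, k_birthday)                  # covariance for the first 100 days

The Student-t observation model then sits on top of a latent function with this prior, so the exceptional dates Aki mentions don’t drag the smooth components around.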

P.S. And let’s not forget how great the simple and clear time series plots are, compared to various fancy visualizations that people might try.

P.P.S. More here.

19 thoughts on “Cool-ass signal processing using Gaussian processes (birthdays again)”

  1. Pingback: Simple graph WIN: the example of birthday frequencies « Statistical Modeling, Causal Inference, and Social Science

  2. Thanks for the compliments!

    Our Gaussian process code for Matlab and R-interface is available at
    http://www.becs.aalto.fi/en/research/bayes/gpstuff/

    Since you are interested, we will in the next few days add to that web page links to the specific code to produce these estimates and figures with Matlab, and just the estimates with R (because we are not good with R graphics).

    I also prefer ISO 8601 dates and will use the ISO 8601-like format MM-DD in the next version (MM-DD itself does not seem to be part of the standard).
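
    For R users, base R can already produce that MM-DD form in one line:

    format(as.Date("1970-02-14"), "%m-%d")   # "02-14"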

  3. It’s nice but… errr… how exactly does “Bayes win” by this? A frequentist can do pretty much the same thing, can’t he?
    By the way, for lovers of the t-distribution: What are the degrees of freedom and how chosen? (Bayes may be somewhere in here but one could do without, I guess…)

    • Christian:

      Just about anything that can be done using statistical method X can be done, with enough effort, using statistical method Y. Nonetheless, the actual computation was performed using a particular method, and I think that’s where the credit should go.

      • I don’t see where there is any “value added” by Bayes here at all, though. Isn’t this just plain good probability modelling with no use for priors and posteriors (OK, I could imagine where they are but what is lost if they weren’t there)? Am I missing something?

        • Christian:

          The Gaussian process model is a prior distribution for the underlying series. In strict classical inference, you’re not allowed to assign a probability distribution for the parameters of interest. Again, you can do whatever you want and you don’t have to call it Bayesian—you can call it “regularization” or whatever—but this particular calculation happens to have been done using Bayesian methods.
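
          For the curious, the two readings share one formula; a quick numeric sketch (arbitrary kernel and made-up numbers, just to show the equivalence):

          set.seed(1)
          x <- sort(runif(30)); y <- sin(2 * pi * x) + rnorm(30, sd = 0.3)
          K <- outer(x, x, function(a, b) exp(-(a - b)^2 / (2 * 0.1^2)))  # SE kernel
          lambda <- 0.3^2   # noise variance in the Bayesian reading, penalty in the other
          f_hat <- drop(K %*% solve(K + lambda * diag(30), y))
          # Bayes: posterior mean under the GP prior with noise variance lambda.
          # Regularization: minimizer of sum((y - f)^2) + lambda * t(f) %*% solve(K) %*% f.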

  4. Robert: thanks a lot for publishing the data… it had me entertained for a couple of hours!

    In case someone wants to play with R/ggplot2, I found the following charts interesting:

    # read the daily birth counts and add helper columns
    data = read.csv("birthdates-1968-1988.csv")
    data$weekend = data$day_of_week > 5
    weekdays = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
    data$day_of_week = factor(weekdays[data$day_of_week], ordered=TRUE, levels=weekdays)
    data$serial_date = 1:nrow(data)
    # centered 7-day moving average of births
    for (i in (1+3):(nrow(data)-3))
      data[i,"weekly"] = sum(data[(i-3):(i+3),"births"])/7
    # centered 364-day moving average, excluding the day itself
    for (i in (1+182):(nrow(data)-182))
      data[i,"annual"] = (sum(data[(i-182):(i-1),"births"])+sum(data[(i+1):(i+182),"births"]))/364

    # all points colored by weekday, with the annual and weekly moving averages overlaid
    ggplot(data=data,aes(y=births,x=serial_date,color=day_of_week,alpha=0.9))+geom_point()+geom_line(aes(y=annual,size="a",color="Mon"))+geom_line(aes(y=weekly,size="b",color="Tue"))+scale_color_brewer(palette="Set2")+guides(alpha="none",size="none")+scale_size_manual(values=c(1.5,0.75))

    # births relative to the weekly average, one panel per weekday
    ggplot(data=data,aes(y=births/weekly,x=serial_date))+geom_line()+stat_smooth(method="lm",se=FALSE)+facet_wrap(~day_of_week,nrow=1)

    # births relative to the annual average by day of year, weekdays vs. weekends
    ggplot(data=data,aes(y=births/annual,x=day_of_year,color=day_of_week))+geom_point(alpha=0.8)+geom_vline(xintercept=c(unique(data[data$day==1,"day_of_year"])-0.5,366.5))+facet_wrap(~weekend,ncol=1)+scale_color_brewer(palette="Set2")+guides(alpha="none")+theme(strip.text.x=element_blank())
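
    As an aside, the two moving-average loops above can be vectorized with stats::filter (same quantities, assuming the same data frame):

    data$weekly = as.numeric(stats::filter(data$births, rep(1/7, 7), sides=2))
    data$annual = as.numeric(stats::filter(data$births, c(rep(1,182), 0, rep(1,182))/364, sides=2))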

  5. Shouldn’t the smoothed graph be constrained to be smooth across the year boundary? It seems like that would make the residual at the very end of the year bigger, which might gratify Chris :-).

  6. Integration over the latent values was done using robust expectation propagation, as described in http://www.jmlr.org/papers/volume12/jylanki11a/jylanki11a.pdf. Covariance function and likelihood parameters were estimated by optimizing the marginal posterior. The figure shown here was made using degrees of freedom nu=2; the optimized nu was about 1.6, producing a similar figure.
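
    (A quick illustration of what such a small nu buys, added here for readers rather than taken from the original analysis: the t’s heavy tails penalize a far-out observation much less than a Gaussian does.)

    nu <- 2
    dnorm(6, log=TRUE) - dnorm(1, log=TRUE)              # -17.5: a Gaussian hates 6-sigma points
    dt(6, df=nu, log=TRUE) - dt(1, df=nu, log=TRUE)      # about -3.8: the t barely minds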

    Smoothing should not be done across the year boundary, as there seems to be an increasing trend in the number of births. This is more obvious in the full time series. Using the full time series will help to get better estimates for the dates near the year boundary.

    I’ll check the full time series data next week and will then provide code for both the above figure and the full time series.

  7. I’ve just spent today revising a paper I’m writing where we look at a joint annual and daily trend for some air quality data. I reckon you could use a tensor product of a cyclic B-spline for the annual trend and a cyclic B-spline or cyclic random walk model for the weekly trend. I’m not totally comfortable with the periodic term presented above. Any excess temporal variation could be captured with an AR(1) error model. If I remember I’ll try to have a look at this come next week.
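
    A rough sketch of that suggestion with mgcv (my reading of it, not the commenter’s code; it assumes a numeric day_of_week_num column alongside day_of_year):

    library(mgcv)
    # tensor product of two cyclic smooths: annual cycle x weekly cycle
    fit <- gam(births ~ te(day_of_year, day_of_week_num, bs=c("cc","cc"), k=c(20,5)),
               data=data, method="REML")
    # AR(1) errors could be layered on by fitting the same smooths with mgcv::gamm
    # and correlation = nlme::corAR1()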

    • Yes absolutely, it is very informative, especially considering the minute effort involved. Using Mulligan’s code it is simply:

      # express smoothed births as percent of their mean, then decompose with stl
      t <- ts(100 * bdata$smoothbirths / mean(bdata$smoothbirths), frequency = 7)
      plot(stl(t, s.window = 7))

  8. Pingback: Slick time series decomposition of the birthdays data « Statistical Modeling, Causal Inference, and Social Science
