Estimating seasonality with a data set that’s just 52 weeks long

Trying to figure out some keywords to research for this problem I’m trying to solve. I need to estimate seasonality, but without historical data. What I have are multiple time series of correlated metrics (think department store sales, movie receipts, etc.), but all of them for 52 weeks only. I’m thinking that if these metrics are all subject to some underlying seasonality, I should be able to estimate that without needing prior years’ data.

Can I blog this and see if the hive mind responds? I’m not an expert on this one.

My first thought is to fit an additive model including date effects, with some sort of spline on the date effects along with day-of-week effects, idiosyncratic date effects (July 4th, Christmas, etc.), and possible interactions.

Actually, I’d love to fit something like that in Stan, just to see how it turns out. It could be a tangled mess but it could end up working really well!
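Before reaching for Stan, a minimal non-Bayesian sketch of that additive idea can be fit with plain least squares. Everything below is simulated for illustration, and a low-order Fourier basis stands in for the spline on the date effects:

```python
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52)

# Simulated weekly series: a smooth annual cycle plus noise (stand-in for real data)
true_season = 3 * np.sin(2 * np.pi * weeks / 52) + 1.5 * np.cos(4 * np.pi * weeks / 52)
y = 10 + true_season + rng.normal(0, 0.5, 52)

def fourier_basis(t, period=52, order=3):
    """Intercept plus low-order sin/cos terms: a smooth cyclic basis,
    standing in here for a spline on the date effects."""
    cols = [np.ones_like(t, dtype=float)]
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

X = fourier_basis(weeks)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted_season = X @ beta - beta[0]   # estimated seasonal component, net of intercept
```

With weekly totals there are no day-of-week effects to estimate; idiosyncratic-date dummies (July 4th, Christmas) would just be extra columns in `X`.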

1. For a quick exploration of how the cycle might look, I would run it through an independent component analysis (ICA). That is probably not the best way to do it, but it might give some clues on how to model the cycle.
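ICA proper needs a package (e.g., scikit-learn’s FastICA), but a dependency-light cousin of the same exploration is to standardize the series and pull out the leading singular vector, which tends to surface a shared seasonal signal if there is one. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = np.arange(52)
season = np.sin(2 * np.pi * weeks / 52)        # the shared annual cycle

# Four correlated series: same seasonality, different levels, scales, and noise
series = np.column_stack([
    level + scale * season + rng.normal(0, 0.3, 52)
    for level, scale in [(10, 2.0), (5, 1.5), (8, 3.0), (12, 1.0)]
])

# Standardize each series, then take the leading singular vector:
# the dominant component shared across all of them
Z = (series - series.mean(axis=0)) / series.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
common = U[:, 0] * S[0]      # 52-week common signal (sign is arbitrary)
```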

2. K? O'Rourke says:

The group I was with in MBA school lost the simulated business-strategy game we were in because I, having taken a Time Series course from the Stats Dept., argued “we only have one year of past sales data and no way of knowing what date it started on.” We were supposed to presume it was Jan 1st and have some basic knowledge of the business cycle in North America :-(

A lot is known about how sales of various things vary over the year (sometimes interacting with location), and don’t forget special-day effects.

3. Phil says:

I’m with K?: you have, and have to use, additional information, otherwise you can’t tell if a pattern is a one-time occurrence, an annual pattern, a pattern with some other periodicity, or what. If you look at spending on election ads using data from 2012, and try to apply the same pattern to 2013, you’ll be sadly disappointed.

I have some of the same issues with data on energy consumption in large commercial buildings. Sometimes I just have one year of data, and the energy use was low from Jan-August but then picked up through the end of the year. Is this a periodic pattern (maybe this is a company that does a lot of holiday business) or is it a general level shift (maybe the company permanently added new employees) or is it something else (like the building operator accidentally switched the building so the ventilation is on 24/7, and will switch it back to normal eventually). Without additional information there is just no way to tell: some buildings with data like these are in Category A, some are B, and some are C.

You don’t say what your data are, but I have to think that you have some additional knowledge about what seasonal variation is expected. You need to find a way to get that into the model.

4. Wayne says:

Naive request for clarification: The OP has multiple weekly time series that cover the same 52 weeks and which are theoretically all subject to the same underlying seasonality? So some kind of SEM-ish, VAR-ish approach might be useful?

5. Kaiser says:

The goal is not so much to use this as a forecasting tool for future years as to make short-term judgments. For example, if we observe a big shift in series A, is it due to seasonality or is it due to some A-specific factor? If series B, C, D also show a dip in a similar time period, we might believe the seasonality story. If we have historical data, this is just a simple seasonal adjustment. What if we don’t have history?

My hunch is that if we can identify series B, C, D, etc., they would provide some useful information to establish that seasonality factor. As Wayne correctly stated, in doing this, I’d be making an assumption that the selected series B, C, D are subject to the same seasonality pattern as series A. Further, for this assumption to hold, I’m pretty sure that the de-seasoned A,B,C,D would be correlated in some way.

Given the problem setup, it is accepted that any solution would be a “better than nothing” solution.
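One crude way to operationalize Kaiser’s “is the shift in A echoed in B, C, D?” question, on simulated data; the function name and the z-score cutoff are mine, not anything proposed in the discussion:

```python
import numpy as np

def shared_move(a, others, week, z=1.0):
    """Crude check: is the week-over-week move in series `a` at `week`
    echoed (same sign, at least `z` standard deviations) in most of the
    other series? All inputs are 1-D arrays of equal length."""
    move = np.diff(a)[week - 1]
    echoes = 0
    for b in others:
        db = np.diff(b)
        if np.sign(db[week - 1]) == np.sign(move) and abs(db[week - 1]) >= z * db.std():
            echoes += 1
    return echoes > len(others) / 2   # majority echo -> seasonality is plausible

# Illustration: a level dip at week 30 shared by all four series
rng = np.random.default_rng(2)
a, b, c, d = (rng.normal(0, 0.1, 52) for _ in range(4))
for s in (a, b, c, d):
    s[30:] -= 5.0

verdict_shared = shared_move(a, [b, c, d], 30)     # dip echoed everywhere

# Same dip in `a` only; the comparison series stay flat
a2 = rng.normal(0, 0.1, 52)
a2[30:] -= 5.0
flat = np.zeros(52)
verdict_specific = shared_move(a2, [flat, flat, flat], 30)
```

This is of course only “better than nothing,” in the spirit of the comment above: it asks whether a move is shared, not whether the shared move is seasonal.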

6. Bill Harris says:

I agree on the additional information, too. I’d add that I think this is related to the domain of digital signal processing (DSP). See “Sampling rate of human-scaled time series” from this blog on 2010-06-27 (I’d add a link, but links seem to consign me to the spam filter :-) ). The comments to that posting alerted me to the domain of compressive sensing, too.

DSP has something to say about how much data you need to get a sufficiently small frequency resolution as well as how fast you have to sample the data to reduce aliasing errors acceptably. Compressive sensing lets you go further, but I’ll leave it to others to fill that in.
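To make the frequency-resolution point concrete: sampling weekly for 52 weeks gives a resolution of fs/N = 1/52 cycles per week, i.e., exactly one cycle per year, so the annual cycle lands in the very first non-DC FFT bin and can barely be separated from neighboring frequencies. A small numpy check:

```python
import numpy as np

fs = 1.0                                  # one sample per week
n = 52                                    # one year of weekly data
resolution = fs / n                       # 1/52 cycles/week = one cycle per year
freqs = np.fft.rfftfreq(n, d=1 / fs)      # FFT bin frequencies, cycles per week

weeks = np.arange(n)
y = np.sin(2 * np.pi * weeks / 52)        # a pure annual cycle
spectrum = np.abs(np.fft.rfft(y))
peak_bin = int(np.argmax(spectrum[1:]) + 1)   # skip the DC bin
# The annual cycle sits in the first non-DC bin: the finest frequency resolvable
```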

7. I agree that 52 weeks is not enough to really be sure that whatever signals are found will repeat the following year. Think of the shocks that 9/11 or the 2008 crisis had on virtually all of society. However, I still think that the seasonality can be extracted with proper assumptions (the models and priors that Andrew suggests).

You can also do the automated ICA that I suggested in the first comment, and then carefully interpret the extracted common signals to see if it is reasonable to assume that this is a seasonal pattern (e.g., the start and end of the year have to match). If there are some unexplainable signals, then it might be difficult to argue that it really is the seasonal cycle.

8. Lord says:

You might also want some weather data, since weather isn’t purely periodic, and pricing data to distinguish higher cost from higher volume for sales of variable items like gasoline.

9. Hal Varian says:

The trick is to find another related series that has the same seasonality. Extract the seasonal component from that and use it to do seasonal adjustment for the target series. In practice, this is usually pretty easy.
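A minimal sketch of that trick, on simulated data: take the reference series’ deviation from its own mean as a seasonal index, regress the target on it, and subtract the fitted seasonal part (the regression step matters because the two series need not share the seasonality’s amplitude):

```python
import numpy as np

rng = np.random.default_rng(3)
weeks = np.arange(52)
season = 2 * np.sin(2 * np.pi * weeks / 52)            # shared annual cycle

ref = 20 + season + rng.normal(0, 0.2, 52)             # related series, same seasonality
target = 5 + 1.5 * season + rng.normal(0, 0.2, 52)     # target, scaled seasonality

idx = ref - ref.mean()                                 # seasonal index from the reference

# Fit target = a + b * idx by least squares, then remove the fitted seasonal part
A = np.column_stack([np.ones(52), idx])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, target, rcond=None)
adjusted = target - b_hat * idx
```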

10. sam mason says:

Not sure if this would help, but I’ve just started on a project looking at bringing multiple datasets together and inferring common causal drivers behind them. The application is in gene expression, but it may help. A good intro is:

Savage et al., 2010. Discovering transcriptional modules by Bayesian data integration. Bioinformatics

• K? O'Rourke says:

Neat: Yet another term for http://en.wikipedia.org/wiki/Meta-analysis – which probably needs updating.

Off the top of my head, Brad Efron conceptualized Bayes as being the explicit use of any information from outside the study (data set). That information can take many forms and be used for many purposes. (My initial involvement in meta-analysis was to obtain empirical priors for clinical new treatment effects to use in a cost-benefit analysis of funding clinical trials for the NIH.)

The consensus of many commenters here seems to be that for the seasonal component, the information from outside the study described here is abundant, especially credible, easily available, and very likely transportable, while being orders of magnitude greater than what is _in_ that study.

Sorry to speculate, but in gene expression applications few of these adjectives will be appropriate for a very long time.

As an aside on the wiki entry, it is annoying to discover one’s published work has been declared wrong on wiki: “in reality Pearson computed a simple average in this study and therefore it cannot be considered a meta-analysis”. It would annoy Fisher even more, as he insisted that an analysis of a series of similar agricultural trials should only use a simple average and disregard the within-study standard errors (except perhaps as a qualitative check on the validity of individual studies). WG Cochran did seem to be walking on eggshells when he disagreed with Fisher and developed the semi-weighted average in 1937 (based on a Normal-Normal hierarchical model); he later developed an approximation for it that DerSimonian and Laird (1986) perceptively extended to deal with binary outcomes…
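For concreteness, here is a sketch of the DerSimonian–Laird moment estimator mentioned above (the random-effects extension of Cochran’s semi-weighted average); the function name and return signature are my own choices:

```python
import numpy as np

def dersimonian_laird(y, v):
    """DerSimonian-Laird random-effects pooling.
    y: per-study effect estimates; v: their within-study variances.
    Returns (pooled estimate, tau^2, standard error of the pooled estimate)."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                  # fixed-effect weights
    ybar = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - ybar) ** 2)              # Cochran's Q statistic
    k = len(y)
    # Moment estimate of the between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)                    # the "semi-weighted" weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, tau2, se
```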

11. Tom Fid says:

I don’t have a helpful suggestion, other than to NOT do what a marketing science consultant, whose model I reviewed, did for a big client. They estimated demand with linear regression, using 3 years of data, with a dummy parameter for each of the 52 weeks of the year. Their R^2 was pretty spectacular, but …