Oswaldo Melo writes:
I have learned many curve-fitting models in the past, including their technical and mathematical details. Now I am working on real-world problems and I face a great difficulty: deciding which method to use.
As an example, I have to predict the demand for a product. I have a time series collected over the last 8 years: a simple set of (x,y) data giving the demand for the product in each week. I have this for 9 products. To continue the study, I must predict the demand for each product for the coming years.
Looks easy enough, right? Since I do not have the probability distribution of the data, just use a non-parametric curve fitting algorithm. But which one? Kernel smoothing? B-splines? Wavelets? Symbolic regression? What about Fourier analysis? Neural networks? Random forests?
There are dozens of methods that I could use. But which one performs best remains a mystery. I tried to read many articles in which the authors make predictions based on a time series, and in most it looks like the choice was completely arbitrary. They would say: “now we will fit a curve to the data using multivariate adaptive regression splines.” But nowhere is it explained why they used such a method instead of, let’s say, kernel regression or Fourier analysis or a neural network.
I am aware of cross-validation. But am I supposed to try all the dozen methods out there, cross-validate all of them, and see which one performs best? Can cross-validation even be used for all methods? I am not sure. I have mostly seen cross-validation used within a single method, never across many methods.
I could not find anything in the literature that answers such a simple question: “Which curve fitting model should I use?”
These are good questions. Here are my responses, in no particular order:
1. What is most important about a statistical model is not what it does with the data but, rather, what data it uses. You want to use a model that can take advantage of all the data you have.
2. In your setting with structured time series data, I’d use a multilevel model with coefficients that vary by product and by time. You may well have other structure in your data that you haven’t even mentioned yet, for example demand as broken down by geography or demographic sectors of your consumers; also the time dimension has structure, with different things happening at different times of year. If you want a nonparametric curve fit, you could try a Gaussian process, which plays well with Bayesian multilevel models.
3. Cross-validation is fine but it’s just one more statistical method. To put it another way, if you estimate a parameter or pick a method using cross-validation, it’s still just an estimate. Just cos something performs well in cross-validation, it doesn’t mean it’s the right answer. It doesn’t even mean it will predict well for new data.
4. There are lots of ways to solve a problem. The choice of method to use will depend on what information you want to include in your model, and also what sorts of extrapolations you’ll want to use it for.
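On the narrower question of whether cross-validation can be used to compare different methods: it can, as long as every method is scored on the same held-out data. For a time series the held-out data should come after the training data, not be a random subset. Here is a minimal sketch of that idea in Python, not a recommendation of any particular method. Everything in it is an assumption made for illustration: the data are simulated (trend plus a yearly cycle plus noise), the two forecasters are deliberately crude stand-ins for whatever methods you want to compare, and the function names are hypothetical.

```python
import math
import random


def make_series(n_weeks=416, seed=0):
    """Simulated weekly demand for 8 years: trend + yearly cycle + noise."""
    rng = random.Random(seed)
    return [
        100 + 0.05 * t + 20 * math.sin(2 * math.pi * t / 52) + rng.gauss(0, 5)
        for t in range(n_weeks)
    ]


def mean_forecast(history, horizon):
    """Predict the average of the last 52 weeks for every future week."""
    level = sum(history[-52:]) / 52
    return [level] * horizon


def seasonal_naive_forecast(history, horizon):
    """Predict each future week with the value observed 52 weeks earlier."""
    start = len(history) - 52
    return [history[start + (h % 52)] for h in range(horizon)]


def rolling_origin_mae(series, forecaster, min_train=208, horizon=52, step=52):
    """Mean absolute error over successive train/test splits.

    Each split trains on everything before the origin and tests on the
    following `horizon` weeks, so the test data always lie in the future.
    """
    errors = []
    for origin in range(min_train, len(series) - horizon + 1, step):
        train = series[:origin]
        test = series[origin:origin + horizon]
        preds = forecaster(train, horizon)
        errors.extend(abs(p - y) for p, y in zip(preds, test))
    return sum(errors) / len(errors)


series = make_series()
methods = {"mean": mean_forecast, "seasonal_naive": seasonal_naive_forecast}
scores = {name: rolling_origin_mae(series, f) for name, f in methods.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {score:.2f}")
```

Any method that can be written as "take the history, return a forecast" can be dropped into the `methods` dictionary and scored the same way. The resulting numbers are still only estimates, based here on four one-year test windows, so a small difference between two methods should not be read as settling which one is better.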