I hate polynomials

A recent discussion with Mark Palko [scroll down to the comments at this link] reminds me that I think polynomials are way, way overrated, and that a lot of damage has come from the old-time approach of introducing polynomial functions as the canonical example of linear regression. There are very few settings I can think of where it makes sense to fit a general polynomial of degree higher than 2. I think that millions of students have been brainwashed into thinking of these as the canonical functions, and that this has caused endless trouble later on. I’m not sure how I’d change the high school math curriculum to deal with this, but I do think it’s an issue.

39 thoughts on “I hate polynomials”

  1. Polynomials don’t kill people; it’s the use of polynomials that’s the problem. In general, higher-order polynomials are simply not identified in whatever data set is being used. The estimated coefficients are basically made-up numbers.
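
    A minimal sketch of the identification point, using synthetic data and numpy (the range, degree, and noise level are arbitrary assumptions, not anything from the thread): refitting a moderately high-degree polynomial to resampled noisy data gives wildly different coefficients even though the fitted curves look similar over the observed range.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(40, 60, 200)              # a narrow, age-like range of x values

    def leading_coef():
        # weak linear signal plus noise, then a degree-6 polynomial fit
        y = 0.05 * (x - 50) + rng.normal(scale=1.0, size=x.size)
        return np.polyfit(x, y, 6)[0]         # numpy may warn that the fit is poorly conditioned

    print([round(leading_coef(), 8) for _ in range(5)])   # the leading coefficient jumps around across resamples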

  2. Indeed, polynomials are much abused. The Runge phenomenon is not as well known as it should be.
    It’s tempting to answer Jonathan Stray by suggesting Fourier series, but these are just as abused (applying them to non-periodic data invites the Gibbs phenomenon).

    • I believe you are forgetting just how much Andrew hated this:

      http://statmodeling.stat.columbia.edu/2013/08/05/evidence-on-the-impact-of-sustained-use-of-polynomial-regression-on-causal-inference-a-claim-that-coal-heating-is-reducing-lifespan-by-5-years-for-half-a-billion-people/

      I mostly think of that paper as revealing the dangers of exporting statistical methods from one context to another without carefully considering how the method relates to the actual empirical environment. Personally, I think there are times when a (structurally meaningless) polynomial can be used to get good identification and inference in an RD (regression discontinuity) setting, but the problem in that paper is that “geography” is just not really a good running variable (“good” meaning it has the properties that make RD-type methods convincing).

      That said – there are definitely contexts where people will just plop in as many polynomial terms as it takes to get significance, or significance and the sign they want. I’ve seen it mostly in cases where people feel they have to control very carefully for age effects, but that is just because that’s a problem I’ve worked on.

      • Age effects can be *very* tricky, as age tends to interact with everything, and it can be difficult to distinguish higher-order polynomial terms from other interactions you should be looking for. This seems to become even worse in multilevel and nonlinear models.

        • Agreed. It’s almost like people stop thinking about “age” as a process of human development the minute they can stick it on the right-hand side of a regression equation. And then of course the more complicated the model, and the more age-dependent the outcome, the more sensitive the estimates are to the specification of the outcome-age profile.

        • Age is also a proxy for periods or cohorts in observational data, which makes it even trickier. I do wonder which approach is the most reasonable when you are modeling an effect of x on y and you just want to correct for age, but you do not have a strong theory about exactly how age affects y or the effect of x on y. What would you do? Include as many interactions as possible? Just control for age as a linear predictor? I’m just starting to think about this issue, but the effect of age seems to be highly troublesome and difficult to model even if you do not really care about age itself.

        • You know, on that point, I think we have not fully explored that problem in the context of repeated cross-sections. In those cases, age and cohort are intimately linked, but not collinear. The age-cohort-period problem becomes much more interesting when, within any given age, you have multiple cohorts.

          Also – just because you can add a linear control for age and it doesn’t get dropped doesn’t mean it’s controlling for age. Often, and again depending on the type of cross-sectional data you have, I think you end up actually controlling for something like calendar time, because once you’ve included cohort dummies (which subsume age in the single, instantaneous cross-sectional data world) what’s left is essentially the progress of real time, not age time. And then all of this depends on where your identifying variation is coming from (over real time, over age time, fixed relative to survey time, fixed relative to cohort time, etc.).

          I haven’t even found a convenient vocabulary for trying to talk with people about how various controls might be getting at age-time v. calendar-time. The closest I’ve come is trying to get people to think about the outcome-age profile as an object of interest, and to think about how and why that might bend or shift over time.
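
          A minimal numerical sketch of the age/cohort point above, using synthetic data and numpy (the years and age range are arbitrary assumptions): in a single cross-section, cohort = survey year − age is an exact linear function of age, so the two cannot be separated; with repeated cross-sections the same age spans several cohorts and the collinearity breaks.

          import numpy as np

          # single cross-section: one survey year
          age = np.arange(20, 61)
          cohort = 2000 - age
          X1 = np.column_stack([np.ones_like(age), age, cohort])
          print(np.linalg.matrix_rank(X1))      # 2: age and cohort are perfectly collinear

          # repeated cross-sections: four survey years
          years = np.repeat([1990, 1995, 2000, 2005], age.size)
          ages = np.tile(age, 4)
          cohorts = years - ages
          X2 = np.column_stack([np.ones_like(ages), ages, cohorts])
          print(np.linalg.matrix_rank(X2))      # 3: the same age now appears in multiple cohorts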

        • Interesting thoughts, but I’m not sure I understand what you mean by “progress of real time” as opposed to “age time.” Could you elaborate a little? Time in all its forms (cohorts, periods, age) is still quite undervalued as a variable in social research, at least in sociology (I do know that you (?) economists are more aware of it). My guess is that the reason is that we still mostly have only cross-sectional data, and there we cannot really distinguish between the different time effects. That’s not really an excuse, though, because our cross-sectional data are still affected by all of them.

          Why do you think we have not fully explored the problem in the context of repeated cross-sections? At least there is some way to distinguish between cohorts and age (provided there are either enough repetitions or the time intervals are big enough). Of course, panel (repeated-within-people) designs are far superior, but we usually don’t have those in a form comparable across countries.

        • Which conclusions? If you mean that you can do an RD-type analysis using a linear control in the running variable, then yes, so long as you restrict yourself to regions right around the discontinuity and have a lot of observations there. In fact, I think current best practice is to use a local-linear (or local-polynomial) regression across the running variable, which amounts to essentially the same thing (depending on bandwidth choice), but shows you the relationship in the data at points further from the cut-off as well.

          Of course, if there was ever a method that was designed for the “eye test”, an RD is it. So if you don’t see it in the data, then no specification is going to be convincing. That said – if you do “see” the effect, I’m less concerned about what polynomial you fit, so long as it looks about right around the cutoff (that’s why you should always plot your RD with the actual observations, or bins of observations).
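
          A minimal sketch of the local-linear idea described above, using synthetic data and numpy (the cutoff, bandwidth, and jump size are arbitrary assumptions, not anything from the paper under discussion): fit a straight line on each side of the cutoff using only observations within a bandwidth, and take the difference of the two fits at the cutoff as the estimated jump.

          import numpy as np

          rng = np.random.default_rng(1)
          n, cutoff, bandwidth = 2000, 0.0, 0.5
          x = rng.uniform(-2, 2, n)                                # running variable
          y = 1.0 * (x >= cutoff) + 0.3 * x + rng.normal(scale=0.5, size=n)   # true jump = 1.0

          def fit_at_cutoff(mask):
              # OLS of y on (x - cutoff) within the bandwidth on one side; the intercept
              # is that side's predicted outcome at the cutoff
              slope, intercept = np.polyfit(x[mask] - cutoff, y[mask], 1)
              return intercept

          left = fit_at_cutoff((x < cutoff) & (x > cutoff - bandwidth))
          right = fit_at_cutoff((x >= cutoff) & (x < cutoff + bandwidth))
          print(round(right - left, 2))                            # local-linear estimate of the jump, near 1.0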

        • I meant that getting the big jump in life expectancy across the river doesn’t really require a cubic. You can get that sort of result by fitting quadratics or straight lines as well. Maybe I’m wrong.

        • Well apparently not quite so much. Here’s Andrew’s paper (which has a mistake, I think):

          http://www.stat.columbia.edu/~gelman/research/unpublished/regression_discontinuity_2.pdf

          Table S.9 in the supplemental material gives the authors’ results trying other models. The cubic adjustment gave an estimated effect of 5.5 years with standard error 2.4. A linear adjustment gave an estimate of 1.6 years with standard error 1.7. Figure 2 here shows the relevant part of the table.

          But then Andrew copies and pastes the wrong part of the table. He shows the fairly stable estimates on particulate matter, not the reduced-form effects on life expectancy (or the IV/2SLS estimates on life expectancy that instrument with the discontinuity).

          Looking at the right estimates (on life expectancy), the linear and quadratic estimates are much smaller (but of the same sign), and the cubic and higher estimates are pretty stable for both the OLS and IV (though smaller for the IV). You can sorta see why in the picture.

          *Sidenote: Andrew, why does the caption to Figure 2/Figure S.9 say that the regressions don’t include covariates? They are there in the note on the PNAS version, along with a graph of “smoothness on observables” across the discontinuity.

        • Jrc:

          Yes, it looks like you’re right that I put in the wrong part of the table; I’ll fix it in the revision. Thanks for pointing this out.

      • OTOH, I’ve seen rich models with a lot of structure and hierarchical levels where, by tuning parameters or selecting features, one still gets to the same goal: the conclusions one wants.

        • @Rahul

          I ran a classifier on a collection of your comments on this blog.

          The algorithm is uncertain whether to classify you as a Cartesian skeptic or a card carrying polemicist.

          I also find that “treating” a thread with one of your comments is associated with 3 comments or more per post (95% CI 1-6).

          In addition, your comments are twice as likely to elicit a response from Andrew relative to all other comments.

          ;-)

        • Commenter Sailer, who lives in his own universe and composes 10,000 blog comments a day, is an outlier and should not have been counted.

  3. 2009: See “The Alberta oil boys network spins global warming into cooling,” which uses a sixth-order polynomial.
    Deep Climate’s animation cycles regression lines from linear through sixth-order.

    2012: Roy Spencer, Ph.D: UAH Global Temperature Update for March 2012: +0.11 deg. C
    “The 3rd order polynomial fit to the data (courtesy of Excel) is for entertainment purposes only, and should not be construed as having any predictive value whatsoever.”

    In “Roy Spencer’s Entertaining Polynomial Suggests End-of-Days,” David Appell’s commentary notes that the caveat sometimes didn’t get copied, and that even Spencer himself sometimes left it out. Unfortunately, he linked to a URL that gets updated, but via the Wayback Machine we find that the caveat-less polynomial appeared by Oct 6, 2011, and persisted through Nov 1, 2012.

    Creating such a graph with a disappearing caveat easily misleads people, since we tend to remember the appearance of the graph, not the caveats, especially when they get lost.
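
    A minimal sketch of why the tail of such a fit is so fragile, using entirely synthetic data and numpy (not the UAH series; the trend, noise, and perturbation are made-up assumptions): the end-of-sample slope of a sixth-order polynomial fit moves when the last few observations are nudged.

    import numpy as np

    rng = np.random.default_rng(2)
    t = np.arange(2000, 2013, 1 / 12)                     # monthly time index
    anom = 0.015 * (t - 2000) + rng.normal(scale=0.15, size=t.size)   # gentle trend plus noise

    def end_slope(y):
        # slope of a sixth-order polynomial fit, evaluated at the final time point
        coefs = np.polyfit(t - t.mean(), y, 6)
        return np.polyval(np.polyder(coefs), (t - t.mean())[-1])

    nudged = anom.copy()
    nudged[-6:] -= 0.1                                    # shift the last six months down slightly
    print(round(end_slope(anom), 3), round(end_slope(nudged), 3))   # compare how much the tail slope moves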

    • Fourier approximations ought to be used only when the data are known or believed to represent a nearly periodic phenomenon. I’ve used them successfully to represent the light curves of variable stars, for example.

      For more general data, if they are clearly nonlinear and possibly bumpy, I would generally prefer cubic splines. Although they use cubics, they are not subject to the objections that Andrew mentions, because the requirement of continuity in value and derivatives at all of the nodes tends to insulate them from overfitting in each interval between nodes. The choice of node position and number is perhaps an art; I don’t have much experience with splines, although in the cases where I have used them it was adequate to eyeball the data to get reasonable positioning of the nodes.

      If there were clear asymptotes, you could, for example, multiply a spline by an exponential to get both the asymptotic behavior and the behavior in the non-asymptotic region right. If the phenomenon were periodic with superimposed asymptotic behavior (like a decaying ringing), you could use a Fourier polynomial to represent the periodic part, multiplied by an appropriate exponential.
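
      A minimal sketch of a least-squares cubic spline fit along these lines, using scipy and synthetic data (the knot positions are eyeballed, as suggested above; everything here is an illustrative assumption):

      import numpy as np
      from scipy.interpolate import LSQUnivariateSpline

      rng = np.random.default_rng(3)
      x = np.sort(rng.uniform(0, 10, 300))
      y = np.sin(x) + 0.1 * x + rng.normal(scale=0.3, size=x.size)   # bumpy nonlinear signal plus noise

      knots = [2.5, 5.0, 7.5]                          # interior knots, placed by eyeballing the data
      spline = LSQUnivariateSpline(x, y, knots, k=3)   # least-squares cubic spline fit to the noisy data

      grid = np.linspace(0, 10, 11)
      print(np.round(spline(grid), 2))                 # smooth fitted values over the observed range only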

      • BTW, obviously you can’t use splines to extrapolate. They are fine for understanding what’s happening where the data are, but basically useless outside that interval.

        This doesn’t apply to Fourier polynomials, but as I say the applicability of such approximations is limited to a specialized subset of phenomena.

      • “Fourier approximations ought to be used only when the data are known or believed to represent a nearly periodic phenomenon”

        Although that’s a very good place to use Fourier series, it’s by no means the only one. I’ve used them very successfully, for example, to approximate even functions (the sine terms go to zero) or odd functions (the cosine terms go to zero) on finite intervals.

        As long as the domain of interest could reasonably be considered a piece of a larger domain over which an “extended” function could be periodic, the sin/cos basis can be very useful. The biggest issue is when the function can grow unbounded, since sin and cos have bounded range. The other issue is when the function is discontinuous, and that includes discontinuity at the boundary of the domain over which you’re pretending the function is periodic. So, for example, you can Fourier-approximate y = x on [-1, 1], but convergence at the endpoints of the domain will be slow and exhibit ringing.

        Richard Hamming’s Digital Filters and Numerical Methods for Scientists and Engineers are two very accessible books that give a lot of useful theory. They’re a bit dated perhaps, but very practical.

        In my opinion it’s the combination of noisy finite data and highly flexible bases of any kind (be they Fourier, polynomial, radial basis, spline, or other ad hoc basis functions) that causes the problems Andrew rails against. The ability to fit wiggles in the data that are purely noise means poor results due to overfitting, regardless of the basis.

        Regularizing the fit can help avoid fitting the wiggles. One simple way to regularize is to limit the number of terms (i.e., only use low-order polynomials), but there are other useful ways, for example putting strong priors on certain coefficients. The continuity and derivative-continuity requirements in spline fitting are another kind of regularization.

        One thing I’ve found useful is combining radial basis functions with prior information for regularization. If you know a function is likely to have detail in certain regions, you can place more RBF knots in those regions, for example by using your prior on the location and scale parameters.
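
        A minimal sketch of that combination, using synthetic data and numpy (the basis width, knot count, and penalty are arbitrary assumptions, and the ridge penalty stands in for a zero-mean Gaussian prior on the weights): fit a generous Gaussian RBF basis to noisy data with and without regularization and compare how well each recovers the underlying function.

        import numpy as np

        rng = np.random.default_rng(4)
        x = np.linspace(-1, 1, 60)
        truth = np.sin(3 * x)
        y = truth + rng.normal(scale=0.3, size=x.size)

        centers = np.linspace(-1, 1, 25)                  # a deliberately generous basis: 25 Gaussian bumps
        Phi = np.exp(-((x[:, None] - centers[None, :]) / 0.15) ** 2)

        def fit(lam):
            if lam == 0.0:
                w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # unregularized least squares
            else:
                # ridge penalty, equivalent to a zero-mean Gaussian prior on the weights
                w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(centers.size), Phi.T @ y)
            return Phi @ w

        for lam in (0.0, 1.0):
            rmse = np.sqrt(np.mean((fit(lam) - truth) ** 2))
            print(lam, round(rmse, 3))        # the regularized fit usually tracks the truth more closely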

  4. Social scientist laments a mathematical model doesn’t work outside its scope of applicability. Breaking news.

    Surely everyone knows that polynomials are useful approximations of functions locally, and that most simulation/prediction (which extrapolation is) has exponentially growing error components; people looking at asymptotes of fitted polynomials are just doing it wrong.
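
    A minimal sketch of the local-approximation point, using synthetic data and numpy (the function, degree, and ranges are arbitrary assumptions): a polynomial fit that is perfectly adequate inside the data range goes badly wrong as soon as you extrapolate.

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(0, 1, 50)
    y = np.log1p(x) + rng.normal(scale=0.02, size=x.size)       # smooth target observed on [0, 1]

    coefs = np.polyfit(x, y, 5)                                  # degree-5 fit, fine in-sample
    for xe in (0.5, 1.0, 2.0, 4.0):
        print(xe, round(np.polyval(coefs, xe) - np.log1p(xe), 3))  # error grows rapidly outside [0, 1]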

    • Sigs:

      Unfortunately, “everyone” doesn’t realize the problem with polynomials. I see them mistakenly used all the time, including in high-profile papers on serious topics (as illustrated in my link).

      Indeed, your comment that “Surely everyone . . .” is contradicted by your later statement that “people are just doing it wrong.”

      Finally, almost nothing on this blog is “breaking news”; as I’ve stated many times, most of the posts here are on a 1 or 2 month delay. So if you’re looking for breaking news, you’re in the wrong place.

  5. From the abstract: “Without such a clear underlying pattern, the validity estimated coefficient from the discontinuity is much less clear.” I’m having trouble reading this sentence — am I missing something, or is there a typo somewhere around “the validity estimated coefficient”?

  6. I was trying to think of which cubic or higher polynomial I’ve seen used as an arbitrary fitting function.

    One example is the Shomate equation, used extensively by NIST in their thermodynamics database of materials to fit the temperature dependence of entropies, specific heats, etc. The enthalpy equation uses terms as high as fourth degree. I’m not sure whether this is entirely ad hoc or has some theoretical basis.

    Gas Phase Heat Capacity (Shomate Equation)

    Cp° = A + B*t + C*t^2 + D*t^3 + E/t^2

    H° − H°298.15 = A*t + B*t^2/2 + C*t^3/3 + D*t^4/4 − E/t + F − H

    S° = A*ln(t) + B*t + C*t^2/2 + D*t^3/3 − E/(2*t^2) + G

    Cp = heat capacity (J/mol*K)
    H° = standard enthalpy (kJ/mol)
    S° = standard entropy (J/mol*K)
    t = temperature (K) / 1000.
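
    A minimal sketch of evaluating the heat-capacity form above in Python (the coefficient values below are placeholders for illustration, not NIST values for any substance):

    # Shomate heat capacity: Cp = A + B*t + C*t^2 + D*t^3 + E/t^2, with t = T(K)/1000
    def shomate_cp(T, A, B, C, D, E):
        t = T / 1000.0
        return A + B * t + C * t ** 2 + D * t ** 3 + E / t ** 2   # J/(mol*K)

    # placeholder coefficients, for illustration only
    print(shomate_cp(500.0, A=25.0, B=10.0, C=-1.0, D=0.1, E=0.05))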

    • Polynomials are theoretically a complete basis for continuous functions: the Weierstrass Approximation Theorem says that if f(x) is continuous on [a,b], there exists a sequence of polynomials p_n(x), with each p_n of degree n, that converges **uniformly** to f on the closed interval [a,b]. Existence, however, does not imply that we know how to calculate the coefficients. That’s particularly true if we only have f(x) at a finite set of points.

      If you have only a finite set of equally spaced points, you can always define a polynomial of some degree that goes through *all* the points exactly. Whether that polynomial behaves well between the points is the subject of Runge’s phenomenon, which is described fairly well on Wikipedia.

      When it comes to statistical data, we typically have the function plus a bunch of errors evaluated at a finite set of points. It’s the presence of the errors and the finite set of points that makes the statistical fitting problem different from the Weierstrass problem.

      NIST, however, has a dense set of points and very small measurement errors, typically MANY more points than coefficients in their polynomials. So they’re free to choose those polynomials using something like least squares, which eliminates the Runge issue. Given the small measurement noise and large number of data points, I’d say this is one case where high-degree polynomials are not a problem.
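
      A minimal sketch of the contrast described above, using Runge’s classic 1/(1 + 25x²) example and numpy (the degrees and point counts are arbitrary choices): exact interpolation through equally spaced points oscillates badly near the ends, while a least-squares fit of the same degree to many more points behaves well.

      import numpy as np

      def f(x):
          return 1.0 / (1.0 + 25.0 * x ** 2)           # Runge's classic example

      grid = np.linspace(-1, 1, 401)                   # dense grid for checking each fit

      xi = np.linspace(-1, 1, 11)                      # 11 equally spaced points, degree-10 interpolation
      interp_err = np.max(np.abs(np.polyval(np.polyfit(xi, f(xi), 10), grid) - f(grid)))

      xd = np.linspace(-1, 1, 201)                     # many more points than coefficients, same degree
      lsq_err = np.max(np.abs(np.polyval(np.polyfit(xd, f(xd), 10), grid) - f(grid)))

      print(round(interp_err, 2), round(lsq_err, 2))   # the interpolant's error is far larger near the ends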
