A linear regression example, and a question

Here is one of my favorite homework assignments. I give students the following twenty data points and ask them to fit y as a function of x1 and x2.

x1 x2 y
0.4 19.7 19.7
2.8 19.1 19.3
4.0 18.2 18.6
6.0 5.2 7.9
1.1 4.3 4.4
2.6 9.3 9.6
7.1 3.6 8.0
5.3 14.8 15.7
9.7 11.9 15.4
3.1 9.3 9.8
9.9 2.8 10.3
5.3 9.9 11.2
6.7 15.4 16.8
4.3 2.7 5.1
6.1 10.6 12.2
9.0 16.6 18.9
4.2 11.4 12.2
4.5 18.8 19.3
5.2 15.6 16.5
4.3 17.9 18.4

[If you want to play along, try to fit the data before going on.]

The usual solution

Students will fit a linear regression model, which in this case fits well, with an R-squared of 97%.
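
For readers playing along, here is a minimal sketch of that fit in R, assuming the twenty rows are typed in exactly as printed above. The names dat, fit, and r2 are illustrative choices, not part of the original assignment.

```r
# Data from the table above (values as printed, rounded to one decimal).
dat <- data.frame(
  x1 = c(0.4, 2.8, 4.0, 6.0, 1.1, 2.6, 7.1, 5.3, 9.7, 3.1,
         9.9, 5.3, 6.7, 4.3, 6.1, 9.0, 4.2, 4.5, 5.2, 4.3),
  x2 = c(19.7, 19.1, 18.2, 5.2, 4.3, 9.3, 3.6, 14.8, 11.9, 9.3,
         2.8, 9.9, 15.4, 2.7, 10.6, 16.6, 11.4, 18.8, 15.6, 17.9),
  y  = c(19.7, 19.3, 18.6, 7.9, 4.4, 9.6, 8.0, 15.7, 15.4, 9.8,
         10.3, 11.2, 16.8, 5.1, 12.2, 18.9, 12.2, 19.3, 16.5, 18.4)
)

# Ordinary least-squares fit of y on x1 and x2.
fit <- lm(y ~ x1 + x2, data = dat)

# Estimated intercept and slopes, and the proportion of variance explained.
r2 <- summary(fit)$r.squared
print(coef(fit))
print(round(r2, 2))
```

Because the table is rounded to one decimal place, the estimates may differ slightly in the second decimal from a fit to the unrounded simulated values.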

The true model

Actually, however, the data were simulated from the “Pythagorean” model, y^2 = x1^2 + x2^2. We used the following code in R, using the runif command, which draws a random sample from a uniform distribution:

x1 <- runif (n=20, min=0, max=10)
x2 <- runif (n=20, min=0, max=20)
y <- sqrt (x1^2 + x2^2)

It is striking that the linear model, y = 0.71 + 0.49 x1 + 0.86 x2, fits these data so well.

What is the point of this example? At one level, it shows the power of multiple regression: even when the data come from an entirely different model, the regression can fit well. There is also a cautionary message, showing the limitations of any purely data-analytic method for finding true underlying relations. As we tell the students, if Pythagoras knew about multiple regression, he might never have discovered his famous theorem.

A question

Does anyone know where this example comes from? We included it in our book, Teaching Statistics: A Bag of Tricks, but I know I’ve seen it before; I just can’t remember where.
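
To make the cautionary message concrete, the sketch below compares the reported linear model, y = 0.71 + 0.49 x1 + 0.86 x2, with the true Pythagorean model at a few points. The four test points and the names linear_pred, pythag, and pts are hypothetical choices for illustration: two points lie inside the simulated ranges (x1 between 0 and 10, x2 between 0 and 20) and two lie far outside.

```r
# Linear model with the coefficients reported in the post.
linear_pred <- function(x1, x2) 0.71 + 0.49 * x1 + 0.86 * x2

# True generating model.
pythag <- function(x1, x2) sqrt(x1^2 + x2^2)

# Hypothetical test points: rows 1-2 are inside the simulated ranges,
# rows 3-4 are well outside them.
pts <- data.frame(x1 = c(5, 3, 100, 0),
                  x2 = c(10, 18, 10, 200))

pts$truth  <- pythag(pts$x1, pts$x2)
pts$linear <- linear_pred(pts$x1, pts$x2)
pts$error  <- pts$linear - pts$truth

print(pts)
```

With these points, the linear model stays within about one unit of the truth inside the original ranges and misses by tens of units outside them, which suggests the good fit is tied to the range of the data rather than to the form of the model.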


  1. Sam Cook says:

    RonCook commented:

    I ran your regression problem through Excel and yes, a linear regression gives a very good fit. Even more unfortunate is that visual inspection of the fitted data does not give the user any help in determining that a linear fit is not the right model for the data. In the absence of knowing what physical system underlies the generation of the data, one of the things that I have been doing, and trying to train my coworkers to do, is to use unsupervised NN methods to look at the data before applying an incorrect model to it. Specifically, I have been using both GMDH (group method of data handling) and Kohonen's self-organizing maps to take a look at the data and let the data itself guide me to the type of model(s) that should be overlaid on the data.

  2. Kaleberg says:

    One who practices the black art of numerical analysis would not be at all surprised at your result. For x1 and x2 in the ranges you specified it may be that the regression derived model is just as good or better than the Pythagorean model. Numerical analysts often use simple linear or lower order models of complex functions to avoid wasting computrons and to minimize error. If you can turn two multiplications and a square root extraction into one multiplication and an addition, do it.

    It's all an artifact of limiting the range of the values. You surely don't think that your computer uses a Taylor expansion to compute the sine function. It gets the angle into the range 0 to pi/2 (or whatever) and then uses an optimized polynomial for that range. Where does that polynomial come from? Someone did a higher order regression to minimize the error across the range of values!

    Of course, thinking like a numerical analyst can cause your brain to explode.
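
The range-limited polynomial idea in the comment above can be sketched in R. This is only an illustration under stated assumptions: the grid of 200 points and the degree of 5 are arbitrary choices, and plain least squares is used because it matches the commenter's description. Production math libraries generally tune such polynomials by other criteria, such as minimizing the worst-case error.

```r
# Grid over the reduced range [0, pi/2] (200 points is an arbitrary choice).
x <- seq(0, pi / 2, length.out = 200)
y <- sin(x)

# Least-squares fit of a degree-5 polynomial (the degree is an assumption).
fit <- lm(y ~ poly(x, 5))

# Largest absolute error of the fitted polynomial on the grid.
max_err <- max(abs(fitted(fit) - y))
print(max_err)
```

The small maximum error holds only on the fitted interval; outside it the polynomial is not constrained to track the sine function, which is the commenter's point about limiting the range.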