Frank Hansen writes:
I [Hansen] signed up for my first marathon race. Everyone asks me my predicted time. The predictors online seem geared to, or based on, elite runners, and in any case they seem a bit limited.
So I decided to do some analysis of my own.
I was going to put together a web page where people could get their race time predictions, and maybe sell some ads for sports GPS watches, but the analysis might also be publishable.
I have 2 requests which obviously I don’t want you to spend more than a few seconds on.
1. I was wondering if you knew of any sports performance researchers working on performance of not just elite athletes, but the full range of runners.
2. Can you suggest a way to do multilevel modeling of this? There are several natural subsets for the data but it’s not obvious what makes sense. I describe the data below.
3. Phil (the runner/co-blogger who posted about weight loss) might be interested.
I collected race results for the Chicago Marathon and 3 shorter races: Chicago Half Marathon, Soldier Field 10 Miler, Ravenswood 5k. I collected data from 2003 through 2009. Within each year I matched results for finishers between each shorter race and that year’s marathon based on full name and age. I used Python to scrape web pages for the results.
Of course in a particular year a given marathoner may have run more than one of the shorter races. At this point I am ignoring that, treating them as independent records even though they have the same marathon finish data.
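The matching step described above can be sketched roughly as follows. This is an illustration only, not Hansen's scraper: the `Result` fields, the `normalize` helper, and the sample rows are all hypothetical, and lowercasing and collapsing whitespace in names is an added assumption. Note that matching on exact age will miss anyone who had a birthday between the short race and the marathon.

```python
from collections import namedtuple

# Hypothetical row layout; the field names are assumptions, not Hansen's schema.
Result = namedtuple("Result", ["year", "race", "name", "age", "pace"])


def normalize(name):
    """Lowercase and collapse whitespace so trivial name variants still match."""
    return " ".join(name.lower().split())


def match_to_marathon(short_results, marathon_results):
    """Pair each short-race finisher with the same year's marathon result."""
    # Index marathon finishers by (year, full name, age).
    marathon_by_key = {
        (m.year, normalize(m.name), m.age): m for m in marathon_results
    }
    records = []
    for s in short_results:
        m = marathon_by_key.get((s.year, normalize(s.name), s.age))
        if m is None:
            continue  # no marathon finish found for this runner that year
        records.append({
            "year": s.year,
            "short.race.type": s.race,
            "short.pace": s.pace,  # seconds per mile
            "mar.pace": m.pace,    # seconds per mile
            "age": s.age,
        })
    return records


# Made-up rows for illustration only.
marathon = [Result(2008, "marathon", "Pat Example", 35, 570.0)]
short = [
    Result(2008, "half", "Pat Example", 35, 480.0),
    Result(2008, "rw", "pat  example", 35, 450.0),
    Result(2008, "sf10", "Sam Sample", 41, 500.0),
]
records = match_to_marathon(short, marathon)
```

A runner with two short races produces two records that share one marathon pace, which is exactly the non-independence mentioned above.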
I would think that having several shorter race times would help predict a marathon time, but requiring several matches really cuts down the data.
I also collected weather data, so I know the temperature, humidity, wind speed near 8 am for each race (in Chicago).
I end up with around 13,000 records. A record contains a marathon time, a short race time, the type of short race, and the differences in temperature, humidity, and wind speed between the short race and the marathon. I also know the age and sex of the marathon finisher.
Taking logs improves the R-squared, but I left the variables unlogged because the coefficients are easier to interpret that way.
int.form <- "mar.pace ~ short.pace + short.race.type + age + sex + temp.dif + humid.dif + wind.dif - 1"
lm(formula = int.form, data = full.dat)

Residuals:
     Min       1Q   Median       3Q      Max
-510.061  -36.867   -5.632   34.116  510.552

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
short.pace            0.999389   0.006703 149.087  < 2e-16 ***
short.race.typehalf  82.630974   4.242505  19.477  < 2e-16 ***
short.race.typerw   106.133301   4.347218  24.414  < 2e-16 ***
short.race.typesf10  89.458519   4.209498  21.252  < 2e-16 ***
age                   0.321860   0.064960   4.955 7.33e-07 ***
sexM                  8.444752   1.286381   6.565 5.41e-11 ***
temp.dif              1.516766   0.051981  29.179  < 2e-16 ***
humid.dif             0.128886   0.041519   3.104  0.00191 **
wind.dif             -1.534700   0.150816 -10.176  < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65.79 on 13004 degrees of freedom
Multiple R-squared: 0.9895, Adjusted R-squared: 0.9895
F-statistic: 1.368e+05 on 9 and 13004 DF, p-value: < 2.2e-16
In the regression results the marathon and short race “pace” variables are in seconds per mile, so the short.race.typehalf coefficient of about 82 means roughly: add 82 seconds to your half marathon mile pace to get your marathon mile pace, and so on for the other independent variables. Temperature is in degrees Fahrenheit, humidity in percent, and wind speed in mph.
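To make that interpretation concrete, here is a minimal sketch that plugs numbers into the fitted equation. The coefficients are copied from the output above; the function names, the example runner, and the reading of "rw" as Ravenswood and "sf10" as Soldier Field are assumptions. The email does not say which way the weather differences are signed, so the example leaves them at zero.

```python
# Coefficients copied from the regression output (units: seconds per mile).
SHORT_PACE = 0.999389
RACE_OFFSET = {"half": 82.630974, "rw": 106.133301, "sf10": 89.458519}
AGE = 0.321860
SEX_M = 8.444752
TEMP_DIF = 1.516766
HUMID_DIF = 0.128886
WIND_DIF = -1.534700


def predict_marathon_pace(short_pace, race, age, male,
                          temp_dif=0.0, humid_dif=0.0, wind_dif=0.0):
    """Predicted marathon pace in seconds per mile from the linear fit."""
    return (SHORT_PACE * short_pace
            + RACE_OFFSET[race]
            + AGE * age
            + (SEX_M if male else 0.0)
            + TEMP_DIF * temp_dif
            + HUMID_DIF * humid_dif
            + WIND_DIF * wind_dif)


def format_pace(seconds_per_mile):
    """Render seconds per mile as m:ss."""
    minutes, seconds = divmod(round(seconds_per_mile), 60)
    return f"{minutes}:{seconds:02d}"


# Hypothetical runner: 35-year-old man with an 8:00/mile half marathon.
pace = predict_marathon_pace(480.0, "half", age=35, male=True)
print(format_pace(pace))
```

Because the model has no intercept and includes age and sex terms, the full prediction for a given runner is more than the race-type offset alone: the age and sex contributions are added on top of the 82 seconds.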
Marathon day in 2009 was really cold, and predicting pace for 2009 from a fit to the other years gives larger errors than predicting 2008 from a fit to the non-2008 data.
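The comparison described here is a leave-one-year-out check. A minimal sketch follows, using a one-predictor least-squares line as a stand-in for the full model; the helper names are hypothetical and the rows are made up purely to exercise the code (they are not the Chicago data).

```python
import math


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_year_out(rows):
    """rows: (year, short_pace, mar_pace). Returns {held-out year: RMSE}."""
    rmse_by_year = {}
    for held_out in sorted({year for year, _, _ in rows}):
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        intercept, slope = fit_line([r[1] for r in train],
                                    [r[2] for r in train])
        squared = [(y - (intercept + slope * x)) ** 2 for _, x, y in test]
        rmse_by_year[held_out] = math.sqrt(sum(squared) / len(squared))
    return rmse_by_year


# Made-up rows: 2009 is given a different offset from the other two years.
rows = [
    (2007, 420, 510), (2007, 480, 570), (2007, 540, 630),
    (2008, 430, 520), (2008, 500, 590), (2008, 560, 650),
    (2009, 440, 500), (2009, 490, 550), (2009, 550, 610),
]
errors = leave_one_year_out(rows)
```

A year whose conditions differ from all the others shows up as a larger held-out error, which is the pattern reported for 2009.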
My main piece of advice is to never ever ever ever ever use “summary” to display regression outputs in R. Only use “display” or “coefplot”. Unless, that is, you care that your standard error is “4.242505” or that your p-value is “7.33e-07” or that your F-statistic is “1.368e+05”. I don’t. But, then again, I’m a Bayesian.
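As a rough illustration of the point, the sketch below prints the same estimates and standard errors at two decimal places. It only loosely imitates the compact style of “display” (which, along with “coefplot”, lives in the R package arm) and is not a reimplementation; the function name and column labels are assumptions.

```python
# (name, estimate, standard error) copied from the summary output above.
coefs = [
    ("short.pace", 0.999389, 0.006703),
    ("short.race.typehalf", 82.630974, 4.242505),
    ("short.race.typerw", 106.133301, 4.347218),
    ("short.race.typesf10", 89.458519, 4.209498),
    ("age", 0.321860, 0.064960),
    ("sexM", 8.444752, 1.286381),
    ("temp.dif", 1.516766, 0.051981),
    ("humid.dif", 0.128886, 0.041519),
    ("wind.dif", -1.534700, 0.150816),
]


def compact_table(rows, digits=2):
    """Format estimates and standard errors with a fixed number of decimals."""
    width = max(len(name) for name, _, _ in rows)
    lines = [f"{'':<{width}}  coef.est  coef.se"]
    for name, est, se in rows:
        lines.append(f"{name:<{width}}  {est:>8.{digits}f}  {se:>7.{digits}f}")
    return "\n".join(lines)


table = compact_table(coefs)
print(table)
```

Dropping the t values, p-values, and trailing digits leaves the two columns a reader actually needs to judge each coefficient.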