## Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models

For the chapter in Regression and Other Stories that includes nonlinear regression, I’d like a couple homework problems where the kids have to construct and fit models to real data. So I need some examples. We already have the success of golf putts as a function of distance from the hole, and I’d like some others. One thing that came to mind today, because I happened to see a safety warning poster on the bus reminding people not to drive too fast, is data on speed and traffic accidents.

But I’m interested in other examples too. Just about anything interesting with data on x and y where there’s no simple linear relation, and where log and logit transformations don’t do the trick either. The outcome can be discrete or continuous, either way.

There’s gotta be lots and lots of good examples but for some reason I’m drawing a blank. So I could use your help. I need not just the examples but also the data.

Thanks! Anyone who comes up with a good suggestion will be rewarded with a free Stan sticker.

1. Shosh says:

A few things that might be fun:

The taxi/limousine commission publishes trip data that economics love analyzing in papers; now conveniently accessible in an R package: https://cran.rstudio.com/web/packages/nyctaxi/vignettes/nyc_taxi.html

Traffic stops (by police) data from the Stanford project on this: https://openpolicing.stanford.edu/data/

Data from toxicity studies by the National Oceanic & Atmospheric Administration: https://www.diver.orr.noaa.gov/deepwater-horizon-nrda-data#

Cardiovascular mortality rates by year/sex/county: http://ghdx.healthdata.org/record/united-states-cardiovascular-disease-mortality-rates-county-1980-2014

Instances of Dengue fever by time/location in Sri Lanka: https://cran.r-project.org/web/packages/mozzie/index.html

2. Jonathan Auerbach and Rob Trangucci contributed a very nice talk on speed limits and pedestrian safety to Stan Con 2017. Link with data, model, and Rmd below.

Long live reproducible research!

https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/01_auerbach

3. Erin Jonaitis says:

Maybe something involving seasonality, e.g. hours of daylight, or blood levels of Vitamin D?
(For a small amount of extra screwiness you could do sunrise/set times. You could be some junior programmer’s first exposure to the annoyance that is daylight saving time!)

• Erin Jonaitis says:

P.S. re “kids” — what is the target audience for this textbook? Sorry if you’ve already mentioned this elsewhere, but I couldn’t find it immediately.

• Andrew says:

Erin:

The book is self-contained, but I expect it will mostly be used as a textbook for courses in regression (traditionally a second course in statistics for undergraduates or a first course for graduate students).

4. Jonathan (another one) says:

The teams, score and date of every NCAA Division 1 and 3 hockey game since 2002. Suitable for fitting Bradley-Terry models, Poisson regression models for goals scored and all manner of home/road, strength and other effects. https://github.com/octonion/hockey/tree/master/ncaa/csv

5. T says:

Varian’s titanic survivor example here: http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
This example lends itself nicely to trees, but maybe you don’t want to go there. Could use it to demonstrate importance of interactions though.

6. Hernan Bruno says:

During the Rio Olympics, I was watching the Air Rifle 10m. It is obvious that in this and other disciplines, the winner is declared by the highest score, but this is not statistically significant if we were to do a test. Check the wikipedia page

https://en.wikipedia.org/wiki/Shooting_at_the_2016_Summer_Olympics_%E2%80%93_Women%27s_10_metre_air_rifle#final

https://www.dropbox.com/s/2cy484b0thx2xtu/airrifle2016.csv?dl=0

I decided to check it. But with repeated observations, I needed to run a stan model, as each shooter has her own mean and probably her own standard deviation. Notice that there are two stan codes below. One assumes that the score (0-11) is normally distributed. The second assumes that the distance from the center of the target is distributed Rayleigh, which is the correct distribution for a distance from a center if symmetry in all directions is assumed. Stan supports the Rayleigh distribution, so you can run it. I think it is an interesting example of the Rayleigh distribution. And by the way, these shooters shot 14-20 times (notice that they don’t all shoot the same number of times because of the elimination system). We would need them to shoot 2000 times to start seeing some non-overlapping intervals here.

What do you think? Don’t be harsh with the crappy coding, I was watching sports on my couch when I coded this. Also notice some old-fashioned Stan notation.

```library(tidyr)

# Data capture a bit strange, change name of winner.
colnames(data_airrifle)[1] <- "Thrasher"

# Make it "long" and delete the NA
data_airrifle <- gather(data_airrifle)
data_airrifle <- data_airrifle[!is.na(data_airrifle[, 2]), ]

#list of competitors
competitor_id = as.factor(data_airrifle\$key)

levels(competitor_id) <- unique(data_airrifle\$key)

# create id from lastname
id = as.numeric(competitor_id)
score = data_airrifle\$value
score2 =11- data_airrifle\$value
K= length(competitors)
N = length(id)

da.ar = list(score = score, score2 = score2, id=id, N=N, K=K)

stan.code = "
data {
int N;
int K;
int id[N];
real score[N];
}
parameters {
vector[K] gamma;
vector[K] sigma;
real mu_gamma;
real sigma_gamma;
}
model{
gamma ~ normal(mu_gamma, sigma_gamma);
for(i in 1:N){
score[i] ~ normal(gamma[id[i]], sigma[id[i]]);
}
}
"
library(rstan)
out = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)

# Now using the Rayleigh distribution

stan.code = "
data {
int N;
int K;
int id[N];
real score2[N];
}
parameters {
vector[K] gamma;
real mu_gamma;
real sigma_gamma;
}
model{
gamma ~ normal(mu_gamma, sigma_gamma);
for(i in 1:N){
score2[i] ~ rayleigh(gamma[id[i]]);
}
}
"
library(rstan)
out2 = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)
print(out2)
plot(out, pars="gamma")
```
• Hernan Bruno says:

I just see this comment posted and I notice the indentation did not come out.

7. Terry says:

Income distribution is an inverted U shape with a mass point at zero. There is also a curious high-frequency oscillation suggesting that income is often rounded off to a multiple of \$5,000 (bins are \$2,500 wide).

This is a good illustration of the exhortation to actually look at your data rather than blindly fit a model.

Data are at: https://www2.census.gov/programs-surveys/cps/tables/finc-07/2017/finc07.xls. Data is for frequency of household income for incomes from zero to \$100,000. Relevant data is in cells B11-B50. Be careful to use only cells B11-B50 because the bin size is different for cells B51-B54. (This change in bin size could be another example of something to watch out for.)

Website with information about data is at: https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-finc/finc-07.html

• Terry says:

This data is probably not what you want, though, because it is basically a pdf of income rather than an x-y relationship amenable to regression techniques.

8. Julien says:

What about trying to measure the gravitational acceleration constant (9.81 m/s^2)? Data would consist of measures of the time for an object to drop from different heights, and the model would have to look like Newton’s law of gravitation. A fun thing would be to make the students build the dataset themselves with a clock and a tape!

9. Ed Hagen says:

I had students collect data with their smartphones, which are packed full of sensors, for exactly this sort of problem. There are free apps for both iPhones and Androids that make it easy to collect the raw data, which student can then email to themselves in csv format. Specifically, I had them swing their phones on the charging cable like a pendulum, and then fit the acceleration data using a damped harmonic oscillator model. But there are many other sensors too that would easily generate data with non-linear relationships:

https://en.wikipedia.org/wiki/IPhone#Sensors

10. awinter says:

New Mexico pueblo populations!

For a small pystan project I have New Mexico pueblo populations from ~1790 – 1950. I added latitude, longitude, and elevation data. I have mean rainfall data which will be added later today.

The csv or ods sheet are here:
https://github.com/bioinfonm/pystan_musings/tree/master/beginners_excercise/new_mexico_pueblo_populations

11. Rodney Sparapani says:

How about the mcycle data set in the MASS R package?

Description:

A data frame giving a series of measurements of head acceleration
in a simulated motorcycle accident, used to test crash helmets.

Usage:

mcycle

Format:

‘times’ in milliseconds after impact.

‘accel’ in g.

Source:

Silverman, B. W. (1985) Some aspects of the spline smoothing
approach to non-parametric curve fitting. _Journal of the Royal
Statistical Society series B_ *47*, 1-52.

References:

Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
Statistics with S-PLUS._ Third Edition. Springer.

12. bjs12 says:

For a single Y vs. a single X, how about child pedestrian fatalities vs. day of the year (reminiscent of births vs. day of year, as previously discussed on this blog)?

Data from FARS. Presumably additional X variables available there too.

Seasonality plus a major spike on Halloween :(

13. Andrew [not Gelman] says:

Temperature’s effect on electricity consumption is an easy non-linear example. It can be approximated with a quadratic, but more advanced methods are more appropriate.

Some data at the link below. There are interactions with day of week and hour, which makes it a bit more interesting.

14. Greg Ashton says:

Data on driving speeds in different countries obtained using Google-maps estimated journey time between random zip codes:

This was a project I started as a grad student for fun and never got around to actually analysing – a summary of what I did manage to do can be found here

15. ojm says:

Here’s code to generate some (synthetic) nonlinear time series data via Shalizi:

http://bactra.org/reviews/fan-yao/examples.R

A classic real-world data set for nonlinear time series is Nicholson’s ‘sheep blowfly’ data. Googling gave me this link:

https://www.stat.berkeley.edu/~brill/blowfly97I.html

Wood analysed these here: https://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html

I’d be interested in seeing alternative analyses.

• ojm says:

Something that is perhaps more along the lines you’re specifically looking for is:

http://epubs.siam.org/doi/pdf/10.1137/130929230

They drop an object, measure its trajectory and fit the solution of Newton’s equations of motion with constant gravity and an empirical drag model. This solution is nonlinear.

They also use prior and posterior predictive checks etc. Good example of the potential issues of model misspecification too.

16. Brian Gawalt says:

I’m late to the party but — traffic flows! They’re a great case of “tipping point” nonlinearities.

I found a good set linked to from an intro-to-R Berkeley stats assignment I googled up:

“Data was collected by loop detectors at one particular location of eastbound Interstate
80 in Sacramento, California. There are six columns and 1740 rows in the data set. The
rows correspond to successive five minute intervals from March 14 to 20, 2003, where the
data values in a row report the flow (number of cars) and occupancy (the proportion of
time there was a car over the loop) in each of three lanes on the freeway. Lane 1 is the
leftmost lane, lane 2 is in the center, and lane 3 is the rightmost. The data set can be
found at http://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv The original data are
from the Freeway Performance Measurement System (PEMS) website (Freeway Performance
Measurement System (PEMS) website)”

17. Belay Birlie says:

O2K {nlstools}: Oxygen kinetics during 6-minute walk test data set is a nice free dataset

18. ZC says:

Health spending (using MEPS, or the RAND insurance experiment data, which I think is public)–people love to use log(y+1) as a kludge, but really should be fitting something that explicitly accounts for masses at zero…