Skip to content
 

Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models

[cat picture]

For the chapter in Regression and Other Stories that includes nonlinear regression, I’d like a couple homework problems where the kids have to construct and fit models to real data. So I need some examples. We already have the success of golf putts as a function of distance from the hole, and I’d like some others. One thing that came to mind today, because I happened to see a safety warning poster on the bus reminding people not to drive too fast, is data on speed and traffic accidents.

But I’m interested in other examples too. Just about anything interesting with data on x and y where there’s no simple linear relation, and where log and logit transformations don’t do the trick either. The outcome can be discrete or continuous, either way.

There’s gotta be lots and lots of good examples but for some reason I’m drawing a blank. So I could use your help. I need not just the examples but also the data.

Thanks! Anyone who comes up with a good suggestion will be rewarded with a free Stan sticker.

28 Comments

  1. Shosh says:

    A few things that might be fun:

    The taxi/limousine commission publishes trip data that economics love analyzing in papers; now conveniently accessible in an R package: https://cran.rstudio.com/web/packages/nyctaxi/vignettes/nyc_taxi.html

    Traffic stops (by police) data from the Stanford project on this: https://openpolicing.stanford.edu/data/

    Data from toxicity studies by the National Oceanic & Atmospheric Administration: https://www.diver.orr.noaa.gov/deepwater-horizon-nrda-data#

    Cardiovascular mortality rates by year/sex/county: http://ghdx.healthdata.org/record/united-states-cardiovascular-disease-mortality-rates-county-1980-2014

    Instances of Dengue fever by time/location in Sri Lanka: https://cran.r-project.org/web/packages/mozzie/index.html

  2. Jonathan Auerbach and Rob Trangucci contributed a very nice talk on speed limits and pedestrian safety to Stan Con 2017. Link with data, model, and Rmd below.

    Long live reproducible research!

    https://github.com/stan-dev/stancon_talks/tree/master/2017/Contributed-Talks/01_auerbach

  3. Erin Jonaitis says:

    Maybe something involving seasonality, e.g. hours of daylight, or blood levels of Vitamin D?
    (For a small amount of extra screwiness you could do sunrise/set times. You could be some junior programmer’s first exposure to the annoyance that is daylight saving time!)

  4. Jonathan (another one) says:

    The teams, score and date of every NCAA Division 1 and 3 hockey game since 2002. Suitable for fitting Bradley-Terry models, Poisson regression models for goals scored and all manner of home/road, strength and other effects. https://github.com/octonion/hockey/tree/master/ncaa/csv

  5. T says:

    Varian’s titanic survivor example here: http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf
    This example lends itself nicely to trees, but maybe you don’t want to go there. Could use it to demonstrate importance of interactions though.

  6. Hernan Bruno says:

    During the Rio Olympics, I was watching the Air Rifle 10m. It is obvious that in this and other disciplines, the winner is declared by the highest score, but this is not statistically significant if we were to do a test. Check the wikipedia page

    https://en.wikipedia.org/wiki/Shooting_at_the_2016_Summer_Olympics_%E2%80%93_Women%27s_10_metre_air_rifle#final

    The csv file you can download here, courtesy of yours truly
    https://www.dropbox.com/s/2cy484b0thx2xtu/airrifle2016.csv?dl=0

    I decided to check it. But with repeated observations, I needed to run a stan model, as each shooter has her own mean and probably her own standard deviation. Notice that there are two stan codes below. One assumes that the score (0-11) is normally distributed. The second assumes that the distance from the center of the target is distributed Rayleigh, which is the correct distribution for a distance from a center if symmetry in all directions is assumed. Stan supports the Rayleigh distribution, so you can run it. I think it is an interesting example of the Rayleigh distribution. And by the way, these shooters shot 14-20 times (notice that they don’t all shoot the same number of times because of the elimination system). We would need them to shoot 2000 times to start seeing some non-overlapping intervals here.

    What do you think? Don’t be harsh with the crappy coding, I was watching sports on my couch when I coded this. Also notice some old-fashioned Stan notation.

    library(tidyr)
    data_airrifle <- read.csv("", sep=";", header=TRUE)
    
    # Data capture a bit strange, change name of winner.
    colnames(data_airrifle)[1] <- "Thrasher"
    
    # Make it "long" and delete the NA
    data_airrifle <- gather(data_airrifle)
    data_airrifle <- data_airrifle[!is.na(data_airrifle[, 2]), ]
    
    #list of competitors
    competitor_id = as.factor(data_airrifle$key)
    
    levels(competitor_id) <- unique(data_airrifle$key)
    
    # create id from lastname
    id = as.numeric(competitor_id)
    score = data_airrifle$value
    score2 =11- data_airrifle$value
    K= length(competitors)
    N = length(id)
    
    da.ar = list(score = score, score2 = score2, id=id, N=N, K=K)
    
    stan.code = "
    data {
    	int N;
    	int K;
    	int id[N];
    	real score[N];
    	}
    parameters {
    	vector[K] gamma;
    	vector[K] sigma;
    	real mu_gamma;
      real sigma_gamma;
    }
    model{
    	gamma ~ normal(mu_gamma, sigma_gamma);
    	for(i in 1:N){
    		score[i] ~ normal(gamma[id[i]], sigma[id[i]]);
    	}
    }
    "
    library(rstan)
    out = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)
    
    # Now using the Rayleigh distribution
    
    stan.code = "
    data {
    	int N;
    	int K;
    	int id[N];
    	real score2[N];
    	}
    parameters {
    	vector[K] gamma;
    	real mu_gamma;
           real sigma_gamma;
    }
    model{
    	gamma ~ normal(mu_gamma, sigma_gamma);
    	for(i in 1:N){
    		score2[i] ~ rayleigh(gamma[id[i]]);
    	}
    }
    "
    library(rstan)
    out2 = stan(model_name = "Air Rifle Olympics", model_code = stan.code, data=da.ar , iter=1000, chains=2, verbose=FALSE)
    print(out2)
    plot(out, pars="gamma")
    
  7. Terry says:

    Income distribution is an inverted U shape with a mass point at zero. There is also a curious high-frequency oscillation suggesting that income is often rounded off to a multiple of $5,000 (bins are $2,500 wide).

    This is a good illustration of the exhortation to actually look at your data rather than blindly fit a model.

    Data are at: https://www2.census.gov/programs-surveys/cps/tables/finc-07/2017/finc07.xls. Data is for frequency of household income for incomes from zero to $100,000. Relevant data is in cells B11-B50. Be careful to use only cells B11-B50 because the bin size is different for cells B51-B54. (This change in bin size could be another example of something to watch out for.)

    Website with information about data is at: https://www.census.gov/data/tables/time-series/demo/income-poverty/cps-finc/finc-07.html

  8. Julien says:

    What about trying to measure the gravitational acceleration constant (9.81 m/s^2)? Data would consist of measures of the time for an object to drop from different heights, and the model would have to look like Newton’s law of gravitation. A fun thing would be to make the students build the dataset themselves with a clock and a tape!

  9. Ed Hagen says:

    I had students collect data with their smartphones, which are packed full of sensors, for exactly this sort of problem. There are free apps for both iPhones and Androids that make it easy to collect the raw data, which student can then email to themselves in csv format. Specifically, I had them swing their phones on the charging cable like a pendulum, and then fit the acceleration data using a damped harmonic oscillator model. But there are many other sensors too that would easily generate data with non-linear relationships:

    https://en.wikipedia.org/wiki/IPhone#Sensors

  10. awinter says:

    New Mexico pueblo populations!

    For a small pystan project I have New Mexico pueblo populations from ~1790 – 1950. I added latitude, longitude, and elevation data. I have mean rainfall data which will be added later today.

    The csv or ods sheet are here:
    https://github.com/bioinfonm/pystan_musings/tree/master/beginners_excercise/new_mexico_pueblo_populations

  11. Rodney Sparapani says:

    How about the mcycle data set in the MASS R package?

    Description:

    A data frame giving a series of measurements of head acceleration
    in a simulated motorcycle accident, used to test crash helmets.

    Usage:

    mcycle

    Format:

    ‘times’ in milliseconds after impact.

    ‘accel’ in g.

    Source:

    Silverman, B. W. (1985) Some aspects of the spline smoothing
    approach to non-parametric curve fitting. _Journal of the Royal
    Statistical Society series B_ *47*, 1-52.

    References:

    Venables, W. N. and Ripley, B. D. (1999) _Modern Applied
    Statistics with S-PLUS._ Third Edition. Springer.

  12. bjs12 says:

    For a single Y vs. a single X, how about child pedestrian fatalities vs. day of the year (reminiscent of births vs. day of year, as previously discussed on this blog)?

    Data from FARS. Presumably additional X variables available there too.

    Seasonality plus a major spike on Halloween :(

  13. Andrew [not Gelman] says:

    Temperature’s effect on electricity consumption is an easy non-linear example. It can be approximated with a quadratic, but more advanced methods are more appropriate.

    https://www.researchgate.net/figure/226527334_fig2_Fig-3-Non-linear-relationship-between-the-filtered-electricity-load-of-a-distribution

    Some data at the link below. There are interactions with day of week and hour, which makes it a bit more interesting.

    https://www.iso-ne.com/isoexpress/web/reports/load-and-demand/-/tree/zone-info

  14. Greg Ashton says:

    Data on driving speeds in different countries obtained using Google-maps estimated journey time between random zip codes:

    https://github.com/ga7g08/GoogleMapsAPI_experiment

    This was a project I started as a grad student for fun and never got around to actually analysing – a summary of what I did manage to do can be found here

    https://github.com/ga7g08/GoogleMapsAPI_experiment/blob/master/Results.md

  15. ojm says:

    Here’s code to generate some (synthetic) nonlinear time series data via Shalizi:

    http://bactra.org/reviews/fan-yao/examples.R

    A classic real-world data set for nonlinear time series is Nicholson’s ‘sheep blowfly’ data. Googling gave me this link:

    https://www.stat.berkeley.edu/~brill/blowfly97I.html

    Wood analysed these here: https://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html

    I’d be interested in seeing alternative analyses.

    • ojm says:

      Something that is perhaps more along the lines you’re specifically looking for is:

      http://epubs.siam.org/doi/pdf/10.1137/130929230

      They drop an object, measure its trajectory and fit the solution of Newton’s equations of motion with constant gravity and an empirical drag model. This solution is nonlinear.

      They also use prior and posterior predictive checks etc. Good example of the potential issues of model misspecification too.

  16. Brian Gawalt says:

    I’m late to the party but — traffic flows! They’re a great case of “tipping point” nonlinearities.

    I found a good set linked to from an intro-to-R Berkeley stats assignment I googled up:

    “Data was collected by loop detectors at one particular location of eastbound Interstate
    80 in Sacramento, California. There are six columns and 1740 rows in the data set. The
    rows correspond to successive five minute intervals from March 14 to 20, 2003, where the
    data values in a row report the flow (number of cars) and occupancy (the proportion of
    time there was a car over the loop) in each of three lanes on the freeway. Lane 1 is the
    leftmost lane, lane 2 is in the center, and lane 3 is the rightmost. The data set can be
    found at http://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv The original data are
    from the Freeway Performance Measurement System (PEMS) website (Freeway Performance
    Measurement System (PEMS) website)”

    That link is: http://www.stat.berkeley.edu/classes/s133/data/flow-occ.csv

    Inspired by: http://jliszka.github.io/2013/10/01/how-traffic-actually-works.html

  17. Belay Birlie says:

    O2K {nlstools}: Oxygen kinetics during 6-minute walk test data set is a nice free dataset

  18. ZC says:

    Health spending (using MEPS, or the RAND insurance experiment data, which I think is public)–people love to use log(y+1) as a kludge, but really should be fitting something that explicitly accounts for masses at zero…

Leave a Reply