Data-analysis assignments for BDA class?

In my Bayesian data analysis class this fall, I’m planning on doing some lecturing and class discussion, but the core of the course will be weekly data-analysis assignments where the students do applied statistics using Stan (to fit models) and R (to pre-process the data and post-process the inferences).

So, I need a bunch of examples. I’d appreciate your suggestions. Here’s what I’ve got so far:

Classic examples:

8 schools (see the Stan sketch after this list)
Arsenic in Bangladesh
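
Just to fix ideas on scale: here’s roughly what the 8 schools model looks like in Stan, in its centered form with flat priors on mu and tau (the non-centered reparameterization is a natural follow-up assignment when tau is weakly identified):

    data {
      int<lower=0> J;              // number of schools
      real y[J];                   // estimated treatment effects
      real<lower=0> sigma[J];      // standard errors of the estimates
    }
    parameters {
      real mu;                     // population mean effect
      real<lower=0> tau;           // between-school sd
      real theta[J];               // school-level effects
    }
    model {
      theta ~ normal(mu, tau);     // hierarchical model for the school effects
      y ~ normal(theta, sigma);    // measurement model
    }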

Modern classics:

World Cup
Speed dating
Hot hand
Gay rights opinions by age
The effects of early childhood intervention in Jamaica

I’m also not clear on how to set things up: Do I just throw them example after example and have them try their best, or do I start with simple one- and two-parameter models and then go from there?

One idea is to go on two parallel tracks, with open-ended real-data examples that follow no particular order, and fake-data, confidence-building examples that go through the chapters in the book.
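
For the fake-data track, the confidence-building examples could start about as small as possible: simulate draws from a normal distribution in R and fit something like this minimal sketch (the priors here are just placeholders):

    data {
      int<lower=0> N;
      vector[N] y;               // simulated outcomes
    }
    parameters {
      real mu;                   // mean of the data-generating process
      real<lower=0> sigma;       // sd of the data-generating process
    }
    model {
      mu ~ normal(0, 10);        // weakly informative placeholder priors
      sigma ~ normal(0, 5);      // half-normal, via the lower bound
      y ~ normal(mu, sigma);
    }

Students can then check that the posterior concentrates around the values they used in the simulation before moving on to anything messier.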

Anyway, any suggestions of yours would be appreciated. Thanks.

22 thoughts on “Data-analysis assignments for BDA class?”

  1. I think that it’s good to avoid fake data as much as possible.

    Are you going to use the GSS for the gay rights issue? I think there are some really interesting questions related to the cluster structure of their sample. I believe there was one year where their sample included an unusually large number of clusters in Utah, which made the whole US look more conservative. But that could be an urban legend.

    After the post the other day, the worm data sounds ripe for students to work on.

  2. You might get somewhere with some of the publicly available genomics datasets. Biologists these days are putting tons of stuff online. But you’d need some biology expertise to interpret them.

    I think the biggest thing about doing data analysis with Bayesian methods is that they are great for implementing models of causal scientific processes, but to do that you need some background on the science. So, for example, remember the Lynx dataset where someone fit an ODE and you were super impressed:

    http://statmodeling.stat.columbia.edu/2012/01/28/the-last-word-on-the-canadian-lynx-series/

    That kind of stuff is now very do-able in Stan, if you understand the underlying reasoning behind the ODE.

    Perhaps have the students write a short description of their main scientific interests, and then, with your guidance, have them design a stepwise approach to building better and better models within their area of interest.

  3. I think the radon examples from ARM are great, because they let a student work through sequentially adding complexity to the model (start with simple county-level partial pooling, then add a predictor for basement vs. first floor, then a county-level predictor for the uranium content of the soil). A rough sketch of where that sequence ends up is below.
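
    Here is one Stan version of the final step: varying county intercepts, a house-level floor indicator, and county-level log uranium as a group-level predictor (flat priors by default; variable names are mine, not from ARM):

      data {
        int<lower=1> N;                    // houses
        int<lower=1> J;                    // counties
        int<lower=1, upper=J> county[N];   // county index for each house
        vector[N] x;                       // floor of measurement (0 = basement, 1 = first floor)
        vector[N] y;                       // log radon measurement
        vector[J] u;                       // county-level log uranium
      }
      parameters {
        vector[J] a;                       // county intercepts
        real b;                            // floor effect
        real g0;                           // intercept of the county-level regression
        real g1;                           // coefficient on log uranium
        real<lower=0> sigma_a;             // between-county sd
        real<lower=0> sigma_y;             // within-county sd
      }
      model {
        a ~ normal(g0 + g1 * u, sigma_a);          // county model with a group-level predictor
        y ~ normal(a[county] + b * x, sigma_y);    // house-level model
      }

    Dropping g1 and u recovers the middle step (varying intercepts plus the floor predictor), and dropping b as well gives the simple partial-pooling starting point.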

  4. I have an example on my blog about dropping balls of paper and timing their fall. The effect of drag is significant. You can do inference on the drag coefficient via Stan’s ODE solver:

    http://models.street-artists.org/2013/04/18/dropping-the-ball-paper-experiments-in-dynamics-and-inference-part-2/

    That link describes the physics and includes a data set from actual ball drops.

    http://models.street-artists.org/2013/04/26/1719/

    That one shows an analysis I did in the pre-Stan days using a custom MCMC sampler with an R ODE solver.
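
    For anyone who wants to try it directly in Stan, here is a rough sketch using the integrate_ode_rk45 interface. The state is (distance fallen, downward speed), the drag-to-mass ratio k is the parameter of interest, and the data names and priors are placeholders rather than anything taken from the linked posts:

      functions {
        real[] fall(real t, real[] y, real[] theta, real[] x_r, int[] x_i) {
          real dydt[2];
          dydt[1] = y[2];                             // d(distance)/dt = speed
          dydt[2] = 9.81 - theta[1] * square(y[2]);   // gravity minus quadratic drag
          return dydt;
        }
      }
      data {
        int<lower=1> N;
        real<lower=0> ts[N];       // observation times (increasing, all > 0)
        real y_obs[N];             // observed distance fallen (m)
      }
      transformed data {
        real x_r[0];
        int x_i[0];
        real y0[2];
        y0[1] = 0;                 // released from rest at the origin
        y0[2] = 0;
      }
      parameters {
        real<lower=0> k;           // drag coefficient divided by mass (1/m)
        real<lower=0> sigma;       // measurement noise sd (m)
      }
      model {
        real theta[1];
        real y_hat[N, 2];
        theta[1] = k;
        y_hat = integrate_ode_rk45(fall, y0, 0.0, ts, theta, x_r, x_i);
        k ~ normal(0, 1);          // placeholder priors
        sigma ~ normal(0, 0.5);
        for (n in 1:N)
          y_obs[n] ~ normal(y_hat[n, 1], sigma);
      }

    Obviously the priors and the noise model would need to be adapted to the actual measurements.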

  5. I really like the 2-track idea. I think that learning the tools themselves – the commands/coding, data manipulation, organization of analysis files, etc. – is easier when there aren’t a bunch of tricky data issues and you’ve built the data yourself (so you know the structure and can think through how the data are set up). And I like it when those exercises start off simple, then something is introduced that messes up the analysis, and the students have to see how some fix or correction works.

    So, for instance, maybe they start with a fake simple random sample with iid errors; then you add some heteroskedasticity, then some sort of selection into treatment or omitted variables, and they run the “wrong” model and the “right” model and see how it affects their results. I distinctly remember a homework assignment from my second year of grad school where we made simple data, ran OLS, and looked at the standard error. Then we duplicated each observation like 20 times, ran OLS again, and compared standard errors, and then we used cluster-robust estimators, which recovered the original standard errors. It made an impression and made a point that I haven’t forgotten in 10 years. I forget most things in 10 minutes.
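
    For the heteroskedasticity step, one Bayesian version of the “right” model is to let the error sd vary with the predictor, something like the sketch below (names and priors are placeholders; the “wrong” model is the same thing with a constant sigma; the duplication/cluster-robust part of the exercise is more naturally done with plain OLS in R):

      data {
        int<lower=0> N;
        vector[N] x;
        vector[N] y;
      }
      parameters {
        real a;                    // intercept
        real b;                    // slope
        real c0;                   // log sd at x = 0
        real c1;                   // change in log sd per unit of x
      }
      model {
        a ~ normal(0, 10);                         // placeholder priors
        b ~ normal(0, 10);
        c0 ~ normal(0, 5);
        c1 ~ normal(0, 5);
        y ~ normal(a + b * x, exp(c0 + c1 * x));   // heteroskedastic errors
      }

    Fitting both versions to the same simulated data makes the consequences of the misspecification easy to see, especially in the predictive intervals.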

    Also, a couple of new-ish datasets/issues from the development-econ world that might be interesting, along with suggested methodological lessons:

    Multi-level modeling: Six Randomized Evaluations of Microcredit
    https://www.aeaweb.org/articles.php?doi=10.1257/app.20140287

    Meta-analysis: Global Warming and Violence:
    http://emiguel.econ.berkeley.edu/research/quantifying-the-influence-of-climate-on-human-conflict

    Competing Explanations: The Great Indian Child Height Controversy:

    Kickoff (nothing to see here folks): http://www.epw.in/special-articles/does-india-really-suffer-worse-child-malnutrition-sub-saharan-africa.html

    Not so fast 1 (open defecation): http://www.susana.org/en/resources/library/details/1795

    Not so fast 2 (gender discrimination): http://www.nber.org/papers/w21036

    • jrc:

      Off topic (when am I not?), but the strong conclusions from the Global Warming and Violence meta-analysis are simply not credible to me, given my grasp of meta-analysis of non-randomized studies (e.g., a confirmation-bias version of forking paths) or even of published literatures of randomized studies (or is that the reason it was suggested?).

      (This paper at least starts to raise some concerns – Buhaug H, et al., “One effect to rule them all? A comment on climate and conflict” – but there would likely be a lot more issues; see Greenland S, O’Rourke K, “Meta-Analysis,” p. 652 in Modern Epidemiology, 3rd ed., edited by Rothman KJ, Greenland S, Lash T. Lippincott Williams and Wilkins; 2008.)

      • I’m pretty sure anyone who sometimes has a “?” in their name is allowed to be off-topic as much as they want.

        My knowledge of meta-analytic techniques is marginally above 0, but your critiques sound totally reasonable to someone who does (err…tries to do) non-experimental causal inference. It just struck me as a study with a lot of organized and available data and an interesting, if never perfectly answerable, research question. Which I think makes good coursework, if used as a counterpoint to situations where the data are actually known to be well-behaved: the ideal versus the messy, messy reality.

  6. Seems hugely biased towards Social Sci examples.

    Might want some Physics / Chemistry / Engineering applications to appease the hard science-ey students (if any).

  7. Will this course cover Bayesian inference for stochastic processes often encountered in ecological or evolutionary modeling, such as the general birth-death process or the Wright-Fisher process? Maybe Andrew is less familiar with the latter, but the former (the birth-death process) is also frequently used in economic and financial modeling.

  8. Begin with guided simple examples (1-2 parameter models, as mentioned above) to get the mechanics down. Then progress to more open-ended and more complex models. I wonder if making it a bit competitive might be a benefit, with discussion about the process that each individual or team went through to arrive at their model. Fake data is okay to make a point and get the hang of the process; in some ways, the data-generation process itself is worth getting familiar with. Depending on the composition of the class, it may be good to throw in some presentations, focused not just on the technical portion but also on making the model results digestible for a less technical audience.

  9. Prof. Gelman, why don’t you do a MOOC? As far as I know there are no Bayesian analysis online courses. Yours could be the only and best source for a long time.

  10. Data from this paper were made available by Andrew Vickers. It’s an interesting dataset that allows various techniques to be used to verify or question the hypotheses set out in the paper.

    The original paper is: Andrew J Vickers, Rebecca W Rees, Catherine E Zollman, Rob McCarney, Claire Smith, Nadia Ellis, Peter Fisher, and Robbert Van Haselen, “Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial,” BMJ, Vol. 328, No. 7442 (27 March 2004), pp. 744-747.

    The dataset is here:
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1489946/#S1

  11. On the choice between “two parallel tracks”: harking back to my own grad-school experience, the classes that strayed furthest from the assigned material and into the professor’s typically narrowly defined area of expertise were the least rewarding.

    I really like the notion of giving them exercises that are open-ended, with no “correct” solution, as this is closer to what really goes on in any practitioner’s world.

    So, my recommendation would be to give them open-ended examples, based on both real and fake data, that adhere closely to the chapters in the book.
