Milan Griffes writes:

I work at GiveWell, which you’ve kindly written about in the past.

I wanted to ask for your current thoughts on the best way to learn statistics outside of formal education, since it’s been a few years since your last post on this topic. Do you have any advice for someone with some math background (e.g., linear algebra, analysis) who is just starting to learn statistics? Recommendations of good online courses, books, lecture notes, or other pedagogical devices?

I have two goals for learning statistics:

1. Learning about a field that seems interesting and useful (from my outside perspective).

2. Being able to understand and form opinions about the statistical analyses I encounter in my work at GiveWell (mostly public health and development economics literature, such as our review of the evidence on deworming).

My current plan is to work through Dr. Blitzstein’s online course on probability and read a couple of intro stats books, to give myself some theoretical grounding. Then I think I’d take an online course focused specifically on social science regressions, read some books on the topic, and try to independently analyze some of the sources GiveWell has already looked at. What do you think of this plan?

Thanks very much for taking the time to engage with this. Any thoughts (or your readers’) would be much appreciated.

I can’t say that I have any great advice but maybe the commenters will have some good suggestions. I do like Joe Blitzstein so his course seems like a good start. Some other quick thoughts:

– I recommend my own books, for reals. They’re readable and have lots of applied examples, and they should be pretty easy to follow if you have a math background.

– Best way to learn is to work on a problem you care about, either in your work or in some other topic you care about (sports, politics, movies, whatever). Here I’d suggest jumping in and fitting a model in Stan and making graphs in R, following the model of our World Cup analysis (first try here, follow-up here). I don’t mean you have to fit that particular model, just that you can jump in and go.

I second the Joe Blitzstein book, “Introduction to Probability”, which has a corresponding video class: http://projects.iq.harvard.edu/stat110/home

I’d also suggest “Statistical Rethinking” by Richard McElreath, as I am deeply in love with the style of the book. It also has a corresponding video course:

http://xcelab.net/rm/statistical-rethinking/

Second “Statistical Rethinking” by Richard McElreath – very thoughtful metaphors and a guided pathway to Stan.

(Also similar/in line with what I was heading for here https://phaneron0.wordpress.com/2012/11/23/two-stage-quincunx-2/ )

Especially if you’re interested in the social sciences, I might recommend “Mostly Harmless Econometrics” by Angrist and Pischke. It’s relatively approachable, particularly with a math background.

It’s solidly grounded from a theory perspective in almost everything you’d encounter at GiveWell: regression, IV, fixed effects, difference-in-differences, and panels.

G

My (generally positive) review of Mostly Harmless Econometrics is here.

This book is not good! It reduces statistics to gimmicks. Seriously, all the hype about “clever instruments” without care for everything else is doing terrible damage to economists.

Speaking as a non-economist, I liked MHE; I gather from the reading I’ve done that careless/sloppy/overhyped concern with finding clean (clever) instruments is a major meta-issue in econometrics, but I didn’t get the feeling that Angrist and Pischke themselves were going overboard. To support this, Noam Scheiber says in his piece “How Freakonomics Is Ruining the Dismal Science”:

> The early practitioners of this approach–Angrist, Krueger, Card–had well-earned reputations as crafty researchers. But, by and large, all three men used their creativity to chip away at important questions. It was only in the late ’90s that the signs of overreach became apparent.

To stretch an analogy a bit, blaming Angrist & Pischke for abuse of instrumental variables feels like blaming Fisher/Neyman/Pearson for abuse of p-values.

I can strongly recommend MIT’s 6.041 on edX – it covers all the probability basics and then actually gets to Bayesian statistics before introducing frequentist approaches! https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1

I self-taught before I learned of Andrew’s books. Johnson and Wichern’s “Applied Multivariate Statistical Analysis” was what I used. It was useful enough that I could use it to work out answers to the problems I needed to solve. That said, having self-taught on J&W, I don’t go back to it too often. I do pick up BDA3 periodically, though. Having transitioned from R&D to management, for the most part I don’t pick up reference texts nearly as often as I used to, but when I do I go to BDA3 or Gelman & Hill before I go to J&W.

Given a couple of hours for the wheels to turn, I remembered a few other statistics-related texts that were useful for self-study. Much of my work involved making classification decisions. Duda, Hart, and Stork’s “Pattern Classification” is excellent. It’s a good intro text if you’ve got a solid advanced-undergrad math background. I still pick it up from time to time. McLachlan and Peel’s “Finite Mixture Models” and McLachlan and Krishnan’s “The EM Algorithm and Extensions” were also useful. (It was about 10 years ago that I was digging into those texts.)

Last one: Clive Rodgers, “Inverse Methods for Atmospheric Sounding”. His application is atmospheric remote sensing but the methods are generally applicable. From the publisher’s blurb on Amazon: “Inverse theory is treated in depth from an estimation-theory point of view, but practical questions are also emphasized, for example designing observing systems to obtain the maximum quantity of information, efficient numerical implementation of algorithms for processing of large quantities of data, error analysis and approaches to the validation of the resulting retrievals…” Lots of fun with covariance matrices and regularization methods.

Identify an interest and find materials that use/teach stats for that. There is no better way to learn than through a hobby/interest. Like baseball. There are fun books like Teaching Statistics Using Baseball by Jim Albert. That one in particular is terrific.

A great book to start with that teaches critical thinking in the context of statistics is “Statistics” by Freedman, Pisani and Purves. After that, go to Freedman’s “Statistical Models: Theory and Practice”.

> Best way to learn is to work on a problem you care about, either in your work or in some other topic you care about…

I couldn’t agree more. Nothing’s a better motivator than having a problem you need to solve.

> Here I’d suggest jumping in and fitting a model in Stan and making graphs in R…

IDL was the entrenched language at work so that’s what I worked in. It’s C-like and well-suited for image processing, but basic graphics were awful (better now, 15+ years later). They were bad enough that I used to save out IDL results and load them into Igor Pro when I needed to make graphs. I’m now 50/50 IDL/MATLAB. As IDL was at my previous job, MATLAB is currently the lingua franca at work. I liked Igor Pro better for graphics, but as an overall analysis-plus-graphics package I do like MATLAB. (I even coughed up $150 for a home license and $50 each for a couple of Toolboxes.) That said, I like what I’ve seen of R and, if I had the opportunity to start from scratch and were unconstrained by work, I’d probably give it a shot. (FWIW, I’m a “late adopter” of new technology. If not pushed along by external forces I’d probably still program in FORTRAN and create graphs on paper. ;-)

About the goal to “understand and form opinions about the statistical analyses I encounter”:

Reading up on past statistical frauds & scandals is a great way. Or the sort of fishing & forking paths that Andrew’s blog covers. Or retracted papers.

Getting an intimate knowledge of how people abuse the tools is a great way to form better opinions of other analyses you must review.

For someone dealing with public health, I would also recommend some reading on research and experimental design. Unfortunately, I don’t have any books to recommend here. My own grad school experience reading Cook and Campbell was mildly traumatic. Fortunately the prof made the material accessible.

As someone without a strong math background, I found Andy Field’s Discovering Statistics Using R to be a life-saver. He explains concepts in an accessible way (with pretty humorous examples) and has posted a lot of supplementary material on both his personal site and on the Sage site.

The normal path into statistics, through frequentist concepts, is a minefield of ad hoc weirdness. Some of the stuff is really kind of cool (e.g., kernel density estimation, bootstrapping, clustering), but mostly it’s a mess.

On the other hand, if you start out thinking about Bayesian ideas (a distribution measures how plausible a particular value of an observed or unknown quantity is), then building models is about asking yourself “what do I know about what can happen?” Once you get past that initial concept, “doing statistics” is mostly about encoding what you already know. One of the best things about Stan is that it lets you encode pretty much anything (though discrete parameters are its downfall, a bit). That freedom really alters how you view statistics. Before it, you spend your time shoehorning your problem into something your software can solve. After it, you spend your time just describing your problem and hoping you have enough computing power.
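To make “encoding what you know” concrete, here is a toy grid-approximation posterior in Python. This is just a sketch with made-up numbers (7 successes in 10 trials, a flat prior); a real analysis would hand the model to Stan instead of computing on a grid.

```python
import numpy as np

# Toy grid-approximation posterior for a binomial success probability.
# (Illustrative only; a real analysis would hand this to Stan.)
theta = np.linspace(0, 1, 1001)        # candidate values of the unknown
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)            # "what do I know?": here, very little
k, n = 7, 10                           # observed: 7 successes in 10 trials
likelihood = theta**k * (1 - theta)**(n - k)

posterior = prior * likelihood
posterior /= posterior.sum() * dtheta  # normalize to a proper density

post_mean = np.sum(theta * posterior) * dtheta
print(round(post_mean, 3))             # close to (k + 1) / (n + 2) = 2/3
```

The point is that the prior and likelihood lines are exactly the “what do I know about what can happen?” statements; everything after them is bookkeeping.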

Fundamentally, Bayesian statistics is just accounting for uncertainty within mathematical modeling, and it can be done in ANY mathematical model, so you go back to thinking about the actual stuff going on.

“Social science regressions” are typically an attempt to describe regularity in data via linear combinations of simple functions (typically just linear functions): y = y0 + a*x1 + b*x2 + c*x3 + d*f(x4), etc. Sometimes that’s a reasonable model, but I think the dominance of this type of model in social science is more down to this kind of thing being easily solvable than to its fitting the situation. The typical example would be the coal-burning-in-China example, now famous on this blog. The big problem there was the failure to really model the situation in anything like realistic detail (i.e., spatial variation through time). What was needed was a thought process about how the history of the policy produced differences across the river through time. What we got was a 1-D basis expansion with a discontinuous function thrown in, and an estimate of its coefficient.
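For concreteness, the linear-combination form above is just ordinary least squares, which you can fit in a few lines. This is a minimal sketch on synthetic data with two predictors (all numbers invented for illustration; the ease of this fit is exactly the “easily solvable” point):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a known linear relationship plus a little noise.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column: y ~ y0 + a*x1 + b*x2.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # roughly [1.5, 2.0, -0.5]
```

Whether this model form matches the actual data-generating process is, of course, the whole question.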

So, anyway, I suggest reading about mathematical modeling in general. That’s going to include at least calculus, ordinary differential equations, representation of functions in basis expansions (Fourier, etc.), scaling and dimensional analysis, maybe something related to agent-based models, dimension reduction… Maybe ecology would be a good place to start. My impression is that it’s an area where people are doing real mathematical modeling based on mechanistic ideas of what drives changes in ecosystems, and adding in uncertainty through Bayesian calculations is relatively accepted. Most of the social sciences could probably be described as “the ecosystem of humans,” so I think the analogy is a good one.

Of course, if what you mainly want to do is read other people’s papers with regressions and understand what they did (as opposed to say having an opinion about what they SHOULD have done)… Then my advice may be overkill.

I swear, I wrote this at about 4 am, then went back to bed and dreamed that economists from UC Berkeley were conspiring with the CIA to kill me in a “The Fugitive” style movie. So, anyway, take it all with a grain of salt, and don’t write blog posts at 4am.

Could you take the data from the China-coal-burning paper & actually show the sort of realistic model *you* think ought to have been fit?

I suspect not, because I think it really requires a more in-depth study, getting additional data, in order to do a good job. Even if they do have sufficient data to deal with the situation in a more realistic way, the time it would take is prohibitive. Presumably they are the ones who got research grants (or at least university salaries) to study this stuff. If it were just a matter of spending an afternoon or something, then yeah, I’d be tempted, but I suspect it’s more a matter of wading in hip deep and spending a month or two looking at the available data, breaking it down, studying different models, looking for additional data to supplement it, etc.

Some of the issues that I think they should have addressed are already discussed (by me and others) in comments at the previous blog entry (linked above). So, someone who wanted to use that problem as a project could start to wade in and look at the issues and work on it.

To reiterate some of the issues:

1) The policy went into effect in 1950 and ran to 1980. You’d expect to see a transient response at the initiation of the policy, which then equilibrates over a period of several years. By the 2000s, self-selection into living location based on preferences that include smogginess would be expected to be fully complete… The policy is discontinuous in TIME as well as space, but is analyzed at a single far-future point in time (relative to the policy change).

2) The 1-D nature of their model makes no sense for a large 2-D area. Weather patterns will be important for exposure, not just “how far north of the river are you.”

3) The policy provided economic benefits to the north, and those may well persist now as higher levels of development, such as density of hospitals, average education of the population, etc. The more developed areas may attract a different population. It’s been 60 years since this policy was put in place. You’d want to model the effect of the economic benefits on development, and that would include migration for both health and economic reasons.

So, taking an “ecological” point of view. You’ve got a region where some resources are being provided, and you’ve got agents that can move around, and you’ve got a pollutant being generated in the same region as the resources, and you’ve got an extended timeframe of exposure to the pollutant, and you’ve got an uncertain “damage” response to the pollution, and you’ve got agents that respond to that damage via changes in their behavior. That all sounds like a dynamic time-varying process that requires a dynamic time-varying model.

I have found the Sage Publications “Quantitative Applications in the Social Sciences” series incredibly useful. There are loads of them, they are all quite short, and you can usually get them on Amazon. The only thing to watch out for is that some of them are quite old.

https://uk.sagepub.com/en-gb/eur/series/Series486

For anyone in the UK, Sheffield University offers an MSc in statistics by distance learning, which I have been looking at.

I came into this having been a math/comp sci major as an undergrad who did topology and abstract algebra, but wasn’t very good at analysis and never took a matrix class (just linear algebra, as in Galois theory, not determinants and eigenvalues). I’d also been doing applied natural language processing, so I’d done maximum likelihood for conditional random fields and HMMs and things like that, but I didn’t really get the probability theory behind it all.

The critical steps for me were learning measure theory and what a random variable was and then learning simulation and MC/MCMC methods. Without measure theory, it was hard to follow all the notation, which is notoriously vague in statistics (particularly expectation notation and Bayesian overloading of notation for random variables and values); the bottleneck is that you need to have done some analysis and algebra (ideally topology) to really get sample spaces and events.
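For what it’s worth, in the finite case “a random variable is a function on a sample space” fits in a few lines of code (a toy sketch of my own, two fair dice; the payoff of the full measure-theoretic treatment is the continuous and infinite cases, where this enumeration is impossible):

```python
import itertools
from fractions import Fraction

# Sample space: all ordered outcomes of rolling two fair dice.
sample_space = list(itertools.product(range(1, 7), repeat=2))  # 36 outcomes
prob = Fraction(1, len(sample_space))  # uniform probability "measure"

def X(outcome):
    """The random variable: a function from outcomes to numbers (the total)."""
    return outcome[0] + outcome[1]

# Expectation: E[X] = sum over outcomes omega of X(omega) * P(omega).
EX = sum(X(omega) * prob for omega in sample_space)
print(EX)  # 7
```

Everything notationally slippery about E[X] is explicit here: the sample space, the measure, and the function being averaged.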

After the basic probability theory, it was easy to learn Monte Carlo methods and open up the world of applied Bayesian modeling. For that, I found BUGS invaluable because it reduced to code rather than the usual squirrelly narrative in a stats paper. I really liked Gelman and Hill’s regression book after that for the same reason and for all the insight into practical modeling, though I found all the point estimation stuff in the first half confusing (where they use lm/glm and lmer/glmer).

As a very first book, I love the intro to Bulmer’s Principles of Statistics. It then veers off into frequentist estimation and hypothesis testing, but as far as that goes, it’s the clearest explanation I’ve seen. It’s the right size, too, not one of these doorstops used for intro stats classes.

My favorite intro to probability theory is in a surprising place: the appendix to Anderson and Moore’s Optimal Filtering. It’s just a very tight 10 or 15 pages of definitions. It probably wouldn’t work if you don’t know a bit of topology and aren’t used to sequences of definitions, math-book style. I just like that it’s so concise and properly defines everything in exactly the sequence you need.

I was driven into this by wanting to understand multiple-rater models for data annotation problems in natural language processing. I knew I needed to use hierarchical modeling. Here I was lucky and knew Andrew, so he let me hang out in his and Jennifer Hill’s multiple imputation reading group. Nothing like hanging out with experts to tidy up lots of little misunderstandings and learn to properly talk the talk. They helped me rediscover Dawid and Skene’s (1979) model (as Andrew says, everything you come up with was discovered by psychometricians decades ago), but that meant I was on the right track in thinking about models and modeling; I never could get the Bayesian version published in an NLP venue (other than as an example in the Stan manual chapter on latent discrete parameters).

Seems obvious that one’s background and interests are going to have a big effect here.

Was wondering if you think dealing with non-finite sets (measure theory) is necessary or just convenient?

I’m interested in this question too, as I have a particular viewpoint of my own, but I would like to hear others’. In particular, perhaps Bob mainly didn’t have experience with “calculus type” mathematics, and the real analysis / measure theory stuff helped him get that experience? But if you’d taken a bunch of calculus, ODEs, and so forth at an undergrad level, perhaps the formalism of measure theory would be a different story.

“The critical steps for me were learning measure theory and what a random variable was…”

Random variable, sure. Maybe even its definition as a function. But measure theory? Way overkill.

Two self-paced, free courses from Stanford University:

– Probability and Statistics

http://online.stanford.edu/course/probability-and-statistics-self-paced

– Statistical Reasoning

http://online.stanford.edu/course/statistical-reasoning-self-paced

From a Bayesian perspective, but also for general statistical thinking, I really enjoyed the references covered in this effort:

“The paper’s goal is not so much to teach readers how to actually perform Bayesian data analysis — there are other papers in the special issue for that — but to facilitate readers in their quest to understand basic Bayesian concepts.”

http://alexanderetz.com/2016/02/07/understanding-bayes-how-to-become-a-bayesian-in-eight-easy-steps/

I enjoyed Pawitan’s “In All Likelihood” for its ability to provide a big picture overview of statistics without getting too bogged down in advanced mathematics.