Changing everything at once: Student-centered Learning, computerized practice exercises, evaluation of student progress, and a modern syllabus to create a completely new introductory statistics course

Andrew Gelman, Department of Statistics, Columbia University

It should be possible to improve the much-despised introductory statistics course in several ways: (1) altering the classroom experience toward active learning, (2) using adaptive software to drill students with questions at their level, repeating until students attain proficiency in key skills, and (3) using standardized pre-tests and post-tests, both for measuring individual students’ progress and for comparing the effectiveness of different instructors and different teaching strategies. All these ideas are well established in the education literature but do not seem to be part of the usual statistics class. We would like to implement all these changes in the context of (4) a restructuring of the course content, replacing hypothesis testing, p-values, and the notorious “sampling distribution of the sample mean” with ideas closer to what we see as good statistical practice. We will discuss our struggles in this endeavor. This work is joint with Eric Loken.

The talk’s not actually happening until 16 May. But I thought, this time why not post something five months early instead of three months late (as I’m actually writing this in mid-Sept).

Anyway, if you happen to be at this conference (which, as the name suggests, will be entirely online), I hope you like my talk. I’d link to the conference webpage but it’s so horribly ugly (ironic given that it’s for an electronic conference) that I’ll spare you.

“a restructuring of the course content, replacing hypothesis testing, p-values, and the notorious ‘sampling distribution of the sample mean’”

People hate introductory statistics with a white-hot burning passion because most of them sense it’s complete horse manure. They’re browbeaten into silence by all the math types saying it’s the greatest thing since bacon-wrapped shrimp, but their mind and body still balk at accepting it. Those 99% of students who regurgitate stat 101 back up like it’s six-month-old milk aren’t the weird ones who need correcting. The weirdos are the 1% who think it all makes perfect sense and go on to become statisticians.

Subjects don’t get better because of classroom gimmicks. Calculus didn’t go from being an esoteric subject that only a dozen people mastered to being taught to millions of freshmen because of classroom gimmicks. It happened because people gained a far deeper insight into the subject and then wrote textbooks which reflected that deeper/simpler understanding. A good example is Euler, whose calculus textbook was the standard for a century or more.

So if you want to improve the introductory stat course, then spend 99.9999% of your effort on setting the foundations of statistics right. Introductory statistics isn’t awful because teachers lack gimmicks, and statistics in general isn’t hard because it’s a hard subject. Both are true because statisticians have monumentally screwed up the subject. Statisticians massively failed at their job. It needs to be fixed, and all the gimmicks in the world won’t affect diddly squat otherwise.

And if that’s asking too much of the statistics community (it almost certainly is), then there needs to be a Bayes-only option for statistics majors and minors, one that starts off on day 1 with Bayesian statistics and doesn’t look back. Let Frequentism die on the vine.

“on setting the foundations of statistics right”.

i.e. Teach mostly Jaynes?

I’d say on the first day of the course, hand out a copy of Cox’s original paper. Mention that it might have some technical kinks but that they can be fixed for essentially all cases we care about (and hand out a copy of a more recent review of these issues for extra credit). Cox’s paper is totally and utterly readable and understandable by undergrads.

http://scitation.aip.org/content/aapt/journal/ajp/14/1/10.1119/1.1990764

(unfortunately, behind a hefty paywall, but university students would get it via their uni’s library/subscription)

Then, proceed with probability as a measure of relative plausibility. Teach least-squares linear regression as a maximum a posteriori estimate under a normal error model with an improper flat prior. Work a LOT on data visualization, graphing, different kinds of charts, and their usefulness. Then move on to topics like regression on bounded variables (logistic etc.) and transformations of data.
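For anyone who hasn’t seen it, the least-squares/MAP equivalence is a two-line derivation (a sketch of the standard argument, not anything specific to this proposal):

```latex
% Model: y_i = a + b x_i + \epsilon_i, with \epsilon_i \sim N(0, \sigma^2) iid,
% and an improper flat prior p(a, b) \propto 1. Then
p(a, b \mid y) \propto p(y \mid a, b)\, p(a, b)
  \propto \prod_{i=1}^{n} \exp\!\left( -\frac{(y_i - a - b x_i)^2}{2\sigma^2} \right),
% so maximizing the posterior is the same as minimizing
\sum_{i=1}^{n} (y_i - a - b x_i)^2,
% i.e., the MAP estimate under this model is exactly the least-squares fit.
```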

Move on from point estimates to a couple of simple Stan models for actual laboratory experiments carried out by the students (a combo of surveys of the students, and some physics-y measurement apparatus type stuff).

Finish with constructing plausibility measures from repeatable experiments using histograms and fitting distributions to data, and then a final project where you simultaneously estimate an error distribution for a non-normal measurement error model together with a nonlinear regression. Include the topic of checking how well your model fits the data.

It sounds like a plausible way to get started at least.

You’d need to include maximum entropy concepts, so when you teach linear regression with normal errors, discuss the IID normal error model as being a maximum entropy solution to the choice of distribution over the errors, both the IID and the normal parts. Also, when fitting distributions to data, give examples where the family can be chosen by maximum entropy considerations (for example, exponential/gamma for the sum of n positive values where the expected value for each item is known).
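Stated precisely, the two maximum-entropy facts invoked above are standard results:

```latex
% Among densities p on (-\infty, \infty) with fixed mean \mu and variance \sigma^2,
% the entropy H(p) = -\int p(x) \log p(x)\, dx is maximized by the normal density
p^{*}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
% while among densities on (0, \infty) with fixed mean \mu it is maximized by
% the exponential density
p^{*}(x) = \frac{1}{\mu}\, e^{-x/\mu}.
```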

Though, if you asked what I really think, rather than how I would teach an intro-stats course: I think there shouldn’t BE an intro stats course. There should be more or less two INTENSIVE LAB courses, one for physical sciences, and one for social sciences.

Within those lab courses, the concepts needed to evaluate data and build models should be taught using Cox/Bayesian reasoning. This will motivate the concepts a lot more. Of course, this would be a really intensive course. Probably 4-6 credits, meeting every day: MWF for a combined lecture + computer lab where physical and data-analytic concepts are taught and lab data is analyzed, and T/Th in a physical lab doing experiments for 4 hours, or for social sci designing and carrying out surveys, or exploring widely available public datasets and coming up with questions and associated analysis methods.

I suspect it’s really hard to organize such a cross-disciplinary course within existing university organizations, but perhaps small colleges with less cross-departmental politics would do a better job.

Daniel: What you describe sounds lovely, but I teach both undergraduate and graduate courses in statistics (not to statistics majors or statistics PhD students), and what you describe is unfortunately completely unrealistic. Most of my undergraduates have trouble calculating the mean of three integers in their head, let alone dealing with what you are describing. I would wager that most instructors of undergraduate courses at public universities encounter something similar, as suggested by the fact that my graduate students would also struggle mightily with the material that you outline.

Numeracy among the general college population is very low.

Well, I do come from an engineering + math background; most of the engineering, physics, and chem students probably CAN do this level of stuff, but my experience as a TA suggests that many of them wouldn’t want to and would complain loudly.

I agree it seems unrealistic in today’s environment, but then, it’s an ideal case; no point in setting the ideal lower just because you can’t achieve it :-)

@mark I sometimes say “not only do I do statistics without calculus, I do statistics without algebra!” But for those classes the GAISE framework, which thankfully talks about levels A, B, and C, is great. Daniel’s course would start where students have that ABC knowledge, which most undergraduates do not.

As an alternative data point, I’m setting up to run a lab experiment with 5th graders in which we’ll do a Bayesian regression by hand to determine the weight per length of soda straws (they’ll be cut to various lengths, and each measurement will be polluted by a small number of extra straw fragments added to the weigh boat). We’ll get posterior samples for the slope of the line using a very simplistic model.

I’ll blog it. I suspect the 5th graders will have no problem with it; whether they understand what they’re doing or not will be interesting to evaluate afterwards. The only real mathematical requirements are addition, subtraction, multiplication, and division. I’m going to try to arrive at the model using a calculus-type motivation (if we know the weight of a small slice of straw, how would we calculate the weight of a long piece of straw? How does the accordion portion of the straw change the model? etc.).

It’s an experiment; it’s entirely possible it will fall flat, but I hope not.

That’s fantastic … really, I think there has been a big mistake in thinking that regression should come late. What I’d do is plot it all first, though, and let students freehand or with a ruler draw what they think is a good line. Looking forward to seeing that post.

The theory is: develop the conceptual model first. If you take a thin slice of length 1mm out of the straw, no matter where you take it, it will weigh essentially the same amount. If you take a long straw and cut it into 1mm slices, it will not change weight. The implication is that the weight of a straw of length L is the sum of the weights of the 1mm slices, and the number of slices is about L/1mm… integral calculus using “infinitesimal numbers”. We’ll mention the effect of the “accordion part” on the model, but won’t actually work with the accordion part.

The weight per length defines a line, the line must logically go exactly through (0,0).

The measurements always have “something extra” in the weigh boat, so the measurement error is always a positive number, and the line can never go above any data point; instead, all the data points are above the actual line. And because it’s always 5 pieces polluting the weight measurement, each data point should be about the same distance above the line, or at least within a narrow range determined by a simple uniform prior over the lengths of the straw fragments polluting the weight measurements.

From that we’ll graph the data, draw candidate lines, calculate individual errors, and accept or reject each line based on the criteria we have for the errors, a la the ABC method. At the end we’ll have 5 or 10 candidate slopes from the posterior of this very simple Bayesian model (uniform prior, uniform measurement-error likelihood).
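Since the accept/reject scheme can be hard to visualize, here’s a minimal sketch in code of the procedure described above. Every number in it (the true slope, the straw lengths, the 5-25 mm fragment range) is made up for illustration; the real class would use measured lengths and weights:

```python
import random

random.seed(1)

TRUE_SLOPE = 0.04                      # grams per mm of straw (hypothetical)
lengths = [50, 80, 110, 140, 170]      # cut straw lengths, in mm

# Each weighing is polluted by 5 extra fragments of 5-25 mm each.
def pollution():
    return TRUE_SLOPE * sum(random.uniform(5, 25) for _ in range(5))

weights = [TRUE_SLOPE * L + pollution() for L in lengths]

# ABC-style accept/reject: draw a candidate slope from a uniform prior and
# keep it only if every residual (weight above the candidate's line through
# the origin) is consistent with 5 fragments of 5-25 mm at that slope,
# i.e. lies between 5*5*slope and 5*25*slope.
accepted = []
tries = 0
while len(accepted) < 10 and tries < 100_000:
    tries += 1
    slope = random.uniform(0.0, 0.2)   # uniform prior on the slope
    residuals = [w - slope * L for w, L in zip(weights, lengths)]
    if all(25 * slope <= r <= 125 * slope for r in residuals):
        accepted.append(slope)         # a draw from the approximate posterior

print(accepted)                        # 10 posterior draws for the slope
```

Each accepted slope is a posterior draw because a candidate survives exactly when it could have produced the observed residuals under the uniform fragment model; the same accept/reject logic works by hand with a ruler and a calculator.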

I want to move on from that to a quick back-of-the-envelope or order-of-magnitude calculation for the total mass of soda straws used by kids in the US for birthday parties right at the end.

Isn’t this a bit of a stretch for 5th graders? Is this like the 5-th Grade US Olympiad? :)

Rahul: I think it’s a mistake to underestimate kids. It’s not like they’ll be graded on this, so it’s purely an opportunity for them to think about and try to understand how it works. In truth though, it’s not like anyone looks at 5th grade grades either, so even if they were graded on it… I just don’t see how they’re well served by not having opportunities to explore beyond a little walled garden of learning.

My 4 year old got so far as to do all the measurements and plot the points on a graph, so I really don’t see that it’s going to be a problem for a class full of 5th graders. They’ll be using a calculator for the actual number crunching.

@Daniel are you going to cut the straws? I think with the straw cutting there’s an opportunity to think about measurement error especially if you let the kids cut straws to a specified number of inches or centimeters themselves.

@Elin, yes, we’ll cut the straws up to a variety of lengths. The kids will measure the straws, and will have an opportunity to think about measurement error there.

I’ll have a bunch of small pieces, between say 5 and 25 mm and will select 5 at random to put in the weigh boat with their carefully measured cut length.

This has got to be the great Bayesian troll: https://twitter.com/BayesLaplace. Though I can’t say I disagree, I marvel at the energy to so categorically attack frequentist statistics at every opportunity. It’s a shame statistics didn’t first take the moniker “game theory.” It could have worked and made the subject more attractive to many. In that alternate universe, Moneyball also hits theaters before A Beautiful Mind.

I’m not the one who trolled you.

Your introductory stat course was originally formulated by fanatical Frequentists who dismissed Bayesians based on philosophical prejudices. On that basis, they first created and then indoctrinated the material in introductory stat courses, and enforced it in applications (often through underhanded methods that should have gotten them fired).

There were a mass of theoretical problems with their methods. These were pretty well known 60+ years ago to anyone who cared to look, and have been rediscovered every decade since by each new crop of thoughtful students. Now, in 2015, it’s obvious to everyone that they’re a massive practical failure as well, one that has done unimaginable damage to every branch of science they’ve been heavily applied to, and incalculable damage to the people dumb enough to have followed advice based on p-value-heavy research.

So I’m not the troll. The people trolling you are the frequentists who continue to teach introductory stat covering the same old crap because they don’t have the integrity to admit they got it wrong.

Some more trolling just for you JD:

https://twitter.com/BayesLaplace/status/682763542500511744

I don’t think it is so much about frequentism. I remember it: the null hypothesis is that two means are equal. So how do we pick an alternative hypothesis predicted by our theory? Oh, it’s that the two means aren’t equal. I see. Aren’t we supposed to try to disprove a theory, though? It doesn’t seem to make sense. Whatever, it must make sense somehow, since everyone is doing it and that is what the book says. Anyway, I have 5 other classes and no time.

Fast-forward ~5 years and I have a real project and real data: wait, there are a bunch of reasons the means can be different. This really doesn’t make sense. WTF, after like 500 hours of looking and asking people, no one has explained why the null hypothesis is not the research hypothesis. Doesn’t anyone ever explain this? This is crazy. Am I crazy? Seriously, am I the crazy one?

Thank Omniscient Jones I eventually discovered Paul Meehl.

What is wrong with teaching the “sampling distribution of the sample mean”?

Ggg:

See here.

Re Point 2 and possibly 3

http://donaldclarkplanb.blogspot.co.uk/

Andrew, you have said in the past that you don’t even teach the sampling distribution of the mean. I never understood why not.

Here’s how I explain it in a non-technical introduction:

Imagine that you do the same experiment 1000 times. Suppose that each experiment has sample size n. Take n to be 100. Assume also that you are taking random samples from a normal distribution with mean mu and standard deviation sigma.

So, what will happen is that we have, for the first experiment, n data points: x1,…,xn.

Now calculate the mean of this data; call it xbar1.

Do the second experiment, and calculate the mean of the second experiment xbar2.

And so on, all the way to 1000 experiments.

What you get now is 1000 sample means: xbar1,…,xbar1000. If you plot these means, their distribution will be approximately normal, approximately centered around mu, with standard deviation sigma/sqrt(n). This distribution is called the sampling distribution of the sample mean, and its standard deviation, sigma/sqrt(n), is the standard error. If we were to do a very large number of repeated experiments (even larger than 1000), we would find that the sampling distribution approaches N(mu, sigma/sqrt(n)).

(I present the normal distribution in terms of mu and sigma to not confuse students when we transition to R).
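The thought experiment above is also easy to run for real. A minimal simulation sketch (mu, sigma, n, and the number of repetitions are arbitrary illustrative values):

```python
import math
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 5.0, 2.0, 100, 1000   # illustrative values

# One "experiment" = n draws from N(mu, sigma); keep only its sample mean.
xbars = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(xbars))    # close to mu = 5.0
print(statistics.stdev(xbars))    # close to sigma / sqrt(n) = 0.2
print(sigma / math.sqrt(n))       # the theoretical standard error
```

Plotting a histogram of `xbars` then shows the approximately normal shape directly, which tends to land the point better than the formula alone.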

I think it’s a mistake to not teach frequentist methods first. If you want to change the world of statistical practice (and I do, I wish we would just give up frequentist methods and move on to using Bayes already), you can’t do it by ignoring the mainstream practice. It has to be a two-step procedure: make sure the “standard” theory is clear and explain the problems with it, and then explain the Bayesian approach. This gives a much more comprehensive picture to the new user.

And indeed, when I have a huge amount of data (which has a well-defined meaning in my own research context), even I, a committed user of Stan, am not going to lose much by quickly fitting linear mixed models using lmer to understand what’s going to happen once I laboriously fit Stan models over several days. In fact, I could even stop right there, with lmer, and not lose much. Even the Gelman and Hill chapters begin that way, and there’s a good reason for that.

The power of Bayes seems to me to become apparent when you have little data but a lot of prior knowledge, or when you want to relax or change the underlying assumptions about how the data are generated. And also for building computational process models. This is not something the new user, who at this moment may not need to go beyond doing a paired t-test, is even going to appreciate. They have to eventually encounter the limitations of the frequentist approach and then realize how much further they can get using Bayes, but this takes time.

Shravan:

For better or for worse, I do teach sampling distributions. And I discuss the sampling distribution of comparisons and regression coefficients. What I don’t spend time on is the sampling distribution of the sample mean. Who cares about the sample mean?

I don’t understand; the sampling distribution of a comparison (say between two conditions) is going to be the sampling distribution of the difference of the means. How is that different from the sampling distribution of the means?

If I have two means I am comparing, mu1 and mu2, then to examine the sampling distribution of the comparison I am just looking at delta = mu1 - mu2. The sampling distribution of the sample delta is no different in kind from the sampling distribution of the sample means. Maybe I misunderstood something.

Not exactly true, in that, e.g., normally in the dreaded intro stats you consider the issue of unequal and equal variances and paired and independent comparisons. Basically I’m for doing interesting things with students, which means comparisons of different kinds.

Andrew, I’m interested in whether you have considered shifting to a bootstrap approach.

Well, in an introduction for non-statisticians I would not go into unequal variances until much later, after the basic ideas have sunk in (through simulation, for example). In the MSc in Statistics I did at Sheffield, the topic of unequal variances is introduced fairly quickly after the t-test, but I’m talking about teaching people who are never going to become professional statisticians and have limited (very, very limited) background in high-school mathematics.

Yet there it is in basically every intro-to-stats text for the social and behavioral sciences.

That’s true. Upon reflection, that’s probably why I don’t use any textbook in my intro class.

@Elin +1

The key to a modern introductory stats lesson is teaching students how to write a FOR loop.

Shravan:

We can do better than “Imagine that …”. Two possibilities:

1) Do an in-class activity where each student draws a sample and calculates the sample mean (or do this with proportions first — e.g., each person gets a sample of 30 M&M’s and counts how many are blue. To increase interest, ask someone what color they think there are most of, and work with that color).

2) Use one of the online simulations (e.g., http://wise1.cgu.edu/sdmmod/sdm_applet.asp) to illustrate — the best ones allow you to illustrate one-by-one with a small number of samples, then simulate a bunch at once.

Also, I quickly realized that few students really “get” the idea of a sampling distribution the first time around, so I always review it in courses that have a first course as a prerequisite.

I agree that working toward understanding of the frequentist approach is necessary for people to see the problems with it and thereby be more open to Bayesian methods.

That is indeed how I have been teaching it (through simulation), at least since 2004. I didn’t mention simulation here because it seemed orthogonal to the point of what one should or should not teach.

I do an in-class exercise where I give a “population” of 5 numbers (20, 21, 22, 23, 24, usually). Then we get all the samples of size 2, and the students calculate the means and put them on Post-it notes. Then we make a Post-it plot and also calculate the mean and standard deviation. Repeat with size 3 if desired.
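That exercise is small enough to enumerate completely in a few lines; here is a sketch of the same computation (using combinations, since reordering a sample doesn’t change its mean):

```python
from itertools import combinations
from statistics import fmean, pstdev

population = [20, 21, 22, 23, 24]

# Every possible sample of size 2, drawn without replacement.
samples = list(combinations(population, 2))   # 10 samples
means = sorted(fmean(s) for s in samples)

print(means)          # one value per Post-it note
print(fmean(means))   # 22.0 -- exactly the population mean
print(pstdev(means))  # the spread of this tiny sampling distribution
```

The ten means cluster around 22 with only two values (20.5 and 23.5) far out, which is points 1-3 of the list below in miniature.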

I think with the non mathy students you are really doing this to help them see a few things.

1. Sample statistics vary.

2. That the variation has patterns when we use probability sampling.

3. That most random samples give estimates that are close to the population parameter but some give estimates that are pretty bad.

4. In any given study, you don’t know if you have one of the pretty bad ones or not.

5. Increasing sample size for random sampling gives us less chance of getting one of the really bad samples.

To me, for the students that I teach, if they know these things I will be happy and when they are gone they will be able to understand what they read in the news better.

Also, sampling distributions provide a good foundation for understanding why random assignment is powerful and for a more in depth discussion of sampling later. In fact I’m thinking that my exercise could be a decent way to transition to simulation. Meaning you could do the same thing but get the distribution from simulations.

I do think doing a simulation is a great approach too, but my students tend to need the really concrete first.

But sample statistics are like alchemy: you have all the sample outcomes, but you somehow magically get more by reducing all the sample outcomes to a couple of specific summaries. Whatever for?

The “whatever for” does follow from a very complex theory that _works_ only under convenient, unrealistic assumptions, or when you know you are in asymptopia land …

Daniel’s a-la-ABC method uses all the sample outcomes to mechanistically get the posterior from the prior, and could be used to explicate that very complex theory rather directly, but it’s a lot of work and it’s hard to guess all the ways it could be misunderstood.

If our business is to enable others to think about and deal with uncertainties as they _ought_ to, a few good courses are not going to be adequate.

Andrew can you talk about who the student audience for this class is? I think this is some of the confusion we’re seeing in the comments.

Have you figured out the pre and post measures yet? We used LOCUS last semester and it was interesting.