In defense of stories and classroom activities, from a resubmission letter from 1999

I was going through my files looking for some old data (which I still haven’t found!) and came across a letter from 1999 accompanying the submission of a revision of this article with Glickman.

Here’s a part of the letter, a response to some questions of one of the reviewers:

With regard to the comment that “You present absolutely no evidence that any of these demonstration methods is actually helpful. For at least a couple of these demonstrations you need to collect data to see if your tools are helping in understanding the concept. I will let you worry about how to measure this but this is a must”:

Of course, your statement is true, but consider the alternative, which is to do examples like this on the blackboard. We haven’t seen Moore & McCabe or Mosteller or anyone else conducting experiments to show that class-participation demos are _not_ better than straight lectures. And, given this state of uncertainty, we think that it’s useful to consider this alternative approach to teaching this material.

We agree that it would be a good idea for someone to collect data on the effectiveness of various teaching approaches. As all are well aware, this is a potentially huge research project. In the meantime, we think that presenting a bunch of demos in an easy-to-use format is potentially a major contribution. Our feeling is that a paper like this should have either (a) some really cool stuff that people can go out and use right away, or (b) some perhaps-boring stuff but with some evidence that it “works” (e.g., studies showing that students learn better when they work in groups). We think that there is room in the literature for papers like ours of type (a) and also other papers of type (b).

You might also notice that all the papers of the form, “A new proof of the central limit theorem” or whatever, never seem to have evidence of whether they are effective in class. Why? Because it seems evident that if such a new proof can increase statistical understanding, then it’s a good thing and can in some way be usefully integrated into a course. We think this is similar with the demos in our paper: they are ultimately about increasing understanding by focusing on the fact that statistics is, in reality, a participatory process with many actors. This is a deep truth which is obscured when a professor merely does blackboard material. (We have added this point in the conclusion to our article.)

. . .

Finally, the referee writes, “I think this paper needs more work so that it is not just a set of interesting stories.” Actually, I think that interesting stories (with useful directions) is not a bad thing. I wouldn’t want all the Teacher’s corner articles to be like that, but the occasional such article, if of high quality, is a contribution, I believe, in that people might actually read the article and use it to improve their teaching.

I continue to hold and express this pluralistic attitude toward research and publication.

Can anyone guess what went wrong here?

OK, here’s a puzzle for all of you. I received the following email:

Dear Professor Gelman:

The editor of ** asked me to write to see if you would be willing to review MS ** entitled


We are hoping for a review within the next 2-3 weeks if possible. I would appreciate if you confirm whether you are willing to advise me on this by clicking on the url below


This site will also not only allow you to choose an alternative due date, but also to suggest alternative referees if you are unable to review.

If you choose to review the manuscript you can upload your report and cover letter via our secure online form at


This is a secure form and your report will be transmitted anonymously. You should supply either the title or the MS number, **, to ensure that your report is properly filed.

Thanks for your assistance. I very much value your advice.


I’ve omitted identifying details as there’s no point in embarrassing the journal editor. We all make mistakes, and this is not a big one.

Anyway, here’s the riddle: What was horribly wrong about the above email?

And here’s a hint: There’s no way you can figure out the problem merely from what I’ve sent you above. You’ll have to guess.

And another hint: The email came from a legitimate journal, not one of those “predatory” or spam journals.

I’ll give the answer tomorrow, but I’m guessing some of you will figure this out right away.

P.S. OK, OK, you win. Everybody guessed it already (see comments). I guess this puzzle was too easy.

Are Ivy League schools overrated?

I won’t actually answer the above question, as I am offering neither a rating of these schools nor a measure of how others rate them (which would be necessary to calibrate the “overrated” claim). What I am doing is responding to an email from Mark Palko, who wrote:

I [Palko] am in broad agreement with this New Republic article by William Deresiewicz [entitled "Don't Send Your Kid to the Ivy League: The nation's top colleges are turning our kids into zombies"] and I’ll try to blog on it if I can get caught up with more topical threads. I was particularly interested in the part about there being a “non-aggression pact” outside of the sciences.

This fits in with something I’ve noticed. I know this sounds harsh, but when I run across someone who is at the top of their profession and yet seems woefully underwhelming, they often have Ivy League BAs in non-demanding majors (For example, Jeff Zucker, Harvard, History. John Tierney, Yale, American Studies). My working hypothesis is that, while everyone who graduates from an elite school has an advantage in terms of reputation and networks, the actual difficulty of completing certain degrees isn’t that high relative to non-elite schools. Thus a history degree from Harvard isn’t worth that much more than a history degree from a Cal State school.

And David Brooks graduated from the University of Chicago with a degree in history . . .

In all seriousness, I don’t know if I agree with the claim in the headline of that article Palko links to.

I was very impressed by some of the Harvard undergrads I taught. Then again, they were statistics majors. In the old days, statistics might have been considered the soft option compared to math, but I don’t think that’s the case anymore. If anything, math majors are sometimes the sleepwalkers who happened to be good at math in school and never thought of stepping off the track. Anyway, it’s hard for me to make any general statements considering that I don’t teach many undergrads at all at Columbia.

Palko responded:

Yeah, I don’t want to put down Harvard grads, even the history majors. I’m sure that a disproportionate number of the brightest, most promising young historians are working on Harvard B.A.s. What’s more, I suspect most of them are developing valuable relationships with some of the most important names in their field.

What I’m wondering about is the popular notion that Ivy League schools are hard to get into and hard to get through. The first part is certainly true and the second appears to be true for STEM (which also has an additional self-selection bias). I’m just not sure if it holds for all fields.

I don’t think there’s any question that selection bias, networking opportunities and halo effects play a large role here. What if they account for most of the benefit of attending an elite school for most students? This is worrisome from both sides: students are twisting themselves into knots to meet artificial and frankly somewhat odd selection criteria; and we’re giving the students who meet these odd criteria huge advantages in terms of wealth, career, and influence.

That can’t be good.

No, I didn’t say that!

Faye Flam wrote a solid article for the New York Times on Bayesian statistics, and as part of her research she spent some time on the phone with me awhile ago discussing the connections between Bayesian inference and the crisis in science criticism. My longer thoughts on this topic are in my recent article, “The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective,” but of course many more people will get the short version that appeared in the newspaper.

That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.

Unfortunately, though, some of the details got garbled.

Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height


God is in every leaf of every tree. The leaf in question today is the height of journalist and Twitter aficionado Jon Lee Anderson, a man who got some attention a couple years ago after disparaging some dude for having too high a tweets-to-followers ratio. Anderson called the other guy a “little twerp” which made me wonder if he (Anderson) suffered from “tall person syndrome,” that problem that some people of above-average height have, that they think they’re more important than other people because they literally look down on them.

After I raised this issue, a blog commenter named Gary posted an estimate of Anderson’s height using information available on the internet:

Based on this picture: he appears to be fairly tall. But the perspective makes it hard to judge.

Based on this picture: he appears to be about 9-10 inches taller than Catalina Garcia.

But how tall is Catalina Garcia? Not that tall – she’s shorter than the high-wire artist Philippe Petit. And he doesn’t appear to be that tall… about the same height as Claire Danes, who according to Google is 5′ 6″.

So if Jon Lee Anderson is 10″ taller than Catalina Garcia, who is 2″ shorter than Philippe Petit, who is the same height as Claire Danes, then he is 6′ 2″ tall.

I have no idea who Catalina Garcia is, but she makes a decent ruler.
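Gary's chain of comparisons is just arithmetic in inches. A minimal sketch, with variable names chosen here purely for illustration and the differences taken from his comment:

```python
# Gary's chain of height comparisons, all in inches.
claire = 5 * 12 + 6        # Claire Danes: 5' 6" according to Google
philippe = claire          # "about the same height as Claire Danes"
catalina = philippe - 2    # Catalina is 2" shorter than Philippe
jon = catalina + 10        # Jon is 10" taller than Catalina

feet, inches = divmod(jon, 12)
print(f"Jon: {jon} inches = {feet}' {inches}\"")
```

Each step carries its own uncertainty, which is what the Stan model below tries to account for.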

I happened to run across that comment the other day (when searching the blog for Tom Scocca) and it inspired me to put out a call for the above analysis to be implemented in Stan. A couple of other faithful commenters (Andrew Whalen and Daniel Lakeland) did this. But I wasn’t quite satisfied with either of those efforts (sorry, I’m picky, what can I say? You must’ve known this going in). So I just did it myself.


Before getting to my model, let me emphasize that nothing fancy is going on. I’m pretty much just translating Gary’s above comment into statistical notation.

Here’s my Stan program:

// Written in the Stan syntax of the time; newer Stan versions use = for assignment.
transformed data {
  // Population means and standard deviations of adult heights, in inches
  real mu_men;
  real mu_women;
  real sigma_men;
  real sigma_women;
  mu_men <- 69.1;
  mu_women <- 63.7;
  sigma_men <- 2.9;
  sigma_women <- 2.7;
}
parameters {
  real Jon;
  real Catalina;
  real Phillipe;
  real Claire;
  // Shoe heights in inches: 0 to 1 for men, 0 to 4 for women
  real<lower=0,upper=1> Jon_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_2;
  real<lower=0,upper=1> Phillipe_shoe_1;
  real<lower=0,upper=1> Phillipe_shoe_2;
  real<lower=0,upper=4> Claire_shoe_1;
}
model {
  // Priors on heights
  Jon ~ normal(mu_men,sigma_men);
  Catalina ~ normal(mu_women,sigma_women);
  Phillipe ~ normal(mu_men,sigma_men);
  Claire ~ normal(66,1);
  // Height differences as seen in the photos (shoes included)
  (Jon + Jon_shoe_1) - (Catalina + Catalina_shoe_1) ~ normal(9.5,1.5);
  (Catalina + Catalina_shoe_2) - (Phillipe + Phillipe_shoe_1) ~ normal(2,1);
  (Phillipe + Phillipe_shoe_2) - (Claire + Claire_shoe_1) ~ normal(0,1);
  // Shoe heights: beta(2,2), rescaled to 0-4 inches for the women
  Jon_shoe_1 ~ beta(2,2);
  Catalina_shoe_1 / 4 ~ beta(2,2);
  Catalina_shoe_2 / 4 ~ beta(2,2);
  Phillipe_shoe_1 ~ beta(2,2);
  Phillipe_shoe_2 ~ beta(2,2);
  Claire_shoe_1 / 4 ~ beta(2,2);
}

Hey! Html ate some of my code in the original version of this post! I didn’t notice till a commenter pointed this out. In the declarations, the “shoe” variables are bounded: <lower=0,upper=1> for the men’s shoes, and <lower=0,upper=4> for the women’s shoes.

I’ll present the results in a moment, but first here’s a quick discussion of some of the choices that went into the model:

- I got the population distributions of heights of men and women from a 1992 article in the journal Risk Analysis, “Bivariate distributions for height and weight of men and women in the United States,” by J. Brainard and D. E. Burmaster, which is the reference that Deb Nolan and I used for the heights distribution in our book on Teaching Statistics.

- I assumed that men’s shoe heights were between 0 and 1 inches, and that women’s shoe heights were between 0 and 4 inches, in all cases using a beta(2,2) distribution to model the distribution. This is a hack in so many ways (for one thing, nobody in these pictures is barefoot so 0 isn’t the right lower bound; for another, some men do wear elevator shoes and boots with pretty high heels) but, as always, ya gotta start somewhere.

- I took the height comparisons as stated in Gary’s comment, giving a standard deviation of 1 inch for each, except that I gave a standard deviation of 1.5 inches for the “9 or 10 inches” comparison between Jon and Catalina, since that seemed like a tougher call.

- Based on the statement that Claire was 66 inches tall, I gave her a prior of 66 with a standard deviation of 1.
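To see what the shoe priors imply, here is a small sketch of the mean and standard deviation of a beta(2,2) variable rescaled to each range. The function name `scaled_beta_mean_sd` is made up for this illustration; the formulas are the standard beta mean and variance.

```python
from math import sqrt

def scaled_beta_mean_sd(a, b, upper):
    """Mean and sd of upper * X, where X ~ Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return upper * mean, upper * sqrt(var)

men_mean, men_sd = scaled_beta_mean_sd(2, 2, 1)       # men's shoes, 0 to 1 inch
women_mean, women_sd = scaled_beta_mean_sd(2, 2, 4)   # women's shoes, 0 to 4 inches

print(f"men:   mean {men_mean:.2f}, sd {men_sd:.2f}")
print(f"women: mean {women_mean:.2f}, sd {women_sd:.2f}")
```

So the prior centers men's shoes at half an inch and women's at two inches, with most of the mass away from the endpoints.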


I saved the Stan program as “heights.stan” and ran it from R:

heights <- stan_run("heights.stan", chains=4, iter=1000)
                 mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Jon              74.3     0.1 1.8  70.6  73.1  74.3  75.5  77.9   877    1
Catalina         65.1     0.1 1.5  62.3  64.1  65.1  66.2  67.9   754    1
Phillipe         66.1     0.0 1.3  63.7  65.2  66.1  67.0  68.6   813    1
Claire           65.5     0.0 1.0  63.8  64.9  65.5  66.1  67.5  1162    1
Jon_shoe_1        0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1658    1
Catalina_shoe_1   1.5     0.0 0.8   0.2   0.9   1.4   2.1   3.3  1708    1
Catalina_shoe_2   2.6     0.0 0.8   0.9   2.1   2.7   3.2   3.8  1707    1
Phillipe_shoe_1   0.5     0.0 0.2   0.1   0.3   0.5   0.6   0.9  1391    1
Phillipe_shoe_2   0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1562    1
Claire_shoe_1     1.6     0.0 0.8   0.2   1.0   1.5   2.2   3.3  1390    1
lp__            -21.6     0.1 2.5 -27.6 -23.1 -21.2 -19.8 -17.8   749    1

OK, everything seems to have converged, and it looks like Jon is somewhere between 6'1" and 6'4".
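For readers who think in feet and inches, the conversion can be sketched as follows. The helper `feet_inches` is a hypothetical name introduced here; the inputs are the 25%, 75%, 2.5% and 97.5% posterior quantiles for Jon, copied from the table above.

```python
def feet_inches(total_inches):
    """Split a height in inches into whole feet and remaining inches (1 decimal)."""
    feet = int(total_inches // 12)
    inches = round(total_inches - 12 * feet, 1)
    return feet, inches

# Posterior quantiles for Jon, in inches, from the table
central_50 = (feet_inches(73.1), feet_inches(75.5))
central_95 = (feet_inches(70.6), feet_inches(77.9))

print("50% interval:", central_50)
print("95% interval:", central_95)
```

The "6'1" to 6'4"" summary corresponds roughly to the central 50% interval; the 95% interval is considerably wider.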

Tables are ugly. Let's make some graphs:

library(rstan)  # extract(), monitor()
library(arm)    # coefplot()

sims <- extract(heights,permuted=FALSE)
mon <- monitor(sims,warmup=0)

# Heights (parameters 1 through 4)
png("heights1.png", height=170, width=500)
subset <- 1:4
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(dimnames(mon)[[1]][subset]), main="Estimated heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()  # close the device so the file is written

# Shoe heights (parameters 5 through 10)
png("heights2.png", height=180, width=500)
subset <- 5:10
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(c("Jon", "Catalina 1", "Catalina 2", "Phillipe 1", "Phillipe 2", "Claire")), main="Estimated shoe heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

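That is:

[Figures heights1.png and heights2.png: estimated heights and estimated shoe heights, in inches (+/- 1 and 2 s.e.)]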

Model criticism

OK, now let's do some model criticism. What's in this graph that we don't believe, that doesn't make sense?

- Most obviously, some of the intervals for shoe height go negative. But that's actually not our model, it's coming from our crude summary of inference as +/- 2 sd. If instead we used the simulated quantiles directly, this problem would not arise.

- Catalina's shoes are estimated to be taller in her second picture (the one with Phillipe) than in the first, with Jon. But that's not so unreasonable, given the pictures. If anything, perhaps the intervals overlap too much. But that is just telling us that we might have additional information from the photos that is not captured in our model.

- The inferences for everyone's heights seem pretty weak. Is it really possible that Phillipe Petit could be 5'9" tall (as is implied by the upper bound of his 95% posterior interval)? Maybe not. Again, this implies that we have additional prior information that could be incorporated into the model to make better predictions.
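The first point, that a mean +/- 2 sd summary can spill outside a parameter's bounds while simulation quantiles cannot, can be illustrated with a small sketch. The draws below are a hypothetical stand-in (a rescaled beta distribution on 0 to 4 inches with mean near 1.5 and sd near 0.8), not the actual posterior simulations from the Stan fit.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in for posterior draws of a shoe height bounded to [0, 4] inches
draws = [4 * random.betavariate(1.8, 3.0) for _ in range(4000)]

# Crude summary: mean +/- 2 sd
mean = statistics.fmean(draws)
sd = statistics.stdev(draws)
crude_interval = (mean - 2 * sd, mean + 2 * sd)

# Summary taken directly from the simulations: 2.5% and 97.5% quantiles
cuts = statistics.quantiles(draws, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
quantile_interval = (cuts[0], cuts[-1])

print("mean +/- 2 sd:     ", crude_interval)
print("2.5%-97.5% quantiles:", quantile_interval)
```

The crude interval dips below zero, while the quantile interval stays inside the bounds because it is built from draws that all respect them.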

Fitting a model, making inferences, evaluating these inferences to see if we have additional information we could include: That's what it's all about.

Software criticism

Finally, let's do the same thing with our code. What went wrong during the above process:

- First off, my Stan model wasn't compiling. It was producing an error at some weird place in the middle of the program. I couldn't figure out what was going on. Then, at some point in cutting and pasting, I realized what had happened: my text editor was using a font in which lower-case-l and the number 1 were indistinguishable. And I'd accidentally switched one for the other. I changed the font and fixed the problem.

- Again Stan gave an error, this time even more mysterious:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created!

Agreeing to the Xcode/iOS license requires admin privileges, please re-run as root via sudo. In addition: Warning message:
running command ‘/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB file3d6829c6b35e.cpp 2> file3d6829c6b35e.cpp.err.txt' had status 1

I posted the problem on stan-users and Daniel Lee replied that Apple had automatically updated Xcode and I needed to do a few clicks on my computer to activate the permissions.

- Then it ran, indeed, it ran on the first try, believe it or not!

- There were some issues with the R code. The calls to coefplot are a bit ugly, I had to do a bit of fiddling to get everything to look OK. It would be better to be able to do this directly from rstan, or at least to be able to make these plots with a bit less effort.

- Umm, that's about it. Actually the programming wasn't too bad.


I like Bayesian (Jaynesian) data analysis. You lay out your model step by step, and when the inferences don't seem right (either because of being in the wrong place, or being too strong, or too weak), you can go back and figure out what went wrong, or what information is available that you could throw into the model.

P.S. to Andrew Whalen and Daniel Lakeland: Don't worry, you've still earned your Stan T-shirts. Just email me with your size, and your shirts will be in the mail.

On deck this week

Mon: Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height

Tues: Are Ivy League schools overrated?

Wed: Can anyone guess what went wrong here?

Thurs: What went wrong

Fri: 65% of principals say that at least 30% of students . . . wha??

Sat: Carrie McLaren was way out in front of the anti-Gladwell bandwagon

Sun: Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

People used to send me ugly graphs, now I get these things

Antonio Rinaldi points me to this journal article which reports:

We found a sinusoidal pattern in CMM [cutaneous malignant melanoma] risk by season of birth (P = 0.006). . . . Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. . . . In this large cohort study, persons born in spring had increased risk of CMM in childhood through young adulthood, suggesting that the first few months of life may be a critical period of UVR susceptibility.

Rinaldi expresses concern about multiple comparisons, along with skepticism about the hypothesis that in Sweden 2-to-3-month-old babies get some sunshine while completely naked.
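The multiple-comparisons concern can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, three independent tests at the 0.05 level with no true effects. The paper's seasonal comparisons share a reference group and so are not independent, and the paper may have examined more than three comparisons, so this is only a rough guide.

```python
# Chance of at least one "significant" result among k independent null tests.
alpha = 0.05   # conventional significance threshold (assumed)
k = 3          # spring, summer, winter, each compared with fall

familywise_error = 1 - (1 - alpha) ** k
print(f"P(at least one p < {alpha} among {k} null tests) = {familywise_error:.3f}")
```

Under these assumptions, a single "significant" season turns up about one time in seven even when nothing is going on.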

P.S. Some of the comments below are fascinating, far more so than the original paper! Maybe we should call this the “stone soup” or “Bem” phenomenon, when a work that is fairly empty of inherent interest (and likely does not represent any real, persistent pattern) gets a lot of people thinking furiously about a topic.

“An exact fishy test”

Macartan Humphreys supplied this amusing demo. Just click on the link and try it—it’s fun!

Here’s an example: I came up with 10 random numbers:

> round(.5+runif(10)*100)
 [1] 56 23 70 83 29 74 23 91 25 89

and entered them into Macartan’s app, which promptly responded:


You chose the numbers 56 23 70 83 29 74 23 91 25 89

But these are clearly not random numbers. We can tell because random numbers do not contain patterns but the numbers you entered show a fairly obvious pattern.

Take another look at the sequence you put in. You will see that the number of prime numbers in this sequence is: 5. But the ‘expected number’ from a random process is just 2.5. How odd is this pattern? Quite odd in fact. The probability that a truly random process would turn up numbers like this is just p=0.074 (i.e. less than 8%).

Try again (with really random numbers this time)!

ps: you might think that if the p value calculated above is high (for example if it is greater than 15%) that this means that the numbers you chose are not all that odd; but in fact it means that the numbers are really particularly odd since the fishy test produces p values above 15% for less than 2% of all really random numbers. For more on how to fish see here.
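The prime-counting "pattern" can be checked with a short sketch. It assumes the ten numbers are independent uniform draws from 1 to 100 and computes a one-sided binomial tail probability. This is one plausible way to get a p-value, not necessarily the app's own calculation, which reported 0.074.

```python
from math import comb, isqrt

def is_prime(n):
    """Trial-division primality check, fine for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, isqrt(n) + 1))

numbers = [56, 23, 70, 83, 29, 74, 23, 91, 25, 89]
n_primes = sum(is_prime(x) for x in numbers)

# Share of primes among 1..100, and the expected count in ten draws
p_prime = sum(is_prime(x) for x in range(1, 101)) / 100
expected = len(numbers) * p_prime

# P(at least n_primes primes) under Binomial(10, p_prime)
n = len(numbers)
tail = sum(comb(n, k) * p_prime**k * (1 - p_prime)**(n - k)
           for k in range(n_primes, n + 1))

print(n_primes, expected, round(tail, 3))
```

The joke is that the test statistic is chosen after seeing the data: with enough candidate patterns (primes, evens, repeated digits, and so on), some "pattern" will look surprising in almost any sequence.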

MA206 Program Director’s Memorandum

United States Military Academy

A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.

As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect in that military-bureaucratic way), also this list of specific objectives of the course:

1. Understanding the notion of randomness and the role of variability and sampling in making inference.

2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.

3. Employ models using discrete or continuous random variables to answer basic probability questions.

4. Be able to draw appropriate conclusions from confidence intervals.

5. Construct hypothesis tests and draw appropriate conclusions from p-values.

6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.

7. Critically evaluate statistical arguments in print media and scientific journals.

This is all ok except for items 4 and 5, I suppose.

Also, at the end, a list of rules, beginning with:

a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.

b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .

e. Jackets are not permitted in the classroom . . .

g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .

and ending with this:

j. Rules common to blackboards, written work, and examinations:

1) Draw and label figures or graphs when appropriate.

2) Report numerical answers using the appropriate number of significant digits and units of measure.

Now those are some rules I can get behind. They should be part of every statistics honor code.

Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height

I’d like to see a Stan implementation of the analysis presented in this comment by Gary from a year and a half ago.