
Are Ivy League schools overrated?

I won’t actually answer the above question, as I am offering neither a rating of these schools nor a measure of how others rate them (which would be necessary to calibrate the “overrated” claim). What I am doing is responding to an email from Mark Palko, who wrote:

I [Palko] am in broad agreement with this New Republic article by William Deresiewicz [entitled "Don't Send Your Kid to the Ivy League: The nation's top colleges are turning our kids into zombies"] and I’ll try to blog on it if I can get caught up with more topical threads. I was particularly interested in the part about there being a “non-aggression pact” outside of the sciences.

This fits in with something I’ve noticed. I know this sounds harsh, but when I run across someone who is at the top of their profession and yet seems woefully underwhelming, they often have Ivy League B.A.s in non-demanding majors (for example, Jeff Zucker, Harvard, History; John Tierney, Yale, American Studies). My working hypothesis is that, while everyone who graduates from an elite school has an advantage in terms of reputation and networks, the actual difficulty of completing certain degrees isn’t that high relative to non-elite schools. Thus a history degree from Harvard isn’t worth that much more than a history degree from a Cal State school.

And David Brooks graduated from the University of Chicago with a degree in history . . .

In all seriousness, I don’t know if I agree with the claim in the headline of that article Palko links to.

I was very impressed by some of the Harvard undergrads I taught. Then again, they were statistics majors. In the old days, statistics might have been considered the soft option compared to math, but I don’t think that’s the case anymore. If anything, math majors are sometimes the sleepwalkers who happened to be good at math in school and never thought of stepping off the track. Anyway, it’s hard for me to make any general statements considering that I don’t teach many undergrads at all at Columbia.

Palko responded:

Yeah, I don’t want to put down Harvard grads, even the history majors. I’m sure that a disproportionate number of the brightest, most promising young historians are working on Harvard B.A.s. What’s more, I suspect most of them are developing valuable relationships with some of the most important names in their field.

What I’m wondering about is the popular notion that Ivy League schools are hard to get into and hard to get through. The first part is certainly true, and the second appears to be true for STEM (which also has an additional self-selection bias). I’m just not sure if it holds for all fields.

I don’t think there’s any question that selection bias, networking opportunities and halo effects play a large role here. What if they account for most of the benefit of attending an elite school for most students? This is worrisome from both sides: students are twisting themselves into knots to meet artificial and frankly somewhat odd selection criteria; and we’re giving the students who meet these odd criteria huge advantages in terms of wealth, career, and influence.

That can’t be good.

No, I didn’t say that!

Faye Flam wrote a solid article for the New York Times on Bayesian statistics, and as part of her research she spent some time on the phone with me awhile ago discussing the connections between Bayesian inference and the crisis in science criticism. My longer thoughts on this topic are in my recent article, “The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective,” but of course many more people will get the short version that appeared in the newspaper.

That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.

Unfortunately, though, some of the details got garbled.

Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height


God is in every leaf of every tree. The leaf in question today is the height of journalist and Twitter aficionado Jon Lee Anderson, a man who got some attention a couple years ago after disparaging some dude for having too high a tweets-to-followers ratio. Anderson called the other guy a “little twerp” which made me wonder if he (Anderson) suffered from “tall person syndrome,” that problem that some people of above-average height have, that they think they’re more important than other people because they literally look down on them.

After I raised this issue, a blog commenter named Gary posted an estimate of Anderson’s height using information available on the internet:

Based on this picture: he appears to be fairly tall. But the perspective makes it hard to judge.

Based on this picture: he appears to be about 9-10 inches taller than Catalina Garcia.

But how tall is Catalina Garcia? Not that tall – she’s shorter than the high-wire artist Philippe Petit. And he doesn’t appear to be that tall… about the same height as Claire Danes, who according to Google is 5′ 6″.

So if Jon Lee Anderson is 10″ taller than Catalina Garcia, who is 2″ shorter than Philippe Petit, who is the same height as Claire Danes, then he is 6′ 2″ tall.

I have no idea who Catalina Garcia is, but she makes a decent ruler.
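Gary’s chain of comparisons reduces to simple arithmetic. As a sanity check, here is the point-estimate version in Python, taking each comparison as exact (the variable names are mine):

```python
# Point-estimate version of Gary's chain of comparisons (all heights in inches).
claire = 66               # 5'6", per Google
philippe = claire + 0     # "about the same height" as Claire
catalina = philippe - 2   # 2 inches shorter than Philippe
jon = catalina + 10       # 9-10 inches taller than Catalina; take the high end
print(jon)                # 74 inches, i.e., 6'2"
```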

I happened to run across that comment the other day (when searching the blog for Tom Scocca) and it inspired me to put out a call for the above analysis to be implemented in Stan. A couple of other faithful commenters (Andrew Whalen and Daniel Lakeland) did this. But I wasn’t quite satisfied with either of those efforts (sorry, I’m picky, what can I say? You must’ve known this going in). So I just did it myself.


Before getting to my model, let me emphasize that nothing fancy is going on. I’m pretty much just translating Gary’s above comment into statistical notation.

Here’s my Stan program:

transformed data {
  real mu_men;
  real mu_women;
  real sigma_men;
  real sigma_women;
  mu_men <- 69.1;
  mu_women <- 63.7;
  sigma_men <- 2.9;
  sigma_women <- 2.7;
}
parameters {
  real Jon;
  real Catalina;
  real Phillipe;
  real Claire;
  real<lower=0,upper=1> Jon_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_1;
  real<lower=0,upper=4> Catalina_shoe_2;
  real<lower=0,upper=1> Phillipe_shoe_1;
  real<lower=0,upper=1> Phillipe_shoe_2;
  real<lower=0,upper=4> Claire_shoe_1;
}
model {
  Jon ~ normal(mu_men,sigma_men);
  Catalina ~ normal(mu_women,sigma_women);
  Phillipe ~ normal(mu_men,sigma_men);
  Claire ~ normal(66,1);
  (Jon + Jon_shoe_1) - (Catalina + Catalina_shoe_1) ~ normal(9.5,1.5);
  (Catalina + Catalina_shoe_2) - (Phillipe + Phillipe_shoe_1) ~ normal(2,1);
  (Phillipe + Phillipe_shoe_2) - (Claire + Claire_shoe_1) ~ normal(0,1);
  Jon_shoe_1 ~ beta(2,2);
  Catalina_shoe_1 / 4 ~ beta(2,2);
  Catalina_shoe_2 / 4 ~ beta(2,2);
  Phillipe_shoe_1 ~ beta(2,2);
  Phillipe_shoe_2 ~ beta(2,2);
  Claire_shoe_1 / 4 ~ beta(2,2);
}

I’ll present the results in a moment, but first here’s a quick discussion of some of the choices that went into the model:

- I got the population distributions of heights of men and women from a 1992 article in the journal Risk Analysis, “Bivariate distributions for height and weight of men and women in the United States,” by J. Brainard and D. E. Burmaster, which is the reference that Deb Nolan and I used for the heights distribution in our book on Teaching Statistics.

- I assumed that men’s shoe heights were between 0 and 1 inches, and that women’s shoe heights were between 0 and 4 inches, in all cases using a beta(2,2) distribution to model the distribution. This is a hack in so many ways (for one thing, nobody in these pictures is barefoot so 0 isn’t the right lower bound; for another, some men do wear elevator shoes and boots with pretty high heels) but, as always, ya gotta start somewhere.

- I took the height comparisons as stated in Gary’s comment, giving a standard deviation of 1 inch for each, except that I gave a standard deviation of 1.5 inches for the “9 or 10 inches” comparison between Jon and Catalina, since that seemed like a tougher call.

- Based on the statement that Claire was 66 inches tall, I gave her a prior of 66 with a standard deviation of 1.
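The implied shoe-height priors are easy to simulate directly. Here’s a quick Python sketch (my reconstruction of the scaled beta(2,2) priors described above, not part of the Stan fit):

```python
import random

# Simulate the shoe-height priors: beta(2,2) scaled to [0,1] inch for men's
# shoes and to [0,4] inches for women's shoes.
random.seed(1)
draws = [random.betavariate(2, 2) for _ in range(100_000)]
mens = draws                      # men's shoe heights, 0 to 1 inch
womens = [4 * d for d in draws]   # women's shoe heights, 0 to 4 inches
print(round(sum(mens) / len(mens), 2))     # beta(2,2) has mean 0.5
print(round(sum(womens) / len(womens), 2)) # so the women's prior mean is 2 inches
```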


I saved the Stan program as “heights.stan” and ran it from R:

library(rstan)
heights <- stan("heights.stan", chains=4, iter=1000)
print(heights)
                 mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Jon              74.3     0.1 1.8  70.6  73.1  74.3  75.5  77.9   877    1
Catalina         65.1     0.1 1.5  62.3  64.1  65.1  66.2  67.9   754    1
Phillipe         66.1     0.0 1.3  63.7  65.2  66.1  67.0  68.6   813    1
Claire           65.5     0.0 1.0  63.8  64.9  65.5  66.1  67.5  1162    1
Jon_shoe_1        0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1658    1
Catalina_shoe_1   1.5     0.0 0.8   0.2   0.9   1.4   2.1   3.3  1708    1
Catalina_shoe_2   2.6     0.0 0.8   0.9   2.1   2.7   3.2   3.8  1707    1
Phillipe_shoe_1   0.5     0.0 0.2   0.1   0.3   0.5   0.6   0.9  1391    1
Phillipe_shoe_2   0.5     0.0 0.2   0.1   0.4   0.5   0.7   0.9  1562    1
Claire_shoe_1     1.6     0.0 0.8   0.2   1.0   1.5   2.2   3.3  1390    1
lp__            -21.6     0.1 2.5 -27.6 -23.1 -21.2 -19.8 -17.8   749    1

OK, everything seems to have converged, and it looks like Jon is somewhere between 6'1" and 6'4".

Tables are ugly. Let's make some graphs:

library(arm)  # for coefplot()
sims <- extract(heights, permuted=FALSE)
mon <- monitor(sims, warmup=0)
png("heights1.png", height=170, width=500)
subset <- 1:4
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(dimnames(mon)[[1]][subset]), main="Estimated heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

png("heights2.png", height=180, width=500)
subset <- 5:10
coefplot(rev(mon[subset,"mean"]), rev(mon[subset,"sd"]), varnames=rev(c("Jon", "Catalina 1", "Catalina 2", "Phillipe 1", "Phillipe 2", "Claire")), main="Estimated shoe heights in inches (+/- 1 and 2 s.e.)\n", cex.main=1, cex.var=1, mar=c(0,4,5.1,2))
dev.off()

That is: [coefficient plots of the estimated heights and the estimated shoe heights, in inches]

Model criticism

OK, now let's do some model criticism. What's in this graph that we don't believe, that doesn't make sense?

- Most obviously, some of the intervals for shoe height go negative. But that's actually not our model, it's coming from our crude summary of inference as +/- 2 sd. If instead we used the simulated quantiles directly, this problem would not arise.

- Catalina's shoes are estimated to be taller in her second picture (the one with Phillipe) than in the first, with Jon. But that's not so unreasonable, given the pictures. If anything, perhaps the intervals overlap too much. But that is just telling us that we might have additional information from the photos that is not captured in our model.

- The inferences for everyone's heights seem pretty weak. Is it really possible that Philippe Petit could be 5'9" tall (as is implied by the upper bound of his 95% posterior interval)? Maybe not. Again, this implies that we have additional prior information that could be incorporated into the model to make better predictions.
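The first criticism above is easy to demonstrate: for a quantity bounded below at zero, a mean +/- 2 sd summary can dip negative even though every simulation draw is positive. A toy illustration in Python (an exponential stand-in, not the actual posterior draws):

```python
import random

# A positive, right-skewed quantity (a stand-in for a shoe height near its
# lower bound), summarized two ways.
random.seed(1)
draws = sorted(random.expovariate(2.0) for _ in range(10_000))
mean = sum(draws) / len(draws)
sd = (sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)) ** 0.5
lower_normal = mean - 2 * sd                     # +/- 2 sd summary goes negative
lower_quantile = draws[int(0.025 * len(draws))]  # 2.5% quantile stays positive
print(round(lower_normal, 3), round(lower_quantile, 3))
```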

Fitting a model, making inferences, evaluating these inferences to see if we have additional information we could include: That's what it's all about.

Software criticism

Finally, let's do the same thing with our code. What went wrong during the above process:

- First off, my Stan model wasn't compiling. It was producing an error at some weird place in the middle of the program. I couldn't figure out what was going on. Then, at some point in cutting and pasting, I realized what had happened: my text editor was using a font in which lower-case-l and the number 1 were indistinguishable. And I'd accidentally switched one for the other. I changed the font and fixed the problem.

- Again Stan gave an error, this time even more mysterious:

Error in compileCode(f, code, language = language, verbose = verbose) :
Compilation ERROR, function(s)/method(s) not created!

Agreeing to the Xcode/iOS license requires admin privileges, please re-run as root via sudo. In addition: Warning message:
running command ‘/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB file3d6829c6b35e.cpp 2> file3d6829c6b35e.cpp.err.txt' had status 1

I posted the problem on stan-users and Daniel Lee replied that Apple had automatically updated Xcode and I needed to do a few clicks on my computer to activate the permissions.

- Then it ran, indeed, it ran on the first try, believe it or not!

- There were some issues with the R code. The calls to coefplot are a bit ugly, I had to do a bit of fiddling to get everything to look OK. It would be better to be able to do this directly from rstan, or at least to be able to make these plots with a bit less effort.

- Umm, that's about it. Actually the programming wasn't too bad.


I like Bayesian (Jaynesian) data analysis. You lay out your model step by step, and when the inferences don't seem right (either because of being in the wrong place, or being too strong, or too weak), you can go back and figure out what went wrong, or what information is available that you could throw into the model.

P.S. to Andrew Whalen and Daniel Lakeland: Don't worry, you've still earned your Stan T-shirts. Just email me with your size, and your shirts will be in the mail.

On deck this week

Mon: Some general principles of Bayesian data analysis, arising from a Stan analysis of Jon Lee Anderson’s height

Tues: Are Ivy League schools overrated?

Wed: Can anyone guess what went wrong here?

Thurs: What went wrong

Fri: 65% of principals say that at least 30% of students . . . wha??

Sat: Carrie McLaren was way out in front of the anti-Gladwell bandwagon

Sun: Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests

People used to send me ugly graphs, now I get these things

Antonio Rinaldi points me to this journal article which reports:

We found a sinusoidal pattern in CMM [cutaneous malignant melanoma] risk by season of birth (P = 0.006). . . . Adjusted odds ratios for CMM by season of birth were 1.21 [95% confidence interval (CI), 1.05–1.39; P = 0.008] for spring, 1.07 (95% CI, 0.92–1.24; P = 0.40) for summer and 1.12 (95% CI, 0.96–1.29; P = 0.14) for winter, relative to fall. . . . In this large cohort study, persons born in spring had increased risk of CMM in childhood through young adulthood, suggesting that the first few months of life may be a critical period of UVR susceptibility.

Rinaldi expresses concern about multiple comparisons, along with skepticism about the implicit hypothesis that, in Sweden, babies of two to three months get some sunshine completely naked.
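For a rough sense of the multiple-comparisons concern: with three season-vs-fall comparisons and no true effect anywhere, the chance of at least one nominally significant result is already about 14%. A toy simulation (my sketch, not the paper’s actual analysis):

```python
import random

# Three independent season-vs-fall z-tests under the null, repeated many times;
# count how often at least one comes out "significant" at p < 0.05.
random.seed(1)
trials, hits = 100_000, 0
for _ in range(trials):
    if any(abs(random.gauss(0, 1)) > 1.96 for _ in range(3)):
        hits += 1
print(round(hits / trials, 3))  # close to 1 - 0.95**3 = 0.143
```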

P.S. Some of the comments below are fascinating, far more so than the original paper! Maybe we should call this the “stone soup” or “Bem” phenomenon, when a work that is fairly empty of inherent interest (and likely does not represent any real, persistent pattern) gets a lot of people thinking furiously about a topic.

“An exact fishy test”

Macartan Humphreys supplied this amusing demo. Just click on the link and try it—it’s fun!

Here’s an example: I came up with 10 random numbers:

> round(.5+runif(10)*100)
 [1] 56 23 70 83 29 74 23 91 25 89

and entered them into Macartan’s app, which promptly responded:


You chose the numbers 56 23 70 83 29 74 23 91 25 89

But these are clearly not random numbers. We can tell because random numbers do not contain patterns but the numbers you entered show a fairly obvious pattern.

Take another look at the sequence you put in. You will see that the number of prime numbers in this sequence is: 5. But the `expected number’ from a random process is just 2.5. How odd is this pattern? Quite odd in fact. The probability that a truly random process would turn up numbers like this is just p=0.074 (i.e. less than 8%).

Try again (with really random numbers this time)!

ps: you might think that if the p value calculated above is high (for example if it is greater than 15%) that this means that the numbers you chose are not all that odd; but in fact it means that the numbers are really particularly odd since the fishy test produces p values above 15% for less than 2% of all really random numbers. For more on how to fish see here.
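Out of curiosity, the prime-number part of the app’s calculation can be reconstructed. This is my guess at what it does, not its actual source; the exact binomial gives about 0.078 where the app reports 0.074, presumably from a slightly different setup or rounding:

```python
from math import comb

# 25 of the integers 1..100 are prime, so each draw is prime with
# probability 0.25 and the expected count in 10 draws is 2.5.
primes = [n for n in range(1, 101)
          if n > 1 and all(n % d for d in range(2, int(n**0.5) + 1))]
p_prime = len(primes) / 100
expected = 10 * p_prime
# P(5 or more primes in 10 draws) under Binomial(10, 0.25):
p_value = sum(comb(10, k) * p_prime**k * (1 - p_prime)**(10 - k)
              for k in range(5, 11))
print(expected, round(p_value, 3))  # 2.5 and about 0.078
```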

MA206 Program Director’s Memorandum

United States Military Academy

A couple years ago I gave a talk at West Point. It was fun. The students are all undergraduates, and most of the instructors were just doing the job for two years or so between other assignments. The permanent faculty were focused on teaching and organizing the curriculum.

As part of my visit I sat in on an intro statistics class and did a demo for them (probably it was the candy weighing but I don’t remember). At that time I picked up an information sheet for the course: “Memorandum for Academic Year (AY) 13-02 MA206 Students, United States Military Academy.” Lots of details (as one would expect, given the military-bureaucratic way of doing things), and also this list of specific objectives of the course:

1. Understanding the notion of randomness and the role of variability and sampling in making inference.

2. Apply the axioms and basic properties of probability and conditional probability to quantify the likelihood of events.

3. Employ models using discrete or continuous random variables to answer basic probability questions.

4. Be able to draw appropriate conclusions from confidence intervals.

5. Construct hypothesis tests and draw appropriate conclusions from p-values.

6. Apply and assess linear regression models for point estimation and association between explanatory and dependent variables.

7. Critically evaluate statistical arguments in print media and scientific journals.

This is all ok except for items 4 and 5, I suppose.

Also, at the end, a list of rules, beginning with:

a. All cadets are expected to maintain proper military bearing and appearance during instruction in accordance with appropriate regulations.

b. Respect others in the classroom – No profanity, unprofessional jokes, or unprofessional computer items . . .

e. Jackets are not permitted in the classroom . . .

g. Drinks must be inside a closed container (plastic bottle with a top, for example) or in the Dean-approved mug . . .

and ending with this:

j. Rules common to blackboards, written work, and examinations:

1) Draw and label figures or graphs when appropriate.

2) Report numerical answers using the appropriate number of significant digits and units of measure.

Now those are some rules I can get behind. They should be part of every statistics honor code.

Free Stan T-shirt to the first “little twerp” who does a (good) Bayesian analysis of Jon Lee Anderson’s height

Cata w Jon Lee Anderson

I’d like to see a Stan implementation of the analysis presented in this comment by Gary from a year and a half ago.

“Derek Jeter was OK”


Tom Scocca files a bizarrely sane column summarizing the famous shortstop’s accomplishments:

Derek Jeter was an OK ballplayer. He was pretty good at playing baseball, overall, and he did it for a pretty long time. . . . You have to be good at baseball to last 20 seasons in the major leagues. . . . He was a successful batter in productive lineups for many years. . . . He was not Ted Williams or Rickey Henderson. Spectators did not come away from seeing Derek Jeter marveling at the stupendous, unimaginable feats of hitting they had seen. But he did lots and lots of damage. He got many big hits and contributed to many big rallies. Pitchers would have preferred not to have to pitch to him. . . . His considerable athletic abilities allowed him to sometimes make spectacular leaping and twisting plays on misjudged balls that better shortstops would have played routinely. People enjoyed watching him make those plays, and that enjoyment led to his winning five Gold Gloves. That misplaced acclaim, in turn, helped spur more advanced analysis of defensive play in baseball, a body of knowledge which will ensure that no one ever again will be able to play shortstop as badly as Jeter for as long as he did. And that gave fans something to argue about, which is an important part of sports.

Scocca keeps going in this vein:

Regardless, on balance, Jeter’s good hitting helped his team more than his bad fielding hurt it. The statistical ledger says so—by Wins Above Replacement, according to Baseball Reference, his glovework drops him from being the 20th most productive position player of all time to the 58th. Having the 58th most productive career among non-pitchers in major-league history is still a solid achievement.

And still more:

When [Alex] Rodriguez showed up in the Bronx, Jeter would not yield the job. It was a selfish decision and the situation hurt the team. But powerful egos, misplaced competitiveness, and unrealistic self-appraisals are common features in elite athletes. Whatever wrong Jeter may have done in the intrasquad rivalry, it was the Yankees’ fault for not managing him better.

And this:

Like most star athletes of his era, he kept his public persona intentionally blank and dull . . . Depending on their allegiances, baseball fans could imagine him to be classy or imagine him to be pissy, and the limited evidence could support either conclusion.

I love this Scocca post because its hilariousness (which is intentional, I believe) is entirely contingent on its context. Sportswriting is so full of hype (either of the “Jeter is a hero” variety or the “Jeter’s no big whoop” variety or the “Hey, look at my cool sabermetrics” variety or the “Hey, look at what a humanist I am” variety) that it just comes off (to me) as flat-out funny to see a column that just plays it completely straight, a series of declarative sentences that tell it like it is.

Of course, if all the sportswriters wrote like this, it would be boring. But as long as all the others feel they need some sort of angle, this pitch-it-down-the-middle style will work just fine. The confounding of expectations and all that.

P.S. Also this from a commenter to Scocca’s post:

He also inspired people to like baseball again after the lockout and didn’t juice.

Waic for time series

Helen Steingroever writes:

I’m currently working on a model comparison paper using WAIC, and would like to ask you the following question about the WAIC computation:

I have data of one participant that consist of 100 sequential choices (you can think of these data as being a time series). I want to compute the WAIC for these data. Now I’m wondering how I should compute the predictive density. I think there are two possibilities:

(1) I compute the predictive density of the whole sequence (i.e., I consider the whole sequence as one data point, which means that n=1 in Equations (11) – (12) of your 2013 Stat Comput paper.)
(2) I compute the predictive density for each choice (i.e., I consider each choice as one data point, which means that n=# choices in Equations (11) – (12) of your 2013 Stat Comput paper.)

My quick thought was that WAIC is an approximation to leave-one-out cross-validation, and this computation gets more complicated with correlated data.

But I passed the question on to Aki, the real expert on this stuff. Aki wrote:

This is an interesting question, and there is no simple answer.

First we should consider what is your predictive goal:
(1) predict whole sequence for another participant
(2) predict a single choice given all other choices
(3) predict the next choice given the choices in the sequence so far?

If your predictive goal is

(1) then you should note that WAIC is based on an asymptotic argument and it is not generally accurate with n=1. Watanabe has said (personal communication) that he thinks this is not a sensible scenario for WAIC, but if (1) is really your prediction goal, then I think this might be the best you can do. It seems that when n is small, WAIC will usually underestimate the effective complexity of the model, and thus would give over-optimistic performance estimates for more complex models.

(2) WAIC should work just fine here (unless your model says that there is no dependency between the choices, i.e., having 100 separate models, each with n=1). Correlated data here means just that it is easier to predict a choice if you know the previous choices and the following choices. This may make the difference between some models small compared to scenario (1).

(3) WAIC can’t handle this, and you would need to use a specific form of cross-validation (I think I should write a paper on this).
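For reference, the pointwise computation that scenario (2) calls for, following equations (11) and (12) of the 2013 Stat Comput paper, looks roughly like this. A Python sketch; the function name and data layout are mine:

```python
import math

def waic(log_lik):
    """log_lik: a list of S posterior draws, each a list of n pointwise
    log-likelihoods. Returns (WAIC on the deviance scale, p_waic)."""
    S, n = len(log_lik), len(log_lik[0])
    lppd = p_waic = 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        m = max(col)  # log-sum-exp trick for numerical stability
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / S)
        mean = sum(col) / S
        p_waic += sum((c - mean) ** 2 for c in col) / (S - 1)  # posterior variance
    return -2.0 * (lppd - p_waic), p_waic
```

Under choice (2), n is the number of choices and each column holds one choice’s log-likelihood across posterior draws; under choice (1) the whole sequence collapses to n=1 and, as Aki notes, the asymptotics behind WAIC no longer help.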