
A research project for you! Using precursor data to evaluate the Leicester odds.

OK, here’s a research project for someone who’s interested in sports statistics. It’s from this comment by Paul in a recent thread:

What I would like to see (has anyone done it?) is an analysis of the performance of EPL teams that had similar pre-season odds to Leicester over the last 15-20 years or so. Even just a plot of the season points behind the champion for that year would give a better idea of just how much of an outlier Leicester was.

As Paul says, this is related to my idea of estimating low probabilities by modeling from precursor data.

So here’s a research project for you which, if successful, is guaranteed to get some attention and could even be influential in getting people to understand the idea of retrospective evaluation of prospective odds:

Step 1: Put together a dataset of points, goal differential, and pre-season betting odds from a bunch of past English soccer league seasons.

Step 2: Make some graphs.

Step 3: Fit some models.

Step 4: Make some graphs.

Do all 4 steps. And then, even if the later steps aren’t perfect, the data are out there and other people can fit their own models.
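The four steps can be sketched in miniature. Everything below is a placeholder: the team labels, odds, and point totals are invented, and a real analysis would assemble historical EPL tables and archived bookmaker odds (the graphs of Steps 2 and 4 are omitted here):

```python
# Sketch of Steps 1-3; all numbers are invented placeholders.

def implied_prob(odds_against):
    """Fractional odds of N-to-1 against imply a break-even probability of 1/(N+1)."""
    return 1.0 / (odds_against + 1.0)

# Step 1: assemble season records (a toy stand-in for real EPL data)
seasons = [
    {"team": "A", "odds": 150,  "points": 61, "champion_points": 87},
    {"team": "B", "odds": 500,  "points": 44, "champion_points": 87},
    {"team": "C", "odds": 1000, "points": 38, "champion_points": 90},
]

# Compute the precursor quantities that Steps 2-4 would graph and model:
for s in seasons:
    s["behind"] = s["champion_points"] - s["points"]   # points behind the champion
    s["p_implied"] = implied_prob(s["odds"])           # pre-season implied probability

# e.g., average deficit among teams that started at 500-1 or longer
longshots = [s for s in seasons if s["odds"] >= 500]
avg_behind = sum(s["behind"] for s in longshots) / len(longshots)
```

With real data, Steps 2 and 4 would plot `behind` against `p_implied` across many seasons, and Step 3 would fit a model to those pairs.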

NBA is hiring; no height requirement

Jason Rosenfeld writes:

I’m looking to hire a basketball analyst to join my basketball analytics team here at the NBA League Office in NYC. Looking for someone who is graduating now or graduated recently (probably better suited for an undergrad, though grad students are welcome to reach out as well). Looking for a background in stats, or CS, or math, etc.

The job description is here.

Freak Punts on Leicester Bet

I went over to the Freakonomics website and found this story about Leicester City’s unexpected championship.

Here’s Stephen Dubner:

At the start of this season, British betting houses put Leicester’s chances of winning the league at 5,000-to-1, which seemed, if anything, perhaps too generous. My [Dubner’s] son Solomon again:

SOLOMON DUBNER: What would you say if I told you the Cleveland Browns were going to win the Super Bowl next season?

STEPHEN J. DUBNER: I would say that you’ve started smoking marijuana.

SOLOMON DUBNER: Exactly, and that’s what you would have said if I said Leicester had won the league.

And yet: Leicester did win the league.

OK, so I looked up the odds for the Cleveland Browns winning the Super Bowl next year. I googled *betting odds cleveland browns super bowl*, and this came up:

[Screenshot: table of Super Bowl betting odds by team]

In case you’re curious, the full table is here. The Browns are the biggest underdogs. Odds range from 7.5 to 1 (Patriots) to 200 to 1, with the top teams being Patriots, Seahawks, Steelers, Packers, Broncos, and Panthers, all of which have better than 15 to 1 odds.

If the Browns have 200-1 odds, I’d estimate their probability of winning the Super Bowl as a bit less than 1/200, because of the well-known “longshot bias”: people overbet longshots, and as a result the posted odds on longshots aren’t as extreme as the underlying probabilities would warrant. The striking thing about Leicester City is that this seems to be one longshot where the odds were set too high.
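For readers who want to check the arithmetic, here is a minimal sketch of the odds-to-probability conversion (the function name is mine, not from any betting source):

```python
def implied_prob(odds_against):
    # N-to-1 odds against correspond to a break-even probability of 1/(N+1)
    return 1.0 / (odds_against + 1.0)

browns    = implied_prob(200)    # roughly 0.5%
patriots  = implied_prob(7.5)    # roughly 12%
leicester = implied_prob(5000)   # roughly 0.02%

# The longshot bias says the Browns' true chance is a bit *below* 0.5%;
# Leicester is striking because there the bias apparently ran the other way.
```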

Remember, the point about Leicester City was not that it had long odds, it was that the odds were sooooo long. If they’d had 200-1 odds, they’d be right where the Browns are, in the “smoking marijuana” realm. Even 500-1 and they’d be close to that. 5000-1, that’s another order of magnitude. I don’t know that anyone who’s studied this example would characterize those 5000-1 odds as “perhaps too generous.” The other part of the story was that, as the season went on, bookies were too slow to adjust the odds downward.

I have to admit to being a bit disappointed. Dubner might not believe me here, but in all sincerity I have no desire to find fault with Freakonomics. I clicked over to that page with the hope that they’d done this story right, that they’d’ve dug into the fascinating story of how the oddsmakers got this one wrong, how they went against the traditional longshot bias. I’d bet that Nick Goff, the bookmaker who wrote a couple posts on this topic (see link just above), would be a great radio interview.

I hope the Freakonomics team can do better next time. As I’ve written many times, I respect and admire the outreach that is done by the Freakonomics franchise and I’d like them to live up to their potential. Especially in an example such as betting odds which would seem to be right in their wheelhouse.

P.S. Some commenters argue that the 5000-1 odds were, a priori, fair. I’m doubtful but I’m also open to the possibility that I’m wrong on this one. My key point, I suppose, is that the odds are not necessarily fair, and in writing about Leicester’s surprising success, one should at least consider that the odds were, even prospectively, off.

“The Natural Selection of Bad Science”

That’s the title of a new paper by Paul Smaldino and Richard McElreath which presents a sort of agent-based model that reproduces the growth in the publication of junk science that we’ve seen in recent decades.

Even before looking at this paper I was positively disposed toward it for two reasons. First because I do think there are incentives that encourage scientists to follow the forking paths toward statistical significance and that encourage journalists to publish this sort of thing. And I also see incentives for scientists and journals (and even the Harvard University public relations office; see the P.P.S. here) to simply refuse to even consider the possibility that published results are spurious. The second reason I liked this paper before even reading it is that the second author recently wrote an excellent textbook on Bayesian statistics which in fact I just happened to recommend to a student a few hours ago.

I have some problems with the details of Smaldino and McElreath’s model—in particular, I hate the whole “false positives” thing, and I’d much prefer a model in which effects are variable, as discussed in this recent paper. But overall I think this paper could have useful rhetorical value; I place it in the same category as Ioannidis’s famous “Why most published research findings are false” paper, in that I agree with its general message, even if it’s working within a framework that I don’t find congenial.

In short, I agree with Smaldino and McElreath that there are incentives pushing scientists to conduct, publish, promote, and defend bad science, and I think their model is valuable in demonstrating how that can happen. People like me who have problems with the particulars of the model can create their own models of the scientific process, and I think they (we) will come to similar conclusions.

I hope the Smaldino and McElreath paper gets lots of attention because (a) this can motivate more work in this area, and (b) by giving a systemic explanation for the spread of junk science, it lets individual scientists off the hook somewhat. This might encourage people to change the incentives and it also gives a sort of explanation for why all these well-meaning researchers can be doing so much bad work. One reason I’ve disliked discussions of “p-hacking” is that it makes the perpetrators of junk science out to be bad guys, which in turn leads individual researchers to think, Hey, I’m not a bad guy and my friends aren’t bad guys, we’re not p-hacking, therefore our work is ok. I’m hoping that ideas such as the garden of forking paths and this new paper will give researchers permission to critically examine their own work and the work of their friends and consider the possibility that they’re stuck in a dead end.

I fear it’s too late for everyone involved in himmicanes, beauty and sex ratios, ovulation and clothing, embodied cognition, power pose, etc.—but maybe not too late for the next generation of researchers, or for people who are a little less professionally committed to particular work.

The way we social science now

This is a fun story. Jeff pointed me to a post on the sister blog by Christopher Hare and Robert Lupton, entitled “No, Sanders voters aren’t more conservative than Clinton voters. Here’s the data.”

My reaction: “Who would ever think that Sanders supporters are more conservative than Clinton supporters? That’s counterintuitivism gone amok.”

It turned out that Hare and Lupton were responding to a recent newspaper op-ed by Chris Achen and Larry Bartels, who had written:

In a survey conducted for the American National Election Studies in late January, supporters of Mr. Sanders were more pessimistic than Mrs. Clinton’s supporters about “opportunity in America today for the average person to get ahead” and more likely to say that economic inequality had increased.

However, they were less likely than Mrs. Clinton’s supporters to favor concrete policies that Mr. Sanders has offered as remedies for these ills, including a higher minimum wage, increasing government spending on health care and an expansion of government services financed by higher taxes.

Hare and Lupton argue that Achen and Bartels had made a mistake in their data analysis:

The study asked Democrats, independents, and Republican respondents alike to say which Democratic primary candidate they preferred: Hillary Clinton, Martin O’Malley, Bernie Sanders, “another Democratic candidate,” or none.

More than twice as many Republican respondents chose Sanders as chose Clinton.

That means that in analyzing this group of Sanders “supporters,” Achen and Bartels were examining a group that may well have been farther to the right than actual Sanders voters. We don’t believe that the ANES Republican respondents were actually Sanders backers. We think it’s far more likely that they just strongly dislike Hillary Clinton.

Hare and Lupton look at several issue attitudes (ugly-ass bar charts but, hey, no one’s perfect) and conclude:

On most issues, actual Sanders supporters – Democrats and independents who voted for him or would be likely to, as opposed to Republicans who are holding their noses and selecting the Democrat they dislike least – are indeed to the left of Clinton supporters. Primary voters are in fact able to pick the candidate whose positions they find most ideologically compatible. And that lines up more accurately with other scholarly evidence.

I’m with Hare and Lupton on the substance, as I’ve said before in other contexts (for example in my disagreement with Bartels’s claim that flashing a subliminal smiley face on a computer screen can induce big changes in attitudes toward immigration), but I think they’re being a bit unfair in titling their post. As far as I could tell, Achen and Bartels never claimed that Sanders voters are more conservative than Clinton voters. Achen and Bartels’s main point was that the Sanders/Clinton choice is predicted more from demographics than from issue attitudes:

Exit polls conducted in two dozen primary and caucus states from early February through the end of April reveal only modest evidence of ideological structure in Democratic voting patterns, but ample evidence of the importance of group loyalties.

Mr. Sanders did just nine points better, on average, among liberals than he did among moderates. By comparison, he did 11 points worse among women than among men, 18 points worse among nonwhites than among whites and 28 points worse among those who identified as Democrats than among independents.

It is very hard to point to differences between Mrs. Clinton and Mr. Sanders’s proposed policies that could plausibly account for such substantial cleavages. They are reflections of social identities, symbolic commitments and partisan loyalties.

This seems reasonable enough, indeed it’s broadly consistent with a “reasoning voter” model, in that Clinton and Sanders really aren’t so far apart on issues. This is related to the general point that primary elections are hard to predict.

So, overall, I don’t think Achen/Bartels and Hare/Lupton are that far apart in their analysis of the Clinton-Sanders divide, even though the two pairs of political scientists are coming from much different perspectives.

This is social science

What interests me most about this story, though, is not the content but the medium. This is a political science debate, conducted by serious academic researchers using real data, and it’s taking place in the newspapers: in particular, a New York Times online op-ed and a Washington Post blog.

Traditionally we in academia have thought of op-eds as a way to publicize our research: we do a study, write a paper, publish the journal article, and then try (usually without much success) to get some news coverage or to get some op-eds. I remember with Red State Blue State we wrote the research article, we wrote the academic book, then we notified journalists and we wrote some newspaper articles—not a lot, but I did get something in the Wichita Eagle, I think, which was appropriate because we did have some Kansas material in our book.

Nowadays, though, more and more, we don’t bother with the research article. What’s the point of knocking yourself out writing a jargon-free academic article, then struggling through a years-long referee and revision process, then finally reaching success! and finding that nobody reads the journal anyway. I’d rather post—that is, publish—my findings here and on the sister blog, thus reaching more people than would read the journal and getting useful comments to boot. Even when we do write academic papers, we’ll still typically present our work online first, to get the ideas out there and to get immediate feedback.

That’s what Achen and Bartels did. Actually they published online in the New York Times, which has more prestige and, I assume, much more readership than we get here, but it’s the same general idea.

And then Hare and Lupton read that op-ed, did their own analysis, and published right back. This is great. What would’ve taken two years using traditional academic publishing took only one week under this new regime.

As with old-style publishing, there remains a link-forward problem: Anyone who reads Hare and Lupton can click to read Achen and Bartels, but the reverse does not hold. Of the thousands of people who read Achen and Bartels when it came out, most will never see Hare and Lupton. It would make sense for Achen and Bartels to add a link to their article so that future readers can see Hare and Lupton’s critique, but (a) I don’t know if the Times likes to add links a week after an article appears, and (b) Achen and Bartels might not want to link to criticism. Perhaps, though, they will do the bloggy good thing of following up with a post discussing Hare and Lupton.

Under the old system, Achen and Bartels could’ve submitted a paper to a journal, and it’s possible that the problem with including Republicans in the data analysis would’ve been caught in the review process. So that would’ve been good. More likely, though, they never would’ve submitted this as is. Rather, they would’ve had to bundle it with some other findings. One thing that’s not so much discussed when it comes to academic publishing is the importance of framing and packaging: You need to make the case that what you’ve done is a big deal. On balance I don’t know if this need to package is good or bad. It’s bad in that it encourages hype, and it discourages the publication of solid but not counterintuitive results. But it’s good in that it pushes researchers to place their work in context and to think about the big picture. Maybe all this blogging is not so good for my own research, for example; I don’t know.

From this perspective, the modern practice of publishing research in the newspaper and not in scholarly journals is just a continuation of a trend toward shorter, less contextual pieces. Little snippets of research that need to be gathered up and synthesized.

Anyway, I think this is an interesting case study. It’s a great example of post-publication review. I’m glad that Achen and Bartels published their findings right away, giving Hare and Lupton a chance to correct some of their analysis (even if, as I said, I think Hare and Lupton caricatured Achen and Bartels’s work a bit). All on the foundation of shared, public data.

But the new system is far from perfect. Two problems are access and a bias toward findings that are counterintuitive (i.e., often false). Access: If you’re Andrew Gelman or Tyler Cowen or one of our friends, you can post your results whenever and as often as you’d like. Chris Achen and Larry Bartels are pretty well connected too and were able to post at the New York Times. Other researchers might have a more difficult time getting people to read their papers. Yes, academic journals have gatekeepers too but of a different sort. Bias toward the counterintuitive: Newspapers want to publish news, so it’s a plus to make a claim that at first might seem surprising. As noted, this is also a bias toward error.

In any case, this is the mode of publication we’re moving toward, so it’s good for us to understand its strengths and weaknesses, in order to do better. For example, the Monkey Cage has made efforts both to widen access and to favor the true over the sensational.

P.S. Achen and Bartels follow up, pointing out some problems in Hare and Lupton’s analysis. I like to see these responses by blog; this is so much cleaner than having to fight with journals to get letters published.

“Replication initiatives will not salvage the trustworthiness of psychology”


So says James Coyne, going full Meehl.

I agree. Replication is great, but if you replicate noise studies, you’ll just get noise; hence the beneficial effects on science are (a) to reduce confidence in silly studies that we mostly shouldn’t have taken seriously in the first place, and (b) to provide a disincentive for future researchers and journals to publish exercises in noise mining.

In his article, Coyne discusses lots that we’ve seen before, but I find it helpful to see it all in one place. In particular, you get a sense that junk science is at the core of academic psychology. Yes, we can (and should) laugh at himmicanes etc., but within the field of psychology this is no joke.

To say that psychology has problems is not to let other fields off the hook. Biology is notorious for producing “the gene for X” headlines, political science and economics have informed us with a straight face that elections are determined by football games and shark attacks, and so on.

Or take statistics. I don’t think statistics journals publish so many out-and-out wrong results, but I would say that most articles published in our journals are useless.

And I’m not saying psychology is all bad, or mostly bad, just that the bad stuff is out there, continually promoted and defended by the Association for Psychological Science and leading journals in the field.

The only way I’d change Coyne’s article is to add a slam at the field of statistics, which in many ways has prospered by selling our methods as a way of converting uncertainty into certainty. I think we are very much to blame. See the end of this article (the second column on page 55).

Who marries whom?

[Charts: the Pearce and Gambrell occupation/relationship pairing display]

Elizabeth Heyman points us to this display by Adam Pearce and Dorothy Gambrell who write, “We scanned data from the U.S. Census Bureau’s 2014 American Community Survey—which covers 3.5 million households—to find out how people are pairing up.” They continue:

For any selected occupation, the chart highlights the five most common occupation/relationship matchups. (For example, male firefighters most often marry female nurses, while female nurses most often marry managers.) Same-sex occupation/relationship matchups weren’t common enough to reach the top five in any occupation. So the chart also highlights the top male-male and female-female job matchups for each occupation.

I find the graph nearly impossible to read. But the data are fascinating. It could be a good class project to display these data in different ways.

As is often the case, the biggest contribution is not in the graphic itself but in the idea of going through the data to display this information in the first place. What a great idea! I look forward to seeing many different displays of different aspects of these data. And, thanks, U.S. government, for providing this information for us.

Ramanujan notes

A new movie on Ramanujan is coming out; mathematician Peter Woit gives it a very positive review, while film critic Anthony Lane is not so impressed. Both these reactions make sense, I guess (or so I say without having actually seen the movie myself).

I’ll take this as an occasion to plug my article on the Ramanujan principle: Tables are read as crude graphs.

All that really important statistics stuff that isn’t in the statistics textbooks

Kaiser writes:

More on that work on age adjustment. I keep asking myself: where in the Stats curriculum do we teach students this stuff? A class session focused on that analysis teaches students so much more about statistical thinking than anything we have in the textbooks.

I’m not sure. This sort of analysis is definitely statistics, but it doesn’t fit into the usual model-estimate-s.e. pattern of intro stat. Nor does it fit into the usual paradigm of exploratory data analysis which focuses on scatterplots. If we were to broaden exploratory analysis to include time series plots, that would work—but still this would miss the point, in that the usual focus would then be on the techniques of how to make the graph, not on the inquiry. From a conceptual point of view, the analysis I did is not so different from regression. It’s that lumping and splitting thing. And then there’s the age adjustment which is model-based but not in the usual way of intro statistics classes.

There’s an appeal to starting a class with examples such as this, where sample sizes are huge so we can go straight to the deterministic patterns and not get distracted with standard errors, p-values, and so forth.

When I taught intro stat, I did use various examples like this. Another was the log-log graph of metabolism of animals vs. body mass, where again the point was the general near-deterministic relationship, with the variation around the line being secondary and estimates, standard errors, and hypothesis tests not coming in at all. Deb and I have a bunch of these examples in our Teaching Statistics book.
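A minimal version of that example, with invented masses and a made-up noise level, fitting the slope by least squares in log-log space (the exponent built into the simulation is 0.75, Kleiber's value):

```python
import math
import random

random.seed(1)

# Invented data: metabolic rate = 3.4 * mass^0.75, with small multiplicative noise
masses = [0.02, 0.2, 2.0, 70.0, 4000.0]   # kg, roughly mouse to elephant
rates = [3.4 * m ** 0.75 * math.exp(random.gauss(0, 0.05)) for m in masses]

# Least-squares slope in log-log space: log(rate) = log(c) + b * log(mass)
xs = [math.log(m) for m in masses]
ys = [math.log(r) for r in rates]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
# slope comes out near 0.75: the relationship is nearly deterministic,
# and you don't need standard errors or hypothesis tests to see it
```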

It’s not hard to cover such material in class, but there does seem to be a bit of a conceptual gap when trying to link it to the rest of statistics, even at an introductory level but at other levels as well. In our new intro stat course, we’re trying to structure everything in terms of comparisons, and these sorts of examples fit in very well. But where the rubber meets the road is setting up specific skills students can learn so they can practice doing this sort of analysis on their own.

Kaiser follows up with a more specific question:

Also, on a separate topic, have you come across a visual display of confidence intervals that is on the scale of probabilities? It always bugs me that the scale of standard errors is essentially a log scale. Moving from 2 to 3 and from 3 to 4 are displayed as equal units but the probabilities have declined exponentially.

My reply: I like displaying 50% and 95% intervals together, as here:

[Graph: 50% and 95% intervals displayed together]
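Kaiser's point about the standard-error scale can be checked directly with the normal tail probability: equal one-unit steps on the z scale correspond to roughly order-of-magnitude drops in the two-sided p-value.

```python
import math

def two_sided_p(z):
    """Two-sided normal tail probability for an estimate z standard errors from zero."""
    return math.erfc(z / math.sqrt(2))

# Equal steps on the standard-error scale, sharply unequal steps in probability:
#   z = 2 -> p ~ 0.046,  z = 3 -> p ~ 0.0027,  z = 4 -> p ~ 0.00006
tails = {z: two_sided_p(z) for z in (2, 3, 4)}
```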

P.S. Regarding the age-adjustment example that Kaiser mentioned: it just happens that I posted on it recently in the sister blog:

Mortality rates for middle-aged white men have been going down, not up.
But why let the facts get in the way of a good narrative?

It’s frustrating to keep seeing pundits writing about the (nonexistent) increasing mortality rate of middle-aged white men. It’s like Red State Blue State all over again. Just makes me want to scream.

On deck this week

Mon: All that really important statistics stuff that isn’t in the statistics textbooks

Tues: Who marries whom?

Wed: Gray graphs look pretty

Thurs: Freak Punts on Leicester Bet

Fri: Who falls for the education reform hype?

Sat: Taking responsibility for your statistical conclusions: You must decide what variation to compare to.

Sun: Researchers demonstrate new breakthrough in public relations, promoting a study before it appears in Psychological Science or PPNAS

Should he major in political science and minor in statistics or the other way around?

Andrew Wheeler writes:

I will be a freshman at the University of Florida this upcoming fall and I am interested in becoming a political pollster. My original question was whether I should major in political science and minor in statistics or the other way around, but any other general advice would be appreciated.

My reply: I think when it comes to employability, a stat major with a poli sci minor will have more options than the reverse. Stat majors have valuable computing skills; also, most stat majors aren’t interested in politics, so if you are one of those unusual stat majors with political interests, that will make you special.

The “power pose” of the 6th century B.C.

From Selected Topics in the History of Mathematics by Aaron Strauss (1973):

Today Pythagoras is known predominantly as a mathematician. However, in his own day and age (which was also the day and age of Buddha, Lao-Tse, and Confucius), he was looked upon as the personification of the highest divine wisdom by his followers, to whom he preached the immortality of the soul. The whole lot of them were often ridiculed by ancient Greek society as being superstitious, filthy vegetarians. . . .

Pythagorean number theory was closely related to number mysticism. Odd numbers were male while even numbers were female. (Shakespeare: “there is divinity in odd numbers”). Individual integers had their own unique properties:

   1    generator and number of reason
   2    1st female number – number of opinion
   3    1st male number – number of harmony, being composed of unity and diversity
   4    number of justice or retribution (square of accounts)
   5    number of marriage, being composed of the first male and female numbers
   6    number of creation
   7    signified the 7 planets and 7 days in a week
   10    holiest number of all composed of 1+2+3+4 which determine a point, a line, and space respectively (later it was discovered that 10 is the smallest integer n for which there exist as many primes as nonprimes between 1 and n)
   17    the most despised and horrible of all numbers

The rectangles with dimensions 4 x 4 and 3 x 6 are interesting in that the former has area and perimeter equal to 16 and the latter has area and perimeter equal to 18. Possibly 17’s horror was kept under control by being surrounded by 16 and 18.
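A quick search confirms that those two are the only integer-sided rectangles whose area equals their perimeter:

```python
# Rectangles with integer sides a <= b whose area equals their perimeter:
# a*b == 2*(a + b), i.e. b = 2a/(a - 2), so only small values of a can qualify.
matches = [(a, b)
           for a in range(1, 100)
           for b in range(a, 100)
           if a * b == 2 * (a + b)]
# matches == [(3, 6), (4, 4)]: area/perimeter 18 and 16, bracketing 17
```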

The whole book (actually comb-bound lecture notes) is great. It’s too bad Strauss died so young. I pulled it off the shelf to check my memory following this blog discussion. Indeed I’d been confused. I’d remembered 4 being the number of justice and 17 being the evil number, so I just assumed that the Pythagoreans viewed even numbers as male and odd numbers as female.

Just imagine what these ancient Greeks would’ve been able to do, had they been given the modern tools of statistical significance. I can see it now:

Pythagoras et al. (-520). Are numbers gendered? Journal of Experimental Psychology: General, Vol. -2390, pp. 31-36.

Stan on the beach


This came in the email one day:

We have used the great software Stan to estimate bycatch levels of common dolphins (Delphinus delphis) in the Bay of Biscay from stranding data. We found that official estimates are underestimated by a full order of magnitude. We conducted both prior and likelihood sensitivity analyses: the former contrasted several priors for estimating a covariance matrix, and the latter contrasted results from a Negative Binomial and a Discrete Weibull likelihood. The article is available at:

Unfortunately (and I know that this is not truly an excuse), given the journal scope and space constraints most of the modelling with Stan is actually described in two appendices. Data and R scripts are available on github (

Thanking you again for the amazing Stan,

Yours sincerely,

Matthieu Authier

Peltier, H. and Authier, M. and Deaville, R. and Dabin, W. and Jepson, P.D. and {van Canneyt}, O. and Daniel, P. and Ridoux, V. (2016) Small Cetacean Bycatch as Estimated from Stranding Schemes: the Common Dolphin Case in the Northeast Atlantic. Environmental Science & Policy, 63: 7–18, doi:10.1016/j.envsci.2016.05.004

That’s what it’s all about.

When doing causal inference, define your treatment decision and then consider the consequences that flow from it

Danielle Fumia writes:

I am a researcher at the Washington State Institute for Public Policy, and I work on research estimating the effect of college attendance on earnings. Many studies that examine the effect of attending college on earnings control for college degree receipt and work experience. These models seem to violate the practice you discuss in your data analysis book of not including intermediate outcomes in the regression of y (earnings) on the treatment (attending college). However, I’m not sure if this situation is different because only college attendees can obtain a degree, but in any case, this restriction wouldn’t be true for work experience. I fear I am missing an important idea because most studies control for one or both seemingly intermediate outcomes, and I cannot seem to find an explanation after much research.

My reply:

For causal inference, I recommend that, instead of starting by thinking of the outcome, you start by thinking of the intervention. As always, the question with the intervention is “Compared to what?” You want to compare people who attended college to equivalent people who did not attend college. At that point you can think about all the outcomes that flow from this choice. In your regressions, you can control for things that came before the choice, but not after. So you can control for work experience before the forking of paths (college or no college) but not for work experience after that choice. Also, suppose your choice is at age 21: either a student attends college by age 21 or he or she does not. Then that’s the fork. If a non-attender later goes to college at age 25, he or she would still be in the “no college attendance at 21” path.

I’m not saying you have to define college attendance in that way, I’m just saying you have to define it in some way. If you don’t define the treatment, your analysis will (implicitly) define it for you.
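A toy simulation illustrates the point. All the numbers are invented: here college attendance raises earnings by 5 directly and by another 10 through degree completion, so "controlling for" degree, a post-treatment variable, recovers only part of the total effect of the treatment decision:

```python
import random

random.seed(0)

n = 10_000
college = [random.random() < 0.5 for _ in range(n)]
degree = [c and random.random() < 0.6 for c in college]   # only attendees can finish
earnings = [30 + 5 * c + 10 * d + random.gauss(0, 5)
            for c, d in zip(college, degree)]

def mean(xs):
    return sum(xs) / len(xs)

# Total effect of attending: simple difference in means (recovers ~5 + 0.6*10 = 11)
total = (mean([e for e, c in zip(earnings, college) if c])
         - mean([e for e, c in zip(earnings, college) if not c]))

# Comparing only non-degree-holders ("controlling for" degree) misses the
# pathway through degree completion and recovers only the direct ~5
partial = (mean([e for e, c, d in zip(earnings, college, degree) if c and not d])
           - mean([e for e, c in zip(earnings, college) if not c]))
```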

Here’s more you can read on causal inference and regression, from my book with Jennifer. We go into these and other points in more detail.

“99.60% for women and 99.58% for men, P < 0.05.”

Gur Huberman pointed me to this paper by Tamar Kricheli-Katz and Tali Regev, “How many cents on the dollar? Women and men in product markets.” It appeared in something called ScienceAdvances, which seems to be some extension of the Science brand, i.e., it’s in the tabloids!

I’ll leave the critical analysis of this paper to the readers. Just one hint: their information on bids and prices comes from an observational study and an experiment. The observational study comes from real transactions and N is essentially infinity, so the problem is not with p-values or statistical significance, but the garden of forking paths still comes into play, as there is still the selection of which among many possible comparisons to present, and the selection of which among many possible regressions to run. There are also lots of concerns about causal identification, given that they’re drawing conclusions about different numbers of bids and different average prices for products sold by men and women, but they also report that men and women are using different selling strategies. The experiment is N = 116 people on Mechanical Turk, so there we have the usual concerns about interpretation of small nonrepresentative samples.

The paper has many (inadvertently) funny lines; my favorite is this one:

[Screenshot of the quoted line from the paper]

I do not, however, believe this sort of research is useless. To the extent that you’re interested in studying behavior in online auctions—and this is a real industry, so it’s worth some study—it seems like a very sensible plan to gather a lot of data and look at differences between different groups. No need to just compare men and women: you could also compare sellers by age, by educational background, by goals in being on Ebay, and so forth. It’s all good. And, for that matter, it seems reasonable to start by highlighting differences between the sexes—that might get some attention, and there’s nothing wrong with wanting a bit of attention for your research. But it should be possible to present the relevant comparisons in the data in some sort of large grid rather than following the playbook of picking out statistically significant comparisons.

P.S. Some online hype here.

The difference between “significant” and “not significant” is not itself statistically significant: Education edition

In a news article entitled “Why smart kids shouldn’t use laptops in class,” Jeff Guo writes:

For the past 15 years, educators have debated, exhaustively, the perils of laptops in the lecture hall. . . . Now there is an answer, thanks to a big, new experiment from economists at West Point, who randomly banned computers from some sections of a popular economics course this past year at the military academy. One-third of the sections could use laptops or tablets to take notes during lecture; one-third could use tablets, but only to look at class materials; and one-third were prohibited from using any technology.

Unsurprisingly, the students who were allowed to use laptops — and 80 percent of them did — scored worse on the final exam. What’s interesting is that the smartest students seemed to be harmed the most.

Uh oh . . . a report that an effect is in one group but not another. That raises red flags. Let’s read on some more:

Among students with high ACT scores, those in the laptop-friendly sections performed significantly worse than their counterparts in the no-technology sections. In contrast, there wasn’t much of a difference between students with low ACT scores — those who were allowed to use laptops did just as well as those who couldn’t. (The same pattern held true when researchers looked at students with high and low GPAs.)

OK, now let’s go to the tape. Here’s the article, “The Impact of Computer Usage on Academic Performance: Evidence from a Randomized Trial at the United States Military Academy,” by Susan Payne Carter, Kyle Greenberg, and Michael Walker, and here’s the relevant table:

[Table from the paper: estimated effects of computer usage on final-exam scores, broken out by ACT-score third]

No scatterplot of data, unfortunately, but you can see the pattern: the result is statistically significant in the top third but not in the bottom third.

But now let’s do the comparison directly: the difference is (-0.10) - (-0.25) = 0.15, and the standard error of the difference is sqrt(0.12^2 + 0.10^2) = 0.16. Not statistically significant! There’s no statistical evidence of any interaction here.
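That back-of-the-envelope check takes three lines, using the point estimates and standard errors as quoted above:

```python
import math

# Laptop effects, as quoted: -0.10 (se 0.12) in the bottom ACT group,
# -0.25 (se 0.10) in the top group.
diff = (-0.10) - (-0.25)                 # 0.15
se_diff = math.sqrt(0.12**2 + 0.10**2)   # about 0.156
z = diff / se_diff                       # about 0.96, well below 1.96

print(round(diff, 2), round(se_diff, 2), round(z, 2))
```

A z-score of about 1 for the interaction is exactly what you would expect to see by chance even if the true effect were identical in both groups.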

Now back to the news article:

These results are a bit strange. We might have expected the smartest students to have used their laptops prudently. Instead, they became technology’s biggest victims. Perhaps hubris played a role. The smarter students may have overestimated their ability to multitask. Or the top students might have had the most to gain by paying attention in class.

Nonononononono. There’s nothing to explain here. It’s not at all strange that there is variation in a small effect, and they happen to find statistical significance in some subgroups but not others.

As the saying goes, The difference between “significant” and “not significant” is not itself statistically significant. (See here for a recent example that came up on the blog.)

The research article also had this finding:

These results are nearly identical for classrooms that permit laptops and tablets without restriction as they are for classrooms that only permit modified-tablet usage. This result is particularly surprising considering that nearly 80 percent of students in the first treatment group used a laptop or tablet at some point during the semester while only 40 percent of students in the second treatment group ever used a tablet.

Again, I think there’s some overinterpretation going on here. With small effects, small samples, and high variation, you can find subgroups where the results look similar. That doesn’t mean the true difference is zero, or even nearly zero—you still have these standard errors to worry about. When the s.e. is 0.07, and you have two estimates, one of which is -0.17 and one of which is -0.18 . . . the estimates being so nearly identical is just luck.
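A quick simulation makes the "just luck" point concrete. With made-up but representative numbers: even when two subgroups share exactly the same true effect, two estimates each with s.e. 0.07 land within 0.01 of each other only a small fraction of the time.

```python
import random

random.seed(0)
true_effect, se = -0.18, 0.07
trials = 20000
close = sum(
    abs(random.gauss(true_effect, se) - random.gauss(true_effect, se)) <= 0.01
    for _ in range(trials)
)
# The difference of two such estimates has sd sqrt(2)*0.07, about 0.10,
# so near-identical point estimates (within 0.01) occur only ~8% of the time.
print(close / trials)
```

So two estimates of -0.17 and -0.18 are consistent with identical true effects, but equally consistent with true effects that differ by a tenth of a point or more.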

Just to be clear, I’m not trying to “shoot down” this research article nor am I trying to “debunk” the news report. I think it’s great for people to do this sort of study, and to report on it. It’s because I care about the topic that I’m particularly bothered when they start overinterpreting the data and drawing strong conclusions from noise.

Annals of really pitiful spammers

Here it is:

On May 18, 2016, at 8:38 AM, ** <**@**.org> wrote:

Dr. Gelman,

I hope all is well. I looked at your paper on [COMPANY] and would be very interested in talking about having a short followup or a review article about this published in the next issue of the Medical Research Archives. It would be interesting to see a paper with new data since this was published, or any additional followup work you have done. If you could also tell me more about your current projects that would be helpful. The Medical Research Archives is an online and print peer-reviewed journal. The deadlines are flexible. I am happy to asnwer any questions. Please respond at your earliest convenience.

Best Regards,

Medical Research Archives
** ** Avenue
** CA ***** USA


Ummm, I guess it makes sense: if these people actually knew what they were doing, they’d either (a) have some legitimate job, or (b) be running a more lucrative grift.

But if they really really have their heart set on scamming academic researchers, I recommend they join up with Wolfram Research. Go with the market leaders, that’s what I say.

Here’s something I know nothing about

Paul Campos writes:

Does it seem at all plausible that, as per the CDC, rates of smoking among people with GED certificates are double those among high school dropouts and high school graduates?

My reply: It does seem a bit odd, but I don’t know who gets GED’s. There could be correlations with age and region of the country. It’s hard to know what to do with this sort of demographically-unadjusted number.
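For what a demographic adjustment would even look like here: direct standardization reweights each group’s age-specific rates by a common reference age distribution. A minimal sketch with entirely made-up rates (the real CDC comparison would need the actual age-specific numbers):

```python
# All numbers below are hypothetical, for illustration only.
ref_age_dist = {"18-34": 0.35, "35-54": 0.40, "55+": 0.25}

def age_standardized_rate(age_specific_rates, weights):
    # Weighted average of group-specific rates under a shared age distribution.
    return sum(age_specific_rates[g] * weights[g] for g in weights)

ged_rates = {"18-34": 0.45, "35-54": 0.42, "55+": 0.30}
dropout_rates = {"18-34": 0.25, "35-54": 0.28, "55+": 0.20}

print(age_standardized_rate(ged_rates, ref_age_dist))
print(age_standardized_rate(dropout_rates, ref_age_dist))
```

If the GED and dropout populations have very different age (or regional) profiles, the raw two-to-one ratio could shrink or grow considerably after this kind of adjustment.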

Albedo-boy is back!

New story here. Background here and here.

“Lots of hype around pea milk, with little actual scrutiny”

Paul Alper writes:

Had no idea that “Pea Milk” existed, let alone controversial. Learn something new every day.

Indeed, I’d never heard of it either. I guess “milk” is now a generic word for any white sugary drink? Sort of like “tea” is a generic word for any drink made from a powder steeped in boiling water.