
“Breakfast skipping, extreme commutes, and the sex composition at birth”


Bhash Mazumder sends along a paper (coauthored with Zachary Seeskin) which begins:

A growing body of literature has shown that environmental exposures in the period around conception can affect the sex ratio at birth through selective attrition that favors the survival of female conceptuses. Glucose availability is considered a key indicator of the fetal environment, and its absence as a result of meal skipping may inhibit male survival. We hypothesize that breakfast skipping during pregnancy may lead to a reduction in the fraction of male births. Using time use data from the United States we show that women with commute times of 90 minutes or longer are 20 percentage points more likely to skip breakfast. Using U.S. census data we show that women with commute times of 90 minutes or longer are 1.2 percentage points less likely to have a male child under the age of 2. Under some assumptions, this implies that routinely skipping breakfast around the time of conception leads to a 6 percentage point reduction in the probability of a male child.

Here are the key graphs. First, showing that people with long commute times are more likely to be skipping breakfast:

[Graph: fraction skipping breakfast, by commute time]

I have no idea how 110% of people are supposed to be skipping breakfast, but whatever.

And, second, showing that people with long commute times are less likely to have boy babies:

[Graph: proportion of boys among children under 2, by commute time]

I have no idea what’s going on with these bars that start at 49.8%, but whatever. Maybe someone can tell these people that it’s ok to plot points, you don’t need big gray bars attached?

Anyway, what can I say . . . I don’t buy it. This second graph, in particular: everything looks too noisy to be useful.

Or, to put it another way: The general hypothesis seems reasonable: when the fetus gets less nourishment, a male fetus is less likely to survive. But this all looks really, really noisy. Add to that the statistical significance filter, and the estimates they report are overestimates.

To put it another way: Get a new data set, and I don’t expect to see the pattern repeat.
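To see why the significance filter produces overestimates, here's a quick simulation sketch. The numbers (a true effect of half a percentage point, a standard error of two points) are invented for illustration and are not taken from the Mazumder and Seeskin paper:

```python
import numpy as np

# Toy illustration of the statistical significance filter (Type M error):
# a small true effect, measured noisily across many hypothetical studies.
rng = np.random.default_rng(0)

true_effect = 0.005          # tiny true shift in Pr(male birth); made up
se = 0.02                    # standard error of each study's estimate; made up
n_studies = 100_000

estimates = rng.normal(true_effect, se, size=n_studies)
significant = np.abs(estimates) > 1.96 * se   # the "p < 0.05" studies

# The estimates that survive the filter greatly overstate the true effect:
mean_sig = np.mean(np.abs(estimates[significant]))
exaggeration = mean_sig / true_effect
print(f"mean |estimate| among significant studies: {mean_sig:.3f}")
print(f"exaggeration ratio (Type M error): {exaggeration:.1f}x")
```

Conditioning on statistical significance, the reported estimates are many times larger than the underlying effect, which is the sense in which a "6 percentage point" figure from a noisy design should be discounted.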

That said, there are papers in this literature that are a lot worse. For example, Mazumder and Seeskin cite a Mathews, Johnson, and Neil paper on the correlation between maternal diet and sex ratio that had a sample size of only 740, which makes it absolutely useless for learning anything at all, given actual effect sizes on sex ratios. They could’ve just as well been publishing random numbers. But that was 2008, back before people knew about these problems. We can only hope that the editors of “Proceedings of the Royal Society B: Biological Sciences” know better today.

Abraham Lincoln and confidence intervals


Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4” was just fine. In his view, “this is the very meaning of ‘confidence.’”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).

None of this is an argument over statistical practice. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any particular interval that you might see.
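For concreteness, here is a minimal simulation of what the 95% does and does not refer to. The settings (true mean 0.25, known sd 1, n = 100) are made up:

```python
import numpy as np

# The "95%" is a property of the procedure under repeated sampling,
# not a probability statement about any one realized interval.
rng = np.random.default_rng(1)

true_mean, sd, n = 0.25, 1.0, 100
n_reps = 10_000

covered = 0
for _ in range(n_reps):
    sample = rng.normal(true_mean, sd, size=n)
    center = sample.mean()
    halfwidth = 1.96 * sd / np.sqrt(n)   # known-sigma interval, for simplicity
    lo, hi = center - halfwidth, center + halfwidth
    covered += (lo <= true_mean <= hi)

print(f"coverage over {n_reps} repeated samples: {covered / n_reps:.3f}")
# Any single realized interval, say [0.1, 0.4], either contains 0.25 or it
# doesn't; the 95% applies to the averaging over replications.
```

The loop is statement A made executable; statement C would be a claim about one particular `(lo, hi)` pair after conditioning on it, which is exactly what the coverage calculation does not license.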

P.S. More here.

I’m only adding new posts when they’re important . . . and this one’s really important.


Durf Humphries writes:

I’m a fact-checker and digital researcher in Atlanta. Your blog has been quite useful to me this week. Your statistics and explanations are impressive, but the decision to ornament your articles with such handsome cats? That’s divine genius and it’s apparent that these are not random cats, but carefully curated critters that complement the scholarship. Bravo.

He adds:

I’ve also attached a picture of my cats that you are welcome to use. Their names are Peach (the striped one) and Pancake (the very, very dark gray one).

How best to partition data into test and holdout samples?

[Image: recommended test/holdout partition sizes]

Bill Harris writes:

In “Type M error can explain Weisburd’s Paradox,” you reference Button et al. 2013. While reading that article, I noticed figure 1 and the associated text describing the 50% probability of failing to detect a significant result with a replication of the same size as the original test that was just significant.

At that point, something clicked: what advice do people give for holdout samples, for those who test that way?

R’s Rattle has a default partition of 70/15/15 (percentages). The page I linked to recommends at least a 20% holdout — 50% if you have a lot of data.

Seen in the light of Button 2013 and Gelman 2016, I wonder if it’s more appropriate to have a small training sample and a larger test or validation sample. That way, one can explore data without too much worry, knowing that a significant number of results could be spurious, but the testing or validation will catch that. With a 70/15/15 or 80/20 split, you risk wasting test subjects by finding potentially good results and then having a large chance of rejecting the result due to sampling error.

My reply:

I’m not so sure about your intuition. Yes, if you hold out 20%, you don’t have a lot of data to be evaluating your model and I agree with you that this is bad news. But usually people do 5-fold cross-validation, right? So, yes you hold out 20%, but you do this 5 times, so ultimately you’re fitting your model to all the data.

Hmmmm, but I’m clicking on your link and it does seem that people recommend this sort of one-shot validation on a subset (see image above). And this does seem like it would be a problem.

I suppose the most direct way to check this would be to run a big simulation study trying out different proportions for the test/holdout split and seeing what performs best. A lot will depend on how much of the decision making is actually being done at the evaluation-of-the-holdout stage.

I haven’t thought much about this question—I’m more likely to use leave-one-out cross-validation as, for me, I use such methods not for model selection but for estimating the out-of-sample predictive properties of models I’ve already chosen—but maybe others have thought about this.

I’ve felt for a while (see here and here) that users of cross-validation and out-of-sample testing often seem to forget that these methods have sampling variability of their own. The winner of a cross-validation or external validation competition is just the winner for that particular sample.
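As a rough illustration of that variability, here's a sketch comparing a one-shot 15% holdout estimate of out-of-sample error with a 5-fold cross-validation average. The data-generating process and split sizes are invented:

```python
import numpy as np

# Compare the spread of two estimators of out-of-sample MSE:
# a single 85/15 split versus 5-fold CV on the same data.
rng = np.random.default_rng(2)

def simulate_error_estimates(n=200, n_sims=500):
    one_shot, five_fold = [], []
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = 2.0 * x + rng.normal(size=n)       # toy truth: y = 2x + noise
        idx = rng.permutation(n)

        # One-shot 85/15 split:
        test = idx[: n * 15 // 100]
        train = idx[n * 15 // 100:]
        b = np.polyfit(x[train], y[train], 1)
        one_shot.append(np.mean((y[test] - np.polyval(b, x[test])) ** 2))

        # 5-fold cross-validation on the same data:
        folds = np.array_split(idx, 5)
        mses = []
        for k in range(5):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            b = np.polyfit(x[tr], y[tr], 1)
            mses.append(np.mean((y[te] - np.polyval(b, x[te])) ** 2))
        five_fold.append(np.mean(mses))
    return np.std(one_shot), np.std(five_fold)

sd_one, sd_cv = simulate_error_estimates()
print(f"sd of one-shot 15% holdout MSE estimate: {sd_one:.3f}")
print(f"sd of 5-fold CV MSE estimate:            {sd_cv:.3f}")
```

In this toy setup the one-shot estimate is noticeably noisier than the fold-averaged one, which is the intuition behind preferring k-fold schemes over a single small holdout when the holdout is doing real decision-making work.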

P.S. My main emotion, though, when receiving the above email was pleasure or perhaps relief to learn that at least one person is occasionally checking to see what’s new on my published articles page! I hadn’t told anyone about this new paper so it seems that he found it there just by browsing. (And actually I’m not sure of the publication status here: the article was solicited by the journal but then there’ve been some difficulties, we’ve brought in a coauthor . . . who knows what’ll happen. It turns out I really like the article, even though I only wrote it as a response to a request and I’d never heard of Weisburd’s paradox before, but if the Journal of Quantitative Criminology decides it’s too hot for them, I don’t know where I could possibly send it. This happens sometimes in statistics, that an effort in some very specific research area or sub-literature has some interesting general implications. But I can’t see a journal outside of criminology really knowing what to do with this one.)




P.S. In the comment thread, Peter Dorman has an interesting discussion of Carlsen’s errors so far during the tournament.

Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness


Bayesian data analysis, as my colleagues and I have formulated it, has a human in the loop.

Here’s how we put it on the very first page of our book:

The process of Bayesian data analysis can be idealized by dividing it into the following three steps:

1. Setting up a full probability model—a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.
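As a miniature of the three steps, here is a sketch using a conjugate normal model and a simple posterior predictive check. All numbers are invented, and this is a toy, not a substitute for a real analysis:

```python
import numpy as np

rng = np.random.default_rng(4)

# Step 1: the model. y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 10).
# Fake data standing in for observations:
y = rng.normal(2.0, 1.0, size=50)

# Step 2: conditioning on the data. Conjugate normal-normal posterior for mu:
prior_sd, sigma, n = 10.0, 1.0, len(y)
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (n * y.mean() / sigma**2)
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)

# Step 3: checking the fit, via a posterior predictive p-value for a test
# statistic (here, the sample maximum):
y_rep = rng.normal(mu_draws[:, None], sigma, size=(4000, n))
p_value = np.mean(y_rep.max(axis=1) >= y.max())
print(f"posterior mean: {post_mean:.2f}, ppc p-value for max: {p_value:.2f}")
```

An extreme p-value at step 3 would send us back to step 1 to alter or expand the model, which is the loop the discussion below is asking whether an AI could close on its own.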

How does this fit in with goals of performing statistical analysis using artificial intelligence? Lots has been written on “machine learning” but in practice this often captures just part of the process. Here I want to discuss the possibilities for automating the entire process.

Currently, human involvement is needed in all three steps listed above, but in different amounts:

1. Setting up the model involves a mix of look-up and creativity. We typically pick from some conventional menu of models (linear regressions, generalized linear models, survival analysis, Gaussian processes, splines, BART, etc etc). Tools such as Stan allow us to put these pieces together in unlimited ways, in the same way that we can formulate paragraphs by putting together words and sentences. Right now, a lot of human effort is needed to set up models in real problems, but I could imagine an automatic process that constructs models from parts, in the same way that there are computer programs to write sports news stories.

2. Inference given the model is the most nearly automated part of data analysis. Model-fitting programs still need a bit of hand-holding for anything but the simplest problems, but it seems reasonable to assume that the scope of the “self-driving inference program” will gradually increase. Just for example, we can automatically monitor the convergence of iterative simulations (that came in 1990!) and, with NUTS, we don’t have to tune the number of steps in Hamiltonian Monte Carlo. Step by step, we should be able to make our inference algorithms more automatic, also with automatic checks (for example, based on adaptive fake-data simulations) to flag problems when they do appear.

3. The third step—identifying model misfit and, in response, figuring out how to improve the model—seems like the toughest part to automate. We often learn of model problems through open-ended exploratory data analysis, where we look at data to find unexpected patterns and compare inferences to our vast stores of statistical experience and subject-matter knowledge. Indeed, one of my main pieces of advice to statisticians is to integrate that knowledge into statistical analysis, both in the form of formal prior distributions and in a willingness to carefully interrogate the implications of fitted models.
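To make the automatic convergence monitoring of step 2 concrete, here is a simplified version of the Gelman and Rubin R-hat computation, run on fake chains (this is not the exact split-R-hat that Stan uses):

```python
import numpy as np

rng = np.random.default_rng(5)

def rhat(chains):
    """Simplified potential scale reduction factor.
    chains: array of shape (m, n) = m chains of n draws each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

mixed = rng.normal(0, 1, size=(4, 1000))             # chains agree
stuck = mixed + np.array([[0.], [0.], [0.], [3.]])   # one chain off target
print(f"R-hat, well-mixed chains: {rhat(mixed):.2f}")
print(f"R-hat, one stuck chain:   {rhat(stuck):.2f}")
```

R-hat near 1 for the well-mixed chains and well above 1 for the stuck set is the kind of automatic red flag the self-driving inference program relies on.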

How would an AI do step 3? One approach is to simulate the human in the loop by explicitly building a model-checking module that takes the fitted model, uses it to make all sorts of predictions, and then checks this against some database of subject-matter information. I’m not quite sure how this would be done, but the idea is to try to program up the Aha process of scientific revolutions.

The conscious brain: decision-making homunculus or interpretive storyteller?

There is another way to go, though, and I thought of this after seeing Julien Cornebise speak at Google about a computer program that his colleagues wrote to play video games. He showed the program “figuring out” how to play a simulation of the 1970s arcade classic, Breakout. What was cool was not just how it could figure out how to position the cursor to always get to the ball on time, but how the program seemed to learn strategies: Cornebise pointed out how, after a while, the program seemed to have figured out how to send the ball up around the blocks to the top, where it would knock out lots of bricks.

OK, fine. What does this have to do with model checking, except to demonstrate that in this particular example no model checking seems to be required as the model does just fine?

Actually, I don’t know on that last point, as it’s possible the program required some human intervention to get to the point that it could learn on its own how to win at Breakout.

But let me continue. For many years, cognitive psychologists have been explaining to us that our conscious mind doesn’t really make decisions as we usually think of it, at least not for many regular aspects of daily life. Instead, we do what we’re gonna do, and our conscious mind is a sort of sportscaster, observing our body and our environment and coming up with stories that explain our actions.

To return to the Breakout example, you could imagine a plug-in module that would observe the game and do some postprocessing—some statistical analysis on the output—and notice that, all of a sudden, the program was racking up the score. The module would interpret this as the discovery of a new strategy, and do some pattern recognition to figure out what’s going on. If this happens fast enough, it could feel like the computer “consciously” decided to try out the bounce-the-ball-along-the-side-to-get-to-the-top strategy.

That’s not quite what the human players do: we can imagine the strategy without it happening yet. But of course the computer could do so too, via a simulation model of the game.

Now let’s return to step 3 of Bayesian data analysis: model checking and improvement. Maybe it’s possible for some big model to be able to learn and move around model space, and to suddenly come across better solutions. This could look like model checking and improvement, from the perspective of the sportscaster part of the brain (or the corresponding interpretive plug-in to the algorithm) even though it’s really just blindly fitting a model.

All that is left, then, is the idea of a separate module that identifies problems with model fit based on comparisons of model inferences to data and prior information. I think that still needs to be built.

On deck this week

The other day someone asked me why we stopped running our On Deck This Week post every Monday morning. I replied that On Deck is not needed because a few months ago I announced all our posts, in order, through mid-January.

See here: My next 170 blog posts (inbox zero and a change of pace).

So to find out what’s next, just click there and scroll down. All the posts are there (except for various topical items that I’ve inserted).

“US Election Analysis 2016: Media, Voters and the Campaign”

Darren Lilleker, Einar Thorsen, Daniel Jackson, and Anastasia Veneti edited this insta-book of post-election analyses.

Actually, at least one of these chapters was written before the election. When the editors asked me if I could contribute to this book, I said, sure, and I pointed them to this article from a few weeks ago, “Trump-Clinton Probably Won’t Be a Landslide. The Economy Says So.” After the election, I changed the tenses of a few verbs and produced “Trump-Clinton was expected to be close: the economy said so.”

I haven’t had a chance to read any of the other chapters, but I glanced at the table of contents and noticed that one of them was by Ken Cosgrove, from Mad Men, writing on brand loyalty and politics. Cool! I knew the guy had published a short story in the Atlantic, so I guess it was only a matter of time before he dabbled in political science as well.

Unfinished (so far) draft blog posts


Most of the time when I start writing a blog post, I continue till it’s finished. As of this writing, this blog has 7128 posts published, 137 scheduled, and only 434 unpublished drafts sitting in the folder.

434 might sound like a lot, but we’ve been blogging for over 10 years, and a bunch of those drafts never really got started.

Anyway, just for your amusement, I thought I’d share the titles of the draft posts, most of which are unfinished and probably will never be finished. They’re listed in reverse chronological order, and I’m omitting all the posts that I hadn’t bothered to title.

Here are the most recent few:

  • Of polls and prediction markets: More on #BrexitFail
  • Deep learning, model checking, AI, the no-homunculus principle, and the unitary nature of consciousness
  • “Simple, Scalable and Accurate Posterior Interval Estimation”
  • ESP and the Bork effect
  • Hey, PPNAS . . . this one is the fish that got away.
  • The new quantitative journalism
  • Trying to make some sense of it all, But I can see that it makes no sense at all . . . Stuck in a local maximum with you
  • Statistical significance, the replication crisis, and a principle from Bill James
  • Is retraction only for “the worst of the worst”?
  • Stan – The Bayesian Data Scientist’s Best Friend [this one’s from Aki]
  • How to think about a study that’s iffy but that’s not obviously crap
  • The identification trap
  • nisbett
  • The penumbra of shooting victims
  • What is the prior distribution for treatment effects in social psychology?
  • You can’t do Bayesian Inference for LDA! [by Bob]
  • Product vs. Research Code: The Tortoise and the Hare [another one from Bob; he was busy that week!]
  • Party like it’s 2005
  • The challenge of constructive criticism
  • We got mooks [This one I actually posted, and then one of my colleagues asked me to take it down because my message wasn’t 100% positive.]
  • Can’t Stop Won’t Stop Splittin
  • Some statistical lessons from the middle-aged-mortality-trends story
  • Humans Can Discriminate More than 1 Trillion Olfactory Stimuli. Not.
  • If I have not seen far, it’s cos I’m standing on the toes of midgets
  • Ovulation and clothing: More forking paths [this one was in the Zombies category and I think we’ve run enough posts on the topic]
  • How to get help with Stan [from Daniel. I don’t know why he didn’t post it.]
  • Running Stan [also from Daniel]
  • Stan taking over the world
  • Why is Common Core so crappy?
  • Attention-grabbing crap, statistics edition
  • Optimistic or pessimistic priors
  • I hate hate hate hate this graph. Not so much because it’s a terrible graph—which it is—but because it’s [Yup, that’s it. I guess I didn’t even finish the title of this one!]
  • Some more statistics quotes!
  • Show more of the time series
  • Just in case there was any confusion
  • “Steven Levitt from Freakonomics describes why he’s obsessed with golf” [Enough already on this guy. — ed.]
  • A statistical communication problem!
  • What should be in an intro stat course?
  • Postdoc opportunities to work with our research group!!
  • When you call me bayesian, I know I’m not the only one
  • The NIPS Experiment [from Bob]
  • Sociology comments
  • When is a knave also a fool?
  • Income Inequality: A Question of Velocity or Acceleration? [by David K. Park]
  • Economics now = Freudian psychology in the 1950s [I already posted on the topic, so this post must be some sort of old draft.]
  • “College Hilariously Defends Buying $219,000 Table”
  • Having a place to put my thoughts
  • ; vs |
  • The (useful) analogy between preregistration of a replication study and randomization in an experiment
  • It’s somewhat about the benjamins [Hey, I like that title!]
  • Intellectuals’ appreciation of old pop culture
  • Is it hype or is it real?
  • Scientific and scholarly disputes
  • Book by tobacco-shilling journalist given to Veterans Affairs employees
  • Alphabetism
  • I ain’t got no watch and you keep asking me what time it is

That takes us back to Oct 2014. Some of these are close to finished and maybe I’ll post soon; others are on topics we’ve already done to death; and some of the others, I have no idea what I was going to say. That last post above, I remember thinking of the idea when I was riding my bike and that Dylan song came on. When I got home, I wrote the title of the post but failed to put anything in the main text box, and now I completely forget my intentions. Too bad, it’s a good title.

P.S. I wrote the above a few months ago and I have a couple more drafts now in the pile.

Individual and aggregate patterns in the Equality of Opportunity research project

Dale Lehman writes:

I’ve been looking at the work of the Equality of Opportunity Project and noticed that you had commented on some of their work.

Since you are somewhat familiar with the work, and since they do not respond to my queries, I thought I’d ask you about something that is bothering me. I, too, was somewhat put off by their repeated use of the word “causation.” But what really concerns me is that it appears that the work is based on taking huge samples (millions of people) and doing the analysis based on aggregations of them into deciles. Isn’t this demonstrating ecological correlation—which would be fine, except that their interpretations all involve predictive and causative statements at the individual level. In other words, they find a close relationship between various aggregate measures—such as the percentile income rank and the percent attending college—and then interpret that correlation as representing individual correlations. The individual correlations are guaranteed to be weaker than the aggregate ones, and perhaps not even in the same direction.

There is significant effort in this work and it will take me a long time to understand exactly what they have done, but I thought you might be able to save me a bunch of time by telling me whether this is something worth pursuing. I would think that these researchers would be well aware of ecological correlations, but I was constantly puzzled by why their scatterplots have so few points when the sample sizes are so large. Finding a strong linear correlation between aggregate measures conveys a compelling story—but it may not be a true story.

My reply: I’m not sure. This update (which Lehman pointed me to) shows a bunch of individual-level results as well. So it reminds me of our Red State Blue State project where we used individual-level data where possible but also examined aggregate patterns.
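As a toy illustration of Lehman's ecological-correlation worry, here is a simulation in which a weak individual-level relationship looks nearly deterministic after averaging within deciles. All numbers are invented; this is not the Equality of Opportunity data:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100_000
parent_rank = rng.uniform(0, 100, size=n)
# Weak individual-level relationship plus lots of noise:
child_outcome = 0.1 * parent_rank + rng.normal(0, 10, size=n)

r_individual = np.corrcoef(parent_rank, child_outcome)[0, 1]

# Aggregate into deciles of parent rank and correlate the decile means:
deciles = (parent_rank // 10).astype(int)
means_x = np.array([parent_rank[deciles == d].mean() for d in range(10)])
means_y = np.array([child_outcome[deciles == d].mean() for d in range(10)])
r_aggregate = np.corrcoef(means_x, means_y)[0, 1]

print(f"individual-level correlation: {r_individual:.2f}")
print(f"decile-level correlation:     {r_aggregate:.2f}")
```

With large samples, averaging within deciles washes out the individual-level noise, so a scatterplot of ten points can show a near-perfect line even when the person-level correlation is modest. That doesn't make aggregate analysis wrong, but it is why decile plots can't be read as statements about individuals.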

Know-it-all neuroscientist explains Trump’s election victory to the rest of us

Alex Gamma points us to the latest atrocity of pseudo-science in the popular press:
Continue reading ‘Know-it-all neuroscientist explains Trump’s election victory to the rest of us’ »

Good news! PPNAS releases updated guidelines for getting a paper published in their social science division

[Image: the “updated guidelines”]

From zero to Ted talk in 18 simple steps: Rolf Zwaan explains how to do it!

The advice is from 2013 but I think it still just might work. Here’s Zwaan:

How to Cook up Your Own Social Priming Article

1. Come up with an idea for a study. Don’t sweat it. It’s not as hard as it looks. All you need to do is take an idiomatic expression and run with it. Here we go: the glass is half-full or the glass is half-empty.

2. Create a theoretical background. Surely there is some philosopher (preferably a Greek one) who has said something remotely relevant about optimists and pessimists while staring at a wine glass. Include him. For extra flavor you might want to add an anthropologist or a sociologist into the mix; Google is your friend here. Top it off with a few social psychology references. There, you have your theoretical framework. That wasn’t so hard, was it?

3. Think of a manipulation. Again, this is nothing to get nervous about. All you need to do is take the expression literally. Imagine this scenario. The subject is in a room. In the glass-full condition, a confederate comes in with an empty glass and a bottle of water. She then pours the glass half full and leaves the room. In the glass-half-empty condition, she comes in with a full glass and a bottle. She then pours half the glass back into the bottle and leaves.

4. Think of a dependent measure. This is where the fun begins. As you may know, the dependent measure of choice in social priming research is candy. You simply cannot go wrong with candy! So let’s say the subjects get to choose ten differently colored pieces of candy from a container that has equal numbers of orange and brown M&Ms. Your prediction here is that people in the half-full condition will be more likely to pick the cheery orange M&Ms than those in the half-empty condition, who will tend to prefer the gloomy brown ones.

5. Get a sample. You don’t want to overdo it here. About 30 students from a nondescript university will do nicely. Only 30 in a between-subjects design?, you worry. Worry no more. This is how we roll in social priming.

6. Run Experiment 1. Don’t fuss about issues like the age and gender of the subjects and details of the procedure; you won’t be reporting them anyway.

7. Analyze the results. Normally, you’d worry that you might not find an effect. But this is social priming remember? You are guaranteed to find an effect. In fact, your effect size will be around .8. That’s social priming for you!

8. Now on to Experiment 2. Come up with a new manipulation. What’s wrong with the glass and bottle from Experiment 1?, you might wonder. Are you kidding? This is social priming research. You need a new dependent measure. Just let your imagination run wild. How about balloons? In the half-full condition, the confederate walks in with an inflated balloon and lets half the air out in front of the subject. In the half empty condition, she half-inflates a balloon. And bingo! You’re done (careful with the word bingo, by the way; it makes people walk real slow).

9. Think of a new dependent measure. Why not have the subjects list their favorite TV shows? Your prediction here is that the half-full condition will list more sitcoms like Seinfeld and Big Bang Theory than the half-empty condition, which will list more crime shows like CSI and Law & Order (or maybe one of those stupid vampire shows). You could also include a second dependent measure. How about having subjects indicate how much they identify with Winnie de Pooh characters? Your prediction here is obvious: the half full condition will identify with Tigger the most while the half empty condition will prefer Eeyore by a landslide.

10. Repeat steps 5-7.

11. Now you are ready to write your General Discussion. You want to discuss the implications of your research. Don’t be shy here. Talk about the major implications for business, health, education, and politics this research so evidently has.

12. For garnish, add a quirky celebrity quote. Don’t work yourself into a lather. Just go to a quotation site to find a quote. Here, I already did the work for you: “Some people see the glass half full. Others see it half empty. I see a glass that’s twice as big as it needs to be.” ― George Carlin. Just say something clever like: Unless you are like George Carlin, it does make a difference whether the glass is half empty or half full.

13. The next thing you need is an amusing title. And here your preparatory work really pays off. Just use the expression from Step 1 as your main title, describe your (huge) effect in the subtitle, and you’re done: Is the Glass Half Empty or Half Full? The Effect of Perspective on Mood.

14. Submit to a journal that regularly publishes social priming research. They’ll eat it up.

15. Wax poetically about your research in the public media. If it wasn’t a good idea to be modest in the general discussion, you really need to let loose here. Like all social priming research, your work has profound consequences for all aspects of society. Make sure the taxpayer (and your Dean, haha) knows about it.

16. If bloggers are critical about your work, just ignore them. They’re usually cognitive psychologists with nothing better to do.

17. Once you’ve worked through this example, you might try your hand at more advanced topics like coming out of the closet. Imagine all the fun you’ll have with that one!

18. Good luck!

This is all so perfect, I just have nothing to add. You know how journals have style guides, and instructions on what sort of papers they like to publish? Wouldn’t it be just perfect if PPNAS (see here or here or here or . . .) linked to the Rolf Zwaan page, completely deadpan, saying this is the path to getting a paper published in their social science division?

Thinking more seriously about the design of exploratory studies: A manifesto


In the middle of a long comment thread on a silly Psychological Science paper, Ed Hagen wrote:

Exploratory studies need to become a “thing.” Right now, they play almost no formal role in social science, yet they are essential to good social science. That means we need to put as much effort in developing standards, procedures, and techniques for exploratory studies as we have for confirmatory studies. And we need academic norms that reward good exploratory studies so there is less incentive to disguise them as confirmatory.


The problem goes like this:

1. Exploratory work gets no respect. Do an exploratory study and you’ll have a difficult time getting it published.

2. So, people don’t want to do exploratory studies, and when someone does do an exploratory study, he or she is motivated to cloak it in confirmatory language. (Our hypothesis was Z, we did test Y, etc.)

3. If you tell someone you will interpret their study as being exploratory, they may well be insulted, as if you’re saying their study is only exploration and not real science.

4. Then there’s the converse: it’s hard to criticize an exploratory study. It’s just exploratory, right? Anything goes!

And here’s what I think:

Exploration is important. In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s a good idea for data to be collected with exploration in mind.

But exploration, like anything else, can be done well or it can be done poorly (or anywhere in between). To describe a study as “exploratory” does not get it off the hook for problems of measurement, conceptualization, etc.

For example, Ed Hagen in that thread mentioned that horrible ovulation and clothing paper, and its even more horrible followup where the authors pulled the outdoor temperature variable out of a hat to explain away an otherwise embarrassing non-replication (which shouldn’t’ve been embarrassing at all given the low low power and many researcher degrees of freedom of the original study which had gotten them on the wrong track in the first place). As I wrote in response to Hagen, I love exploratory studies, but gathering crappy one-shot data on a hundred people and looking for the first thing that can explain your results . . . that’s low-quality exploratory research.

From “EDA” to “Design of exploratory studies”

With the phrase “Exploratory Data Analysis,” the statistician John Tukey named and gave initial shape to a whole new way of thinking formally about statistics. Tukey of course did not invent data exploration, but naming the field gave a boost to thinking about it formally (in the same way that, to a much lesser extent, our decades of writing about posterior predictive checks has given a sense of structure and legitimacy to Bayesian model checking). And that’s all fine. EDA is great, and I’ve written about connections between EDA and Bayesian modeling; see here and here.

But today I want to talk about something different, which is the idea of design of an exploratory study.

Suppose you know ahead of time that your theories are a bit vague and omnidirectional, that all sorts of interesting things might turn up that you will want to try to understand, and you want to move beyond the outmoded Psych Sci / PPNAS / Plos-One model of chasing p-values in a series of confirmatory studies.

You’ve thought it through and you want to do it right. You know it’s time for exploration first and confirmation later, if at all. So you want to design an exploratory study.

What principles do you have? What guidelines? If you look up “design” in statistics or methods textbooks, you’ll find a lot of power calculations, maybe something on bias and variance, and perhaps some advice on causal identification. All these topics are relevant to data exploration and hypothesis generation, but not directly so, as the output of the analysis is not an estimate or hypothesis test.

So I think we—the statistics profession—should be offering guidelines on the design of exploratory studies.

An analogy here is observational studies. Way back when, causal inference was considered to come from experiments. Observational studies were second best, and statistics textbooks didn’t give any advice on the design of observational studies. You were supposed to just take your observational data, feel bad that they didn’t come from experiments, and go from there. But then Cochran, and Rosenbaum, and Angrist and Pischke, wrote textbooks on observational studies, including advice on how to design them. We’re gonna be doing observational studies, so let’s do a good job at them, which includes thinking about how to plan them.

Same thing with exploratory studies. Data-based exploration and hypothesis generation are central to science. Statisticians should be involved in the design as well as the analysis of these studies.

So what advice should we give? What principles do we have for the design of exploratory studies?

Let’s try to start from scratch, rather than taking existing principles such as power, bias, and variance that derive from confirmatory statistics.

– Measurement. I think this has to be the #1 principle. Validity and reliability: that is, you’re measuring what you think you’re measuring, and you’re measuring it precisely. Related: within-subject designs or, to put it more generally, structured measurements. If you’re interested in studying people’s behavior, measure it over and over, ask people to keep diaries, etc. If you’re interested in improving education, measure lots of outcomes, try to figure out what people are actually learning. And so forth.

– Open-endedness. Measuring lots of different things. This goes naturally with exploration.

– Connections between quantitative and qualitative data. You can learn from those open-ended survey responses—but only if you look at them.

– Where possible, collect or construct continuous measurements. I’m thinking of this partly because graphical data analysis is an important part of just about any exploratory study. And it’s hard to graph data that are entirely discrete.

I think much more can be said here. It would be great to have some generally useful advice for the design of exploratory studies.

Anti-immigration attitudes: they didn’t want a bunch of Hungarian refugees coming in the 1950s

In a post entitled “Not that complicated,” sociologist David Weakliem writes:

A few days ago, an article in the New York Times by Amanda Taub said that working-class support for Donald Trump reflected a “crisis of white identity.” Today, Ross Douthat said that it reflected the “thinning out of families.” The basic idea in both was that “working class” (i.e., less educated people’s) opposition to immigration is a symptom of anxiety about something else.

In September 1957, the days of the baby boom and the “affluent society,” when unions were strong and no one was talking about a crisis of white identity or masculinity, the Gallup Poll asked “UNDER THE PRESENT IMMIGRATION LAWS, THE HUNGARIAN REFUGEES WHO CAME TO THIS COUNTRY AFTER THE REVOLTS LAST YEAR HAVE NO PERMANENT RESIDENCE AND CAN BE DEPORTED AT ANY TIME. DO YOU THINK THE LAW SHOULD OR SHOULD NOT BE CHANGED SO THAT THESE REFUGEES CAN STAY HERE PERMANENTLY?”
42% said yes, and 43% said no.

A 1958 Gallup question asked whether the United States should let Hungarian refugees come to this country: 33% approved and 55% disapproved.

With both questions, education made a difference for opinions. For example, in 1958, 55% of the people with a college degree favored letting the refugees come to the United States, compared to 31% of those without college degrees. The only other demographic factor that made a clear difference was that Jews were more likely to favor letting the refugees stay.

The 1957 survey also had a question about the Brown vs. Board of Education decision against school segregation—people who approved were more likely to favor letting the refugees stay. The 1958 survey had a series of questions about whether you would vote for various religious or racial minorities for president—people who were more tolerant were more likely to favor letting the refugees come to the United States.

The Hungarian refugees were white, Christian, and could be seen as part of a clear story of oppression vs. resistance. Despite this, most people, especially less educated people, were not in favor of letting them stay in the United States. So the contemporary opposition to immigration, and the tendency for it to be stronger among less educated people, are not a reflection of something specific to today, but continue a long-standing pattern. Of course, an increase in the number of immigrants today presumably makes the issue more important. But the basic pattern is not new.

[Data from the Roper Center for Public Opinion Research]

Interesting: among those who expressed an opinion, over 60% opposed letting those 65,000 anti-communist Hungarian refugees come to the U.S. And, as Weakliem points out, it’s hard to explain this based on ethnic prejudice, which is how we usually think about earlier anti-immigrant movements such as the Know Nothings of the 1850s.

Just one thing, though: There was a big recession in 1958. So people could’ve been reacting to that. In retrospect the 1958 recession doesn’t seem like much, but at the time people didn’t know if we were going to jump into another great depression.

Only on the internet . . .

I had this bizarrely escalating email exchange.

It started with this completely reasonable message:


I was unable to run your code here:

Besides a small typo [you have a 1 after names (options)], the code fails when you actually run the function. The error I get is a lexical error:

Error: lexical error: invalid character inside string.
{” “:{” “:2016,” “:11,” “:18},”
(right here) ——^

If you could help me understand where things went wrong, I can fix your code for you.


I didn’t remember any such code, but I followed the link, it was to a post from 2015 entitled “Downloading Option Chain Data from Google Finance in R: An Update,” and written by someone named Andrew, but not me.

So I replied:

Hi, that post was not by me!

A few hours later I got this reply in my inbox:

Yes it was. It says “by andrew.” Stop lying, professor.

Huh? Intonation is notoriously difficult to convey in typed speech. This was so over the top it must be someone goofing around. So I replied:

I guess you’re joking, right? I’m not the only person with that name.

And then he shoots back with this:

Unbelievable! Do you take me as a fool?

Ummmm . . . I better not touch that one!

P.S. I got one more email from this guy! He wrote:

Nothing to say? Fine, I will post a blog on this experience. The world will see how you were embarrassed to admit that your code had flaws!

Add another one to the list of professors with big egos and “no flaws.” I am now your 2nd biggest enemy. Your 1st is your own ego.

This was starting to get weird so I sent him one more email:

Hey, no kidding, it was a different Andrew. Follow the link and it goes here:

Downloading Options Data in R: An Update

It’s by someone named Andrew Collier. I’m Andrew Gelman. Different people.

I hope this works. The internet is a scary place.

Sniffing tears perhaps not as effective as claimed


Marcel van Assen has a story to share:

In 2011 a rather amazing article was published in Science where the authors claim that “We found that merely sniffing negative-emotion-related odorless tears obtained from women donors induced reductions in sexual appeal attributed by men to pictures of women’s faces.”
The article is this:
Gelstein, S., Yeshurun, Y., Rozenkrantz, L., Shushan, S., Frumin, I., Roth, Y., & Sobel, N. (2011). Human tears contain a chemosignal. Science, 331(6014), 226-230.

Ad Vingerhoets, an expert on crying, and a coworker Asmir Gračanin were amazed by this result and decided to replicate the study in several ways (my role in this paper was minor, i.e. doing and reporting some statistical analyses when the paper was already largely written). This resulted in:
Gračanin, A., van Assen, M. A., Omrčen, V., Koraj, I., & Vingerhoets, A. J. (2016). Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness? Cognition and Emotion, 1-12.

The paper failed to replicate the findings in the original study.

Original findings that do not get replicated are not special but, unfortunately, core business. What IS striking, however, is the response of Sobel to the article of Gracanin et al. (2016). See:
Sobel, N. (2016). Revisiting the revisit: added evidence for a social chemosignal in human emotional tears. Cognition and Emotion, 1-7.

Sobel re-analyzes the data of Gracanin et al, and after extensive fishing (with p-values just below .05) he concludes that the original study was right and the Gracanin et al study bad. Irrespective of whether chemosignalling actually exists, Sobel’s response is imo a beautiful and honest defense, where p-hacking is explicitly acknowledged and its consequences not understood.

We also wrote a short response to Sobel’s comment, commenting on the p-hacking of Sobel.
Gračanin, A., Vingerhoets, A. J., & van Assen, M. A. (2016). Response to comment on “Chemosignalling effects of human tears revisited: Does exposure to female tears decrease males’ perception of female sexual attractiveness?” Cognition and Emotion, 1-2.

To save time, if you’re interested, I recommend reading Sobel (2016) first.

I asked Assen why he characterized Sobel’s horrible bit of p-hacking as “a beautiful and honest defense,” and he [Assen] responded:

I think it is beautiful (in the sense that I like it) because it is honest. I also think it is a beautiful and excellent example of how one should NOT react to a failed replication, and of NOT understanding how p-hacking works.

This is about emotions; although I was involved in this project, I ENJOYED the comment of Sobel because of its tone and content, even though I did not agree with its content at all.

Our response to Sobel’s comment supports the fact that Sobel has been p-hacking. Vingerhoets asked BEFORE the replication whether it mattered that Tilburg had no lab, and Sobel said ‘no’; AFTERWARDS, when the replication failed, he decided it IS a problem.

None of this is new, of course. By this time we should not be surprised that Science publishes a paper with no real scientific content. As we’ve discussed many times, newsworthiness rather than correctness is the key desideratum in publication in these so-called tabloid journals. The reviewers just assume the claims in submitted papers are correct and then move on to the more important (to them) problem of deciding whether the story is big and important enough for their major journal.

I agree with Assen that this particular case is notable in that the author of the original study flat-out admits to p-hacking and still doesn’t care.
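A minimal simulation makes it concrete why digging through a null dataset for p-values just below .05 proves nothing. The setup below is hypothetical (20 outcome variables, 25 subjects, a z-test with known variance; none of these numbers come from the papers discussed): even when every effect is exactly zero, reporting the smallest of many p-values yields a “significant” result most of the time.

```python
import math
import random

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, with known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # Normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_null_study(n_outcomes=20, n_subjects=25):
    """Simulate a study with many outcomes, all truly null; return the smallest p."""
    ps = []
    for _ in range(n_outcomes):
        sample = [random.gauss(0, 1) for _ in range(n_subjects)]
        ps.append(z_test_p(sample))
    return min(ps)

random.seed(1)
n_sims = 2000
hits = sum(one_null_study() < 0.05 for _ in range(n_sims))
rate = hits / n_sims
print(f"False-positive rate when reporting the best of 20 null outcomes: {rate:.2f}")
# Theory: 1 - 0.95**20 is about 0.64, an order of magnitude above the nominal 5%.
```

So a post-hoc search that surfaces a p-value “just below .05” is roughly what you would expect from pure noise, which is exactly the point of the Gračanin et al. response.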

Gračanin et al. tell it well in their response:

Generally, a causal theory should state that “under conditions X, it holds that if A then B”. Relevant to our discussion in particular and evaluating results of replications in general are conditions X, which are called scope conditions. Suppose an original study concludes that “if A then B”, but fails to specify conditions X, while the hypothesis was tested under condition XO. The replication study subsequently tested under condition XR and concludes that “if A then B” does NOT hold. Leaving aside statistical errors, two different conclusions can be drawn. First, the theory holds in condition XO (and perhaps many other conditions) but not in condition XR. Second, the theory is not valid. We argue that the second explanation should be taken very seriously . . .

They continue:

What seems remarkable and inconsistent is that Sobel regards some of our as well as Oh, Kim, Park, and Cho’s (2012; Oh) findings as strong support for his theory, despite the fact that there was no sad context present in these studies. Apparently, in case of a failure to find corroborating results, the sad context is regarded crucial, but if some of our and Oh’s findings point in the same direction as his original findings, the lack of sad context and exact procedures are no longer important issues.

And this:

Sobel concludes that we did not dig very deep in our data to probe for a possible effect. That is true. We did not try to dig at all. Our aim was to test if human emotional tears act as a social chemosignal, using a different research methodology and with more statistical power than the original study; we were not on a fishing expedition.

I find the defensive reaction of Sobel to be understandable but disappointing. I’m just so so so tired of researchers who use inappropriate statistical methods and then can’t let go of their mistakes.

It makes me want to cry.

Josh Miller hot hand talks in NYC and Pittsburgh this week


Joshua Miller (the person who, with Adam Sanjurjo, discovered why the so-called “hot hand fallacy” is not really a fallacy) will be speaking on the topic this week.

In New York, Thurs 17 Nov, 12:30pm, 19 W 4th St, room 517, Center for Experimental Social Science seminar.

In Pittsburgh, Fri 18 Nov, 12pm, 4716 Posvar Hall, University of Pittsburgh, Experimental Economics seminar.

And here’s the latest version of their paper, Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers.
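The bias Miller and Sanjurjo identify can be seen with a tiny enumeration, sketched here in Python. In a finite sequence of fair coin flips, the proportion of heads immediately following a heads, averaged over the sequences where that proportion is defined, is below 50%. So a shooter whose data show "P(hit | previous hit) = 50%" is actually beating this selection-biased benchmark, which is why the classic "no hot hand" finding flips.

```python
from itertools import product

def prop_h_after_h(seq):
    """Proportion of flips immediately following a heads (1) that are heads,
    or None if the sequence has no heads before its last flip."""
    after_h = [seq[i + 1] for i in range(len(seq) - 1) if seq[i] == 1]
    if not after_h:
        return None
    return sum(after_h) / len(after_h)

# Enumerate all equally likely fair-coin sequences of length 3 (1 = heads, 0 = tails)
# and average the within-sequence proportion over sequences where it is defined.
props = [p for seq in product([0, 1], repeat=3)
         if (p := prop_h_after_h(seq)) is not None]
expected = sum(props) / len(props)
print(expected)  # 5/12 ≈ 0.417, not the "intuitive" 0.5
```

The length-3 case is the worked example from their paper; the same enumeration with a larger `repeat=` shows the bias shrinking but persisting for longer sequences.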

Josh gave a talk on this here a few months ago, and it was great. Lots of data, theory, and discussion. So I recommend you check this out.

“Men with large testicles”

Above is the title of an email I received from Marcel van Assen. We were having a discussion of PPNAS papers—I was relating my frustration about Case and Deaton’s response to my letter with Auerbach on age adjustment in mortality trends—and Assen wrote:

We also commented on a paper in PNAS. The original paper was by Fanelli & Ioannidis on “US studies may overestimate effect sizes in softer research.”

Their analyses smelled, with a rather obscure “to the power .25” of effect size. We re-analyzed their data in a more obvious way, and found nothing.

We were also amazed that the original paper’s authors waved away our re-analyses.

The most interesting PNAS article I [Assen] read last year is the one with Rilling as one of the authors, arguing that men with bigger balls take less care of their children than men with smaller balls (yes, I mean testicles).

It had some p-values very close to .05. At that time I requested their data. I got the data, together with some requests not to share them. I did not follow up on this paper… Recently, I noticed Rilling is also the author of at least one of the neuropsychology papers with huge correlations between a behavioral measure and neural activity.

OK, and here’s the promised cat picture:


Cute, huh?

Kaggle Kernels

Anthony Goldbloom writes:

In late August, Kaggle launched an open data platform where data scientists can share data sets. In the first few months, our members have shared over 300 data sets on topics ranging from election polls to EEG brainwave data. It’s only a few months old, but it’s already a rich repository for interesting data sets.

It’s also a nice place to share reproducible data science. We have built a tool called Kaggle Kernels, which allows data scientists and statisticians to share notebooks and scripts in Python or R on top of the data. If you find analysis you want to extend, you can “fork it,” which gives you a reproducible version without going through the pain of replicating the author’s environment. It’s useful for learning new techniques (by being able to fork and play with others’ code), for sharing your side project with a large community, and for drawing attention to your research and storing it in a way that can be easily reproduced.

He adds:

We don’t support Stan yet but we inevitably will.

Sooner rather than later, I hope!

P.S. Jamie Hall of Kaggle writes:

We’ve got RStan and PyStan ready to go in Kernels now. It would be fantastic to see some examples of the best ways to use them.

P.P.S. Aki has made a Kaggle notebook Bayesian Logistic Regression with rstanarm, and it works just fine.

Stan Webinar, Stan Classes, and StanCon

This post is by Eric.

We have a number of Stan related events in the pipeline. On 22 Nov, Ben Goodrich and I will be holding a free webinar called Introduction to Bayesian Computation Using the rstanarm R Package.

Here is the abstract:

The goal of the rstanarm package is to make it easier to use Bayesian estimation for most common regression models via Stan while preserving the traditional syntax that is used for specifying models in R and R packages like lme4. In this webinar, Ben Goodrich, one of the developers of rstanarm, will introduce the most salient features of the package.

To demonstrate these features, we will fit a model to loan repayments data from Lending Club and show why, in order to make rational decisions for loan approval or interest rate determination, we need a full posterior distribution as opposed to point predictions available in non-Bayesian statistical software.

As part of the upcoming StanCon 2017, we will be teaching a number of classes on Bayesian inference and statistical modeling. Here is the lineup:

  1. Introduction to Bayesian Inference with Stan (2 days): 19 – 20 Jan 2017
  2. Stan for Finance and Econometrics (1 day): 20 Jan 2017
  3. Stan for Pharmacometrics (1 day): 20 Jan 2017
  4. Advanced Stan: Programming, Debugging, Optimizing (1 day): 20 Jan 2017

For Stan users and readers of this blog, please use the code “stanusers” to get a 10% discount.

We hope to see many of you online and in person.