
The latest episode in my continuing effort to use non-sports analogies

In a unit about the law of large numbers, sample size, and margins of error, I used the notorious beauty, sex, and power example:

A researcher, working with a sample of size 3000, found that the children of beautiful parents were more likely to be girls, compared to the children of less-attractive parents.

Can such a claim really be supported by the data at hand?

One way to get a sense of this is to consider possible effect sizes. It’s hard to envision a large effect; based on everything I’ve seen about sex ratios, I’d say .005 (i.e., one-half of one percentage point) is an upper bound on any possible difference in Pr(girl) comparing attractive and unattractive parents.

How big a sample size do you need to measure a proportion with that sort of accuracy? Since we’re doing a comparison, you’d need to measure the proportion of girls within each group (attractive or unattractive parents) to within about a quarter of a percentage point, or .0025. The standard deviation of a proportion is .5/sqrt(n), so we need to have roughly .5/sqrt(n)=.0025, or n=(.5/.0025)^2=40,000 in each group.

So, to have any chance of discovering this hypothetical difference in sex ratios, we’d need at least 40,000 attractive parents and 40,000 unattractive—a sample of 80,000 at an absolute minimum.

What the researcher actually had was a sample of 3000. Hopeless.
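To spell out the arithmetic, here's a quick sketch (assuming, purely for illustration, that the 3000 families split evenly into 1500 "attractive" and 1500 "unattractive"):

```python
import math

# Sample size needed to measure each group's Pr(girl) to within 0.0025
target_se = 0.0025
n_per_group = (0.5 / target_se) ** 2
print(n_per_group)  # 40000.0

# What a sample of 3000, split into two groups of 1500, can resolve:
# standard error of the difference between two proportions
se_diff = math.sqrt(2) * 0.5 / math.sqrt(1500)
print(round(se_diff, 4))  # 0.0183, several times larger than the 0.005 effect
```

So even the largest plausible effect is only about a quarter of the noise level in the data at hand.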

You might as well try to weld steel with a cigarette lighter.

Or do embroidery with a knitting needle.

OK, they’re not the best analogies ever. But I avoided sports!


Jonathan Falk asks what I think of this animated slideshow by Matthew Klein on “How Americans Die”:

[Screenshot of Matthew Klein’s animated slideshow, “How Americans Die”]

Please click on the above to see the actual slideshow, as this static image does not do it justice.

What do I think? Here was my reaction:

It is good, but I was thrown off by the very first page, which says that it looks like progress stopped in the mid-1990s, even though on the actual graphs the mortality rate continued to drop after the mid-1990s. The x-axis labeling was also confusing: it took a while for me to figure out that the numbers for the years are not written at the corresponding places on the axes, and I wasn’t clear on what the units are on the y-axis.

I guess what I’m saying is: I like the clever way they tell the story. It’s a straightforward series of graphs but the reader has to figure out where to click and what to do, which makes the experience feel more like a voyage of discovery. The only thing I didn’t like was some of the execution, in that it’s not always clear what the graphs are exactly saying. It’s a good idea and I could see it as a template for future graphical presentations.

It’s also an interesting example because it’s not just displaying data, it’s also giving a little statistics lesson.

Don’t, don’t, don’t, don’t . . . We’re brothers of the same mind, unblind


Hype can be irritating but sometimes it’s necessary to get people’s attention (as in the example pictured above). So I think it’s important to keep these two things separate: (a) reactions (positive or negative) to the hype, and (b) attitudes about the subject of the hype.

Overall, I like the idea of “data science” and I think it represents a useful change of focus. I’m on record as saying that statistics is the least important part of data science, and I’m happy if the phrase “data science” can open people up to new ideas and new approaches.

Data science, like just about any new idea you’ve heard of, gets hyped. Indeed, if it weren’t for the hype, you might not have heard of it!

So let me emphasize that in my criticism of some recent hype, I’m not dissing data science; I’m just trying to help people out a bit by pointing out which of their directions might be more fruitful than others.

Yes, it’s hype, but I don’t mind

Phillip Middleton writes:

I don’t want to rehash the Data Science / Stats debate yet again. However, I find the following post quite interesting from Vincent Granville, a blogger and heavy promoter of Data Science.

I’m not quite sure if what he’s saying makes Data Science a ‘new paradigm’ or not. Perhaps it is reflective of something new apart from classical statistics, but then I would also say the same of Bayesian analysis as paradigmatic (or at least a still-budding movement) itself. But what he alleges – i.e., that ‘Big Data’ by its very existence necessarily implies that the cause of a response/event/observation can be ascertained, and seemingly without any measure of uncertainty – seems rather ‘over-promising’ and hype-ish.

I am a bit concerned with what I’m thinking he implies regarding ‘black box’ methods – that is the blind reliance upon them by those who are technically non-proficient. I feel the notion that one should always trust ‘the black box’ is not in alignment with reality.

He does appear to discuss dispensing with p-values. In a few cases, like SHT, I’m not totally inclined to disagree (for reasons you speak about frequently), but I don’t think we can be quite so universal about it. That would pretty much throw out most every frequentist test with respect to comparison, goodness-of-fit, what have you.

Overall I get the feeling that he’s implying the ‘new’ era as one of solving problems w/ certainty, which seems more the ideal than the reality.

What do you think?

OK, so I took a look at Granville’s post, where he characterizes data science as a new paradigm “very different, if not the opposite of old techniques that were designed to be implemented on abacus, rather than computers.”

I think he’s joking about the abacus but I agree with this general point. Let me rephrase it from a statistical perspective.

It’s been said that the most important thing in statistics is not what you do with the data, but, rather, what data you use. What makes new statistical methods great is that they open the door to the use of more data. Just for example:

- Lasso and other regularization approaches allow you to routinely throw in hundreds or thousands of predictors, whereas classical regression models blow up at that scale. Now, just to push this point a bit: back before there was lasso etc., statisticians could still handle large numbers of predictors; they’d just use other tools, such as factor analysis, for dimension reduction. But lasso, support vector machines, etc., were good because they allowed people to more easily and more automatically include lots of predictors.

- Multiple imputation allows you to routinely work with datasets with missingness, which in turn allows you to work with more variables at once. Before multiple imputation existed, statisticians could still handle missing data but they’d need to develop a customized approach for each problem, which is enough of a pain that it would often be easier to simply work with smaller, cleaner datasets.

- Multilevel modeling allows us to use more data without having that agonizing decision of whether to combine two datasets or keep them separate. Partial pooling allows this to be done smoothly and (relatively) automatically. This can be done in other ways but the point is that we want to be able to use more data without being tied up in the strong assumptions required to believe in a complete-pooling estimate.

And so on.
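A minimal numerical sketch of the first point, using simulated data: ordinary least squares has no unique solution once predictors outnumber observations, whereas a penalized fit goes through fine. (For simplicity this uses a ridge-style penalty, which has a closed form, rather than the lasso itself.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # many more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                       # only a few predictors actually matter
y = X @ beta + rng.normal(scale=0.1, size=n)

# Classical least squares: X'X is rank-deficient, so no unique solution
XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))    # 50, far short of the 200 needed

# Adding a regularization penalty makes the problem well-posed
lam = 1.0
beta_hat = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print(beta_hat.shape)                # (200,)
```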

Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data, you need to know how to scrape and grab and move data from one format into another.

On the other hand, he’s wrong in all the details

But I have to admit that I’m disturbed by how much Granville gets wrong. His buzzwords include “Model-free confidence intervals” (huh?), “non-periodic high-quality random number generators” (??), “identify causes rather than correlations” (yeah, right), and “perform 20,000 A/B tests without having tons of false positives.” OK, sure, whatever you say, as I gradually back away from the door. At this point we’ve moved beyond hype into marketing.

Can we put aside the cynicism, please?

Granville writes:

Why some people don’t see the unfolding data revolution?
They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost, replacing employees by robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.

Ugh. I hate that sort of thing, the idea that people who disagree with you do so for corrupt reasons. So tacky. Wake up, man! People who disagree with you aren’t “afraid of the truth”; they just have different experiences than yours, they have different perspectives. Your perspective may be closer to the truth—as noted above, I agree with much of what Granville writes—but you’re a fool if you so naively dismiss the perspectives of others.

Saying things that are out of place


Basbøll points us to a column by Michael Shermer, a journalist and self-described skeptic who’s written a lot about skepticism, atheism, etc. Recently, though, Shermer wrote of an event that “shook [his] skepticism to its core”—it was a story about an old radio that didn’t work, then briefly started to work again, then stopped working.

From the outside it doesn’t sound like much (and indeed Shermer’s blog commenters aren’t particularly impressed) but, hey, they all laughed at Arthur Conan Doyle when he said he saw pictures of fairies (see image above), but who’s laughing now???

OK, sure, it’s easy to mock Doyle or Shermer (indeed, I couldn’t resist a bit of mockery myself in my comment to Basbøll’s above-linked post) or to move gently from mocking to being patronizing, saying that Doyle’s spiritualism is understandable given the sad events of his life, or that Shermer is being charmingly romantic in telling a story about his bride. Maybe we need a bit more romanticism and sentimentality when it comes to later-in-life weddings.

But I don’t want to take either of these paths here. Instead I’d like to talk about the somewhat unstable way in which we use different sources of discourse in different aspects of life, and the awkwardness that can arise when we use the wrong words at the wrong time.

For an obvious example, I’m all lovey-lovey when I talk with my family (except when we’re screaming at each other, of course), but that wouldn’t be appropriate at work. Sure, sometimes I’ll get overcome with emotion and say “I love you guys” to my class, but it’s pretty rare; it certainly wouldn’t be appropriate to do that every day.

We also cordon off different aspects of inquiry. I have no problem mocking studies of fat arms and voting or whatever, but you’re not gonna see me making fun of the Bible here. Why? It’s not that the Bible is sacred to me, it just seems like it’s on a different dimension. It’s not claiming to be science. It’s a bunch of stories. If people want to believe that Moses crossed the Red Sea, or for that matter that there was an actual Moses or an actual King Arthur or whatever, fine. It doesn’t seem to interact in any direct way with statistical modeling, causal inference, or social science so it’s not particularly relevant to what we’re doing here.

So Shermer did this goofy thing where he was using romantic love discourse in a space where people were expecting science discourse or journalism discourse. It didn’t really work.

Sometimes the practice of unexpected forms of discourse can produce interesting results. For example, say what you want about Scott Adams, but sometimes his offbeat cartoonist’s perspective on public affairs can be interesting. So my message is not that everyone needs to stay in his place, or that Shermer shouldn’t display his romantic love in a column about skepticism. What I’m saying is that our norm is to evaluate statements in their context. When a comedian says something on a sitcom, we evaluate it based on how funny it is (and maybe on how offensive it might be), when a self-declared skeptic writes a column, we evaluate things in that way, etc. A story doesn’t exist on its own, it exists in context.

We tend to assume that since Shermer labels himself a skeptic, he is not superstitious and should “know better” than to believe in ghosts. But maybe that’s a misguided view of Shermer. Perhaps he has strong superstitious feelings but has been convinced that those feelings are unscientific, hence his career as a skeptical journalist; the superstition is still there, though, and bubbles up from time to time.

That’s just a story, of course, but my point is that it’s natural to interpret the Shermer story in terms of some frame or another. The frame “prominent skeptic is stunned by a coincidence” is just one way to interpret this story.

Next Generation Political Campaign Platform?


[This post is by David K. Park]

I’ve been imagining the next generation political campaign platform. If I were to build it, the platform would have five components:

  1. Data Collection, Sanitization, Storage, Streaming and Ingestion: This area will focus on the identification and development of the tools necessary to acquire the correct data sets for a given campaign, sanitizing the data and readying it for ingestion into the analytical components of the framework. This includes geotagged social media data (Twitter, Facebook, Instagram, Pinterest, Vine, etc.) and traditional local news, focused not only on the candidate but on the challenger as well.
    • [side note (and potentially useless) idea: Embed RFID tags or inexpensive sensors in campaign lawn signs so we can measure how many people/cars pass by the sign.]
  2. Referential Data Sets: This area will focus on the identification and development of data sets that are referential in nature. Such sets might be databases and/or services to assist with geolocation, classification, etc. This includes demographic and marketing data, campaign-specified sources such as donors and surveys, as well as data from Catalist and the Atlas Project.
  3. Analytics Engine: This area will focus on the identification and development of the tools necessary to provide the core analytical work for the specific project. Here, I’m thinking of statistical (using Stan, of course), machine learning, and NLP packages, both open source and commercially available. This includes language sentiment analysis, polling trends, and so on.
  4. Model Forms: These would be the models, and the underlying software to drive the models, used within the analysis. We can readily exploit the packages that already exist in this area, whether Python, R, Umbra, etc., and build custom models where necessary. This includes direct marketing impact analysis (i.e., A/B-testing campaigns), overall metrics for campaign health, election prediction, and more.
  5. Interpretation of Results, Data Visualization, and Visual Steering: Identifying and developing the data visualization toolkits necessary to provide insights by adequately displaying visual representations of quantitative and statistical information. Further, solving the problem of getting resultant data sets to the visualization system in a reliable fashion and making this connection tightly coupled and full duplex, allowing a visual-steering model to emerge. This includes geographic and demographic segmentation, an overview of historical political context, results of various marketing messages, etc.

Just a (rough) thought at this point…

The Fallacy of Placing Confidence in Confidence Intervals

Richard Morey writes:

On the tail of our previous paper about confidence intervals, showing that researchers tend to misunderstand the inferences one can draw from CIs, we [Morey, Rink Hoekstra, Jeffrey Rouder, Michael Lee, and EJ Wagenmakers] have another paper that we have just submitted which talks about the theory underlying inference by CIs. Our main goal is to elucidate for researchers why many of the things commonly believed about CIs are false, and to show that the theory of CIs does not offer a very compelling theory for inference.

One thing that I [Morey] have noted going back to the classic literature is how clear Neyman seemed about all this. Neyman was under no illusions about what the theory could or could not support. It was later authors who tacked on all kinds of extra interpretations to CIs. I think he would be appalled at how CIs are used.

From their abstract:

The width of confidence intervals is thought to index the precision of an estimate; the parameter values contained within a CI are thought to be more plausible than those outside the interval; and the confidence coefficient of the interval (typically 95%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and generally lead to incoherent inferences. For this reason, we recommend against the use of the method of CIs for inference.

I agree, and I too have been pushing against the idea that confidence intervals resolve the well-known problems with null hypothesis significance testing. I also had some specific thoughts:

For another take on the precision fallacy (the idea that the width of a confidence interval is a measure of the precision of an estimate), see my post, “Why it doesn’t make sense in general to form confidence intervals by inverting hypothesis tests.” See in particular the graph there, which illustrates the problem very clearly, I think.
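Here’s a self-contained simulation of one classic version of the precision fallacy (my own illustrative example, not taken from the linked post): with two observations drawn from Uniform(θ − 1/2, θ + 1/2), the interval between the two observations is a valid 50% confidence interval, yet a wide interval actually means θ is pinned down more precisely, not less.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.0
N = 200_000
y = rng.uniform(theta - 0.5, theta + 0.5, size=(N, 2))
lo, hi = y.min(axis=1), y.max(axis=1)
width = hi - lo
covers = (lo < theta) & (theta < hi)

print(round(covers.mean(), 2))               # ~0.50: valid 50% coverage overall
print(covers[width > 0.9].mean())            # 1.0: wide intervals always cover
print(round(covers[width < 0.1].mean(), 2))  # ~0.05: narrow ones almost never do
```

If interval width tracked precision, the wide intervals would be the uninformative ones; here it’s exactly the reverse.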

Regarding the general issue that confidence intervals are no inferential panacea, see my recent article, “P values and statistical practice,” in which I discuss the problem of taking a confidence interval from a flat prior and using it to make inferences and decisions.

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% conf interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic. And taking a confidence interval and treating it as a posterior interval gives problems too. Hence the generic advice to look at confidence intervals rather than p-values does not solve the problem.
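As a sketch of that example, here’s the calculation under a normal approximation on the log-odds-ratio scale. The flat-prior interval just reproduces the classical CI; the informative prior (log OR ~ N(0, 0.35²) is my choice purely for illustration, encoding the belief that real-world odds ratios are usually modest) pulls the interval sharply toward 1:

```python
import math

# Normal approximation on the log odds ratio scale
est = math.log(3.0)                                # point estimate
se = (math.log(8.2) - math.log(1.1)) / (2 * 1.96)  # back out the SE from the CI

# Flat prior: the posterior interval just reproduces the classical CI
flat = [math.exp(est - 1.96 * se), math.exp(est + 1.96 * se)]
print([round(x, 1) for x in flat])                 # [1.1, 8.2]

# Illustrative informative prior: log OR ~ N(0, 0.35^2)
prior_sd = 0.35
post_var = 1 / (1 / se**2 + 1 / prior_sd**2)       # precision-weighted combination
post_mean = post_var * (est / se**2)
post = [math.exp(post_mean - 1.96 * math.sqrt(post_var)),
        math.exp(post_mean + 1.96 * math.sqrt(post_var))]
print([round(x, 1) for x in post])                 # roughly [0.8, 2.5]
```

Under this prior the implausible high values near 8 disappear, and the interval is no longer bounded away from 1.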

I think the Morey et al. paper is important in putting all these various ideas together and making it clear what are the unstated assumptions of interval estimation.

Stan at NIPS 2014

For those in Montreal, a few of the Stan developers will be giving talks at the NIPS workshops this week. On Saturday at 9 AM I’ll be talking about the theoretical foundations of Hamiltonian Monte Carlo at the Riemannian Geometry workshop, while Dan will be talking about Stan at the Software Engineering workshop Saturday afternoon at 4 PM. We’ll also have an interactive poster at the Probabilistic Programming workshop on Saturday — it should be an . . . attractive presentation.


If you’re up early, be sure to check out Matt Hoffman talking first thing on Saturday, at 8:30 AM, in the Variational Inference workshop.

Dan and I will be around Thursday night and Friday if anyone wants to grab a drink or talk Stan.

The inclination to deny all variation


One thing we’ve been discussing a lot lately is the discomfort many people—many researchers—feel about uncertainty. This was particularly notable in the reaction of psychologists Jessica Tracy and Alec Beall to our “garden of forking paths” paper, but really we see it all over: people find some pattern in their data and they don’t even want to consider the possibility that it might not hold in the general population. (In contrast, when I criticize these studies, I always make it clear that I just don’t know, that their claim could hold in general, I just don’t see convincing evidence.)

The story seems pretty clear to me (but, admittedly, this is all speculation, just amateur psychology on my part): in general, people are uncomfortable with not knowing and would like to use statistics to create fortresses of certainty in a dangerous, uncertain world.

Along with this is an even more extreme attitude, which is not just to deny uncertainty but to deny variation. We see this sometimes in speculations in evolutionary psychology (a field where much well-publicized work can be summarized by the dictum: Because of evolutionary pressures, all people are identical to all other people, except that all men are different from all women and all white people are different from all black people [I’ve removed that last part on the advice of some commenters; apparently my view of evolutionary psychology has been too strongly influenced by the writings of Satoshi Kanazawa and Nicholas Wade.]). But even in regular psychology this attitude comes up, of focusing on similarities between people rather than differences. For example, we learn from Piaget that children can do X at age 3 and Y at age 4 and Z at age 5, not that some children go through one developmental process and others learn in a different order.

We encountered an example of this recently, which I wrote up, under the heading, “When there’s a lot of variation, it can be a mistake to make statements about ‘typical’ attitudes.” My message there is that sometimes variation itself is the story, but there’s a tendency among researchers to express statements in terms of averages.

But then I recalled an even more extreme example, from a paper by Phoebe Clarke and Ian Ayres that claimed that “sports participation [in high school] causes women to be less likely to be religious . . . more likely to have children . . . more likely to be single mothers.” In my post on this paper a few months ago, I focused on the implausibility of the claimed effect sizes and on the problems with trying to identify individual-level causation from state-level correlations in this example. At the time I recommended they give their results a more descriptive spin, both in their journal article and in their mass-media publicity.

But there was one other point that came up, which I wrote about in my earlier post but want to focus on here. The article by Clarke and Ayres includes the following footnote:

It is true that many successful women with professional careers, such as Sheryl Sandberg and Brandi Chastain, are married. This fact, however, is not necessarily opposed to our hypothesis. Women who participate in sports may “reject marriage” by getting divorces when they find themselves in unhappy marriages. Indeed, Sheryl Sandberg married and divorced before marrying her current husband.

This footnote is a striking (to me) example of what Tversky and Kahneman called the fallacy of “the law of small numbers”: the attitude that patterns in the population should appear in any sample, in this case even in a sample of size 1. Even according to their own theories, Clarke and Ayres should not expect their model to work in every case! The above paragraph indicates that they want their theory to be something it can’t be; they want it to be a universal explanation that works in every example. Framed that way, this is obvious. My point, though, is that it appears that Clarke and Ayres were thinking deterministically without even realizing it.

Don’t believe everything you read in the (scientific) papers


A journalist writes in with a question:

This study on [sexy topic] is getting a lot of attention, and I wanted to see if you had a few minutes to look it over for me . . .

Basically, I am somewhat skeptical of [sexy subject area] explanations of complex behavior, and in this case I’m wondering whether there’s a case to be made that the researchers are taking a not-too-strong interaction effect and weaving a compelling story about it. . . .

Anyway, obviously not expecting you to chime in on [general subject area], but thoughts on whether the basic stats are sound here would be appreciated.

The paper was attached, and I looked at it.

My reply to the journalist:

I’ll believe it when I see a pre-registered replication. And not before.

P.S. Just to be clear, I wouldn’t give that response to every paper that is sent to me. I’m convinced by lots and lots of empirical research that hasn’t been replicated, pre-registered or otherwise. But in this case, where there’s a whole heap of possible comparisons, all of which are consistent with the general theory being expressed, I’m definitely concerned that we’re in “power = 0.06” territory.
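A quick simulation of what “power = 0.06” implies (illustrative numbers of my own: a true effect of 0.3 standard errors gives power of about 0.06 for a two-sided 5% test). Conditional on reaching statistical significance, the estimate wildly exaggerates the true effect and often has the wrong sign:

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.3, 1.0           # power is about 0.06 in this setup
est = rng.normal(true_effect, se, size=1_000_000)
signif = np.abs(est) > 1.96 * se     # two-sided 5% test

print(round(signif.mean(), 2))       # 0.06: the power
exaggeration = np.abs(est[signif]).mean() / true_effect
print(round(exaggeration, 1))        # significant estimates overstate the effect ~8x
wrong_sign = (est[signif] < 0).mean()
print(round(wrong_sign, 2))          # roughly 0.2: about a fifth point the wrong way
```

In other words, at power 0.06, the “statistically significant” results are exactly the ones you shouldn’t trust.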

Bayesian Cognitive Modeling Models Ported to Stan

Hats off to Martin Šmíra, who has finished porting the models from Michael Lee and Eric-Jan Wagenmakers’ book Bayesian Cognitive Modeling to Stan.

Here they are:

Martin managed to port 54 of the 57 models in the book and verified that the Stan code got the same answers as BUGS and JAGS.