About a zillion people pointed me to yesterday’s xkcd cartoon.

I have the same problem with Bayes factors, for example this:

[Screenshot: a Bayes factor interpretation scale, with numerical thresholds and verbal labels]

and this:

[Screenshot: the Bayes factor interpretation table from Wikipedia]

(which I copied from Wikipedia, except that, unlike you-know-who, I didn’t change the n’s to d’s and remove the superscripting).

Either way, I don’t buy the numbers, and I certainly don’t buy the words that go with them.

I do admit, though, to using the phrase “statistically significant.” It doesn’t mean so much, but, within statistics, everyone knows what it means, so it’s convenient jargon.

Crowdsourcing data analysis: Do soccer referees give more red cards to dark skin toned players?

Raphael Silberzahn Eric Luis Uhlmann Dan Martin Pasquale Anselmi Frederik Aust Eli Christopher Awtrey Štěpán Bahník Feng Bai Colin Bannard Evelina Bonnier Rickard Carlsson Felix Cheung Garret Christensen Russ Clay Maureen A. Craig Anna Dalla Rosa Lammertjan Dam Mathew H. Evans Ismael Flores Cervantes Nathan Fong Monica Gamez-Djokic Andreas Glenz Shauna Gordon-McKeon Tim Heaton Karin Hederos Eriksson Moritz Heene Alicia Hofelich Mohr Fabia Högden Kent Hui Magnus Johannesson Jonathan Kalodimos Erikson Kaszubowski Deanna Kennedy Ryan Lei Thomas Andrew Lindsay Silvia Liverani Christopher Madan Daniel Molden Eric Molleman Richard D. Morey Laetitia Mulder Bernard A. Nijstad Bryson Pope Nolan Pope Jason M. Prenoveau Floor Rink Egidio Robusto Hadiya Roderique Anna Sandberg Elmar Schlueter Felix S Martin Sherman S. Amy Sommer Kristin Lee Sotak Seth Spain Christoph Spörlein Tom Stafford Luca Stefanutti Susanne Täuber Johannes Ullrich Michelangelo Vianello Eric-Jan Wagenmakers Maciej Witkowiak SangSuk Yoon and Brian A. Nosek write:

Twenty-nine teams involving 61 analysts used the same data set to address the same research questions: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players and whether this relation is moderated by measures of explicit and implicit bias in the referees’ country of origin. Analytic approaches varied widely across teams. For the main research question, estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. Twenty teams (69%) found a significant positive effect and nine teams (31%) observed a nonsignificant relationship. The causal relationship however remains unclear. No team found a significant moderation between measures of bias of referees’ country of origin and red card sanctionings of dark skin toned players. Crowdsourcing data analysis highlights the contingency of results on choices of analytic strategy, and increases identification of bias and error in data and analysis. Crowdsourcing analytics represents a new way of doing science; a data set is made publicly available and scientists at first analyze separately and then work together to reach a conclusion while making subjectivity and ambiguity transparent.
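
To make the “odds ratio units” concrete: an odds ratio multiplies the odds, not the probability. Here is a small arithmetic sketch; the baseline red-card rate below is a made-up number for illustration, not a figure from the study.

```python
# Illustrative arithmetic only: the baseline red-card rate here is invented.
def apply_odds_ratio(p_baseline, odds_ratio):
    """Convert a baseline probability and an odds ratio into the implied probability."""
    odds = p_baseline / (1 - p_baseline)   # baseline odds
    new_odds = odds * odds_ratio           # odds after applying the ratio
    return new_odds / (1 + new_odds)       # back to a probability

p0 = 0.005  # hypothetical red-card rate per player-game for light skin toned players
for OR in (0.89, 1.31, 2.93):              # range of estimates reported across teams
    print(f"odds ratio {OR}: implied rate {apply_odds_ratio(p0, OR):.4f}")
```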

“It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

I recently bumped into this 2013 paper by Christian Robert and myself, “‘Not Only Defended But Also Applied’: The Perceived Absurdity of Bayesian Inference,” which begins:

Younger readers of this journal may not be fully aware of the passionate battles over Bayesian inference among statisticians in the last half of the twentieth century. During this period, the missionary zeal of many Bayesians was matched, in the other direction, by a view among some theoreticians that Bayesian methods are absurd—not merely misguided but obviously wrong in principle. Such anti-Bayesianism could hardly be maintained in the present era, given the many recent practical successes of Bayesian methods. But by examining the historical background of these beliefs, we may gain some insight into the statistical debates of today. . . .

The whole article is just great. I love reading my old stuff!

Also we were lucky to get several thoughtful discussions:

“Bayesian Inference: The Rodney Dangerfield of Statistics?” — Steve Stigler

“Bayesian Ideas Reemerged in the 1950s” — Steve Fienberg

“Bayesian Statistics in the Twenty First Century” — Wes Johnson

“Bayesian Methods: Applied? Yes. Philosophical Defense? In Flux” — Deborah Mayo

And our rejoinder, “The Anti-Bayesian Moment and Its Passing.”

Good stuff.

“The Statistical Crisis in Science”: My talk this Thurs at the Harvard psychology department

Noon Thursday, January 29, 2015, in William James Hall, room 765:

The Statistical Crisis in Science

Andrew Gelman, Dept of Statistics and Dept of Political Science, Columbia University

Top journals in psychology routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or embodied cognition, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

Here are the slides from the last time I gave this talk, and here are some relevant articles:

[2014] Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641–651. (Andrew Gelman and John Carlin)

[2014] The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management. (Andrew Gelman)

[2013] It’s too hard to publish criticisms and obtain data for replication. Chance 26 (3), 49–52. (Andrew Gelman)

[2012] P-values and statistical practice. Epidemiology. (Andrew Gelman)
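
For readers who haven’t seen Type S and Type M errors before, here is a minimal simulation sketch of the idea from the Gelman and Carlin (2014) paper listed above, under a normal approximation; the true effect size and standard error are arbitrary illustrative values, not numbers from any of these papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def type_s_m(true_effect, se, n_sims=1_000_000, z_crit=1.96):
    """Simulate estimates ~ Normal(true_effect, se) and summarize what happens
    when we look only at the statistically significant ones."""
    est = rng.normal(true_effect, se, n_sims)
    significant = np.abs(est) > z_crit * se
    power = significant.mean()
    type_s = (np.sign(est[significant]) != np.sign(true_effect)).mean()
    exaggeration = np.mean(np.abs(est[significant])) / abs(true_effect)
    return power, type_s, exaggeration

# Hypothetical numbers: a small true effect measured with a noisy design.
power, type_s, exaggeration = type_s_m(true_effect=0.1, se=0.3)
print(f"power = {power:.2f}, Type S error rate = {type_s:.3f}, "
      f"exaggeration ratio (Type M) = {exaggeration:.1f}")
```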

The (hypothetical) phase diagram of a statistical or computational method

[Sketch: hypothetical phase diagram showing the regions, over two dimensions, where each of methods A, B, and C performs best]

So here’s the deal. You have a new idea, call it method C, and you try it out on problems X, Y, and Z and it works well—it destroys the existing methods A and B. And then you publish a paper with the pithy title, Method C Wins. And, hey, since we’re fantasizing here anyway, let’s say you want to publish the paper in PPNAS.

But reviewers will—and should—have some suspicions. How great can your new method really be? Can it really be that methods A and B, which are so popular, have nothing to offer anymore?

Instead give a sense of the bounds of your method. Under what conditions does it win, and under what conditions does it not work so well?

In the graph above, “Dimension 1” and “Dimension 2” can be anything; they could be sample size and number of parameters, or computing time and storage cost, or bias and sampling error, whatever. The point is that a method can be applied under varying conditions. And, if a method is great, what that really means is that it works well under a wide range of conditions.

So, make that phase diagram. Even if you don’t actually draw the graph or even explicitly construct a definition of “best,” you can keep in mind the idea of exploring the limitations of your method, coming up with places where it doesn’t perform so well.
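
As a purely hypothetical worked example of this advice (the methods, dimensions, and “best” criterion below are my own stand-ins, not anything from the post), here is a sketch that maps out such a phase diagram by simulation: three ordinary location estimators compete over a grid of sample sizes and outlier fractions, and the winner in each cell is whichever has the lowest mean squared error.

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(estimator, n, outlier_frac, true_value=0.0, n_sims=2000):
    """Mean squared error of an estimator of the center of a contaminated normal."""
    errs = []
    for _ in range(n_sims):
        x = rng.normal(true_value, 1.0, n)
        n_out = int(round(outlier_frac * n))
        if n_out:
            x[:n_out] += rng.normal(0.0, 10.0, n_out)   # heavy-tailed contamination
        errs.append((estimator(x) - true_value) ** 2)
    return np.mean(errs)

methods = {
    "A: mean": np.mean,
    "B: median": np.median,
    "C: 20% trimmed mean": lambda x: np.mean(
        np.sort(x)[int(0.2 * len(x)): len(x) - int(0.2 * len(x))]),
}

# Dimension 1: sample size; Dimension 2: outlier fraction.
for n in (10, 50, 200):
    cells = []
    for frac in (0.0, 0.1, 0.3):
        best = min(methods, key=lambda name: mse(methods[name], n, frac))
        cells.append(f"n={n:>3}, outliers={frac:.1f}: {best}")
    print(" | ".join(cells))
```

The printed grid is a crude phase diagram: each cell records which method wins under those conditions, which is exactly the kind of map of a method’s limits the post is asking for.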

On deck this week

Mon: The (hypothetical) phase diagram of a statistical or computational method

Tues: “It is perhaps merely an accident of history that skeptics and subjectivists alike strain on the gnat of the prior distribution while swallowing the camel that is the likelihood”

Wed: Six quick tips to improve your regression modeling

Thurs: “Another bad chart for you to criticize”

Fri: Cognitive vs. behavioral in psychology, economics, and political science

Sat: Economics/sociology phrase book

Sun: Oh, it’s so frustrating when you’re trying to help someone out, and then you realize you’re dealing with a snake

Tell me what you don’t know

We’ll ask an expert, or even a student, to “tell me what you know” about some topic. But now I’m thinking it makes more sense to ask people to tell us what they don’t know.

Why? Consider your understanding of a particular topic to be divided into three parts:
1. What you know.
2. What you don’t know.
3. What you don’t know you don’t know.

If you ask someone about 1, you get some sense of the boundary between 1 and 2.

But if you ask someone about 2, you implicitly get a lot of 1, you get a sense of the boundary between 1 and 2, and you get a sense of the boundary between 2 and 3.

As my very rational friend Ginger says: More information is good.

Postdoc opportunity here, with us (Jennifer Hill, Marc Scott, and me)! On quantitative education research!!

Hop the Q-TRAIN: that is, the Quantitative Training Program, a postdoctoral research program supervised by Jennifer Hill, Marc Scott, and myself, and funded by the Institute for Education Sciences.

As many of you are aware, education research is both important and challenging. And, on the technical level, we’re working on problems in Bayesian inference, multilevel modeling, survey research, and causal inference.

There are various ways that you can contribute as a postdoc: You can have a PhD in psychometrics or education research, and this is your chance to go in depth with statistical inference and computation, or maybe you can do all sorts of Bayesian computation and you’d like to move into education research. We’re looking for top people to join our team.

If you’re interested, send me an email with a letter describing your qualifications and reason for applying, a C.V., and at least one article you’ve written, and have three letters of recommendation sent to me. All three of us (Jennifer, Marc, and I) will evaluate the applications.

We have openings for two 2-year postdocs. As per federal government regulations, candidates must be United States citizens or permanent residents.

“What then should we teach about hypothesis testing?”

Someone who wishes to remain anonymous writes in:

Last week, I was looking forward to a blog post titled “Why continue to teach and use hypothesis testing?” I presume that this scheduled post merely became preempted by more timely posts. But I am still interested in reading the exchange that will follow.

My feeling is that we might have strong reservations about the utility of NHST [null hypothesis significance testing], but realize that they aren’t going away anytime soon. So it is important for students to understand what information other folks are trying to convey when they report their p-values, even if we would like to encourage them to use other frameworks (e.g. a fully Bayesian decision theoretic approach) in their own decision making.

So I guess the next question is, what then should we teach about hypothesis testing? What proportion of the time in a one semester upper level course in Mathematical Statistics should be spent on the theory and how much should be spent on the nuance and warnings about misapplication of the theory? These are questions I’d be interested to hear opinions about from you and your thoughtful readership.

A related question I have is on the “garden of forking paths” or “researcher degrees of freedom”. In applied research, do you think that “tainted” p-values are the norm, and that editors, referees, and readers basically assume some level of impurity of reported p-values?

I wonder, because it seems, if applied statistics textbooks are any guide, that the first recommendation in a data analysis seems to often be: plot your data. And I suspect that many folks might do this *before* settling in on the model they are going to fit. e.g. If they see nonlinearity, they will then consider a transformation that they wouldn’t have considered before. So whether they make the transformation or not, they might have, thus affecting the interpretability of p-values and whatnot. Perhaps I am being an extremist. Pre-registration, replication studies, or simply splitting a data set into training and testing sets may solve this problem, of course.

So to tie these two questions together, shouldn’t our textbooks do a better job in this regard, perhaps in making clear a distinction between two types of statistical analysis: a data analysis, which is intended to elicit the questions and perhaps build a model, and a confirmatory analysis which is the “pure” estimation and prediction from a pre-registered model, from which a p-value might retain some of its true meaning?

My reply: I’ve been thinking about this a lot recently because Eric Loken, Ben Goodrich, and I have been designing an introductory statistics course, and we have to address these issues. One way I’ve been thinking about it is that statistical significance is more of a negative than a positive property:

Traditionally we say: If we find statistical significance, we’ve learned something, but if a comparison is not statistically significant, we can’t say much. (We can “reject” but not “accept” a hypothesis.)

But I’d like to flip it around and say: If we see something statistically significant (in a non-preregistered study), we can’t say much, because garden of forking paths. But if a comparison is not statistically significant, we’ve learned that the noise is too large to distinguish any signal, and that can be important.
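
As a toy illustration of the forking-paths point (my own example, not anything from the correspondent or the course): the data below are pure noise, but the analyst is allowed a few data-dependent choices and reports the best-looking p-value, so the share of datasets yielding at least one p < 0.05 comes out well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def best_p_value(n=50):
    """Pure-noise two-group comparison, with a few post hoc analysis choices."""
    y = rng.normal(size=n)
    group = rng.integers(0, 2, size=n)   # random 'treatment' labels
    age = rng.integers(18, 70, size=n)   # an irrelevant covariate
    candidates = []
    # Choice 1: plain comparison on all the data.
    candidates.append(stats.ttest_ind(y[group == 0], y[group == 1]).pvalue)
    # Choice 2: drop 'outliers' beyond 2 sd.
    keep = np.abs(y) < 2
    candidates.append(stats.ttest_ind(y[keep & (group == 0)],
                                      y[keep & (group == 1)]).pvalue)
    # Choices 3 and 4: subgroup analyses by age.
    for subset in (age < 40, age >= 40):
        candidates.append(stats.ttest_ind(y[subset & (group == 0)],
                                          y[subset & (group == 1)]).pvalue)
    return min(candidates)

sims = np.array([best_p_value() for _ in range(2000)])
print("share of pure-noise datasets with at least one p < 0.05:",
      (sims < 0.05).mean())
```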

What’s the point of the margin of error?

So . . . the scheduled debate on using margin of error with non-probability panels never happened. We got it started but there was some problem with the webinar software and nobody but the participants could hear anything.

The 5 minutes of conversation we did have was pretty good, though. I was impressed. The webinar was billed as a “debate” which didn’t make me happy—I wasn’t looking forward to hearing a bunch of pious nonsense about probability sampling and statistical theory—but the actual discussion was very reasonable.

The first thing that came up was, Are everyday practitioners in market research concerned about the margins of error for non-probability samples? The consensus among the market researchers on the panel was: No, users pretty much just take samples and margins of error as they are, without worrying about where the sample came from or how it was collected.

I pointed out that if you’re concerned about non-probability samples and if you don’t trust the margin of error for non-probability samples, then you shouldn’t trust the margin of error for any real sample from a human population, given the well-known problems of nonavailability and nonresponse. When the nonresponse rate is 91%, any sample is a convenience sample.

Sampling and adjustment

The larger point is that just about any survey requires two steps:
1. Sampling.
2. Adjustment.

There are extreme settings where either 1 or 2 alone is enough.

If you have a true probability sample from a perfect sampling frame, with 100% availability and 100% response, and if your sampling probabilities don’t vary much, and if your data are dense relative to the questions you’re asking, then you can get everything you need—your estimate and your margin of error—from the sample, with no adjustment needed.

From the other direction, if you have a model for the underlying data that you really believe, and if you have a sample with no selection problems, or if you have a selection model that you really believe (which I assume can happen in some physical settings, maybe something like sampling fish from a lake), then you can take your data and adjust, with no concerns about random sampling. Indeed, this is standard in non-sampling areas of statistics, where people just take data and run regressions and that’s it.

In general, though, it makes sense to be serious about both sampling and adjustment, to sample as close to randomly as you can, and to adjust as well as you can.
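
Here is a minimal sketch of what the adjustment step can look like, using simple poststratification weights; the cells, respondent counts, and population shares below are invented for illustration and are not from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, summarized by adjustment cell (e.g., age x education).
sample = pd.DataFrame({
    "cell": ["young_hs", "young_coll", "old_hs", "old_coll"],
    "n_respondents": [100, 300, 150, 450],      # who actually responded
    "mean_outcome": [0.42, 0.55, 0.38, 0.61],   # cell means of the survey outcome
})

# Hypothetical population shares for the same cells (e.g., from a census).
population_share = {"young_hs": 0.25, "young_coll": 0.20,
                    "old_hs": 0.30, "old_coll": 0.25}

sample["pop_share"] = sample["cell"].map(population_share)
unadjusted = np.average(sample["mean_outcome"], weights=sample["n_respondents"])
adjusted = np.average(sample["mean_outcome"], weights=sample["pop_share"])

print(f"unadjusted (sample-weighted) estimate: {unadjusted:.3f}")
print(f"poststratified estimate:               {adjusted:.3f}")
```

The same logic, with richer cells and a regression model for the cell estimates, is what multilevel regression and poststratification builds on.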

Remember: just about no sample of humans is really a probability sample or even close to a probability sample, and just about no regression model applied to humans is correct or even close to correct. So we have to worry about sampling, and we have to worry about adjustment. Sorry, Michael Link, but that’s just the way things are. No “grounding in theory” is going to save you.

What’s the point of the margin of error?

Where, then, does the margin of error come in? (Note to outsiders: to the best of my knowledge, “margin of error” is not a precisely-defined term, but I think it is usually taken to be 2 standard errors.)
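
For concreteness, here is the standard back-of-the-envelope version of that “2 standard errors” margin of error for a proportion estimated from a simple random sample (the poll numbers are made up):

```python
import math

def margin_of_error(p_hat, n, z=2.0):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 52% support in a sample of 1000 respondents.
print(f"{100 * margin_of_error(0.52, 1000):.1f} percentage points")
```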

What I said, during our abbreviated 5-minute panel discussion, is that, in practice, we often don’t need the margin of error at all. Anything worth doing is worth doing multiple times, and once you have multiple estimates from different samples, you can look at the variation between them to get an external measure of variation that is more relevant than an internal margin of error, in any case.

The margin of error is an approximate lower bound on the expected error of an estimate from a sample. Such a lower bound can be useful, but in most cases I’d get more out of the between-survey variation (which includes sampling error as well as variation over time, variation between sampling methods, and variation in nonsampling error).

Where the margin of error often is useful is in design, in deciding how large a sample you need in order to estimate a quantity of interest to some desired precision.
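
And for that design use, the same approximation can be inverted to give a rough required sample size for a target margin of error (again a sketch under the simple-random-sampling assumption, with made-up targets):

```python
import math

def required_n(target_moe, p_hat=0.5, z=2.0):
    """Sample size needed so that z * sqrt(p(1-p)/n) is at most target_moe."""
    return math.ceil(z**2 * p_hat * (1 - p_hat) / target_moe**2)

for moe in (0.05, 0.03, 0.01):
    print(f"target +/-{100 * moe:.0f} points -> n = {required_n(moe)}")
```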

In an email discussion afterward, John Bremer pointed out that in tracking studies you are interested particularly in measuring change, and in that case it might not be so easy to get an external measure of variance. Indeed, if you only measure something at time 1 and time 2, then the margin of error is indeed relevant to assessing the evidence. To get an external measure of uncertainty and variation you need a longer time series. I just wanted to emphasize the point that the margin of error is a lower bound and, as such, can be useful if it is interpreted in that way. Even if sampling is perfect probability sampling and there is 100% response, the margin of error is still an underestimate because the sample is only giving a snapshot, and attitudes change over time.