
Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm)

Statistical Challenges of Survey Sampling and Big Data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.

Criminology corner: Type M error might explain Weisburd’s Paradox

[silly cartoon found by googling *cat burglar*]

Torbjørn Skardhamar, Mikko Aaltonen, and I wrote this article to appear in the Journal of Quantitative Criminology:

Simple calculations seem to show that larger studies should have higher statistical power, but empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox” and has been attributed by Weisburd, Petrosino, and Mason (1993) to a difficulty in maintaining quality control as studies get larger, and attributed by Nelson, Wooditch, and Dario (2014) to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue against the necessity of both these explanations, instead suggesting that the apparent Weisburd paradox might be explainable as an artifact of systematic overestimation inherent in post-hoc power calculations, a bias that is large with small N. Speaking more generally, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.

I’d never heard of Weisburd’s paradox before writing this article. What happened is that the journal editors contacted me suggesting the topic, I then read some of the literature and wrote my article, then some other journal editors didn’t think it was clear enough so we found a couple of criminologists to coauthor the paper and add some context, eventually producing the final version linked here. I hope it’s helpful to researchers in that field and more generally. I expect that similar patterns hold with published data in other social science fields and in medical research too.
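To see the mechanism, here's a quick simulation (my own sketch, not from the paper): assume a hypothetical true effect of 0.1 on a standardized scale, "publish" only the statistically significant estimates, and then compute post-hoc power from those published estimates, the way the meta-analyses in question do.

```python
import math
import random

random.seed(1)
TRUE_EFFECT = 0.1  # hypothetical small effect on a standardized scale

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(effect, se):
    """Power of a two-sided z-test at alpha = 0.05."""
    z = effect / se
    return norm_cdf(z - 1.96) + norm_cdf(-z - 1.96)

results = {}
for n in (50, 5000):
    se = 1.0 / math.sqrt(n)  # standard error shrinks with sample size
    ests = [random.gauss(TRUE_EFFECT, se) for _ in range(50_000)]
    # "Publication": keep only the statistically significant estimates.
    sig = [e for e in ests if abs(e) > 1.96 * se]
    # Post-hoc power, computed (wrongly) from the published estimates.
    posthoc = sum(power(abs(e), se) for e in sig) / len(sig)
    results[n] = (power(TRUE_EFFECT, se), posthoc)
    print(f"n={n}: true power {results[n][0]:.2f}, "
          f"post-hoc power {results[n][1]:.2f}")
```

With n = 50 the true power is about 0.11 but the post-hoc estimate comes out several times higher, because the significant estimates are necessarily inflated (type M error); with n = 5000 the two nearly agree. That is exactly the pattern that would flatten the apparent correlation between sample size and estimated power.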

PhD student fellowship opportunity! in Belgium! to work with us! on the multiverse and other projects on improving the reproducibility of psychological research!!!

[image of Jip and Janneke dancing with a cat]

Wolf Vanpaemel and Francis Tuerlinckx write:

We at the Quantitative Psychology and Individual Differences research group, KU Leuven, Belgium, are looking for a PhD candidate. The goal of the PhD research is to develop and apply novel methodologies to increase the reproducibility of psychological science. More information can be found on the job website or by contacting us. The deadline for application is Monday, June 26, 2017.

One of the themes a successful candidate may work on is the further development of the multiverse. I expect to be an active collaborator in this work.

So please apply to this one. We’d like to get the best possible person to be working on this exciting project.

Why I’m not participating in the Transparent Psi Project

I received the following email from psychology researcher Zoltan Kekecs:

I would like to ask you to participate in the establishment of the expert consensus design of a large-scale, fully transparent replication of Bem’s (2011) ‘Feeling the future’ Experiment 1. Our initiative is called the ‘Transparent Psi Project’. […] Our aim is to develop a consensus design that is mutually acceptable for both psi proponent and mainstream researchers, containing clear criteria for credibility.

I replied:

Thanks for the invitation. I am not so interested in this project because I think that all the preregistration in the world won’t solve the problem of small effect sizes and poor measurements. It is my impression from Bem’s work and others’ that the field of psi is plagued by noisy measurements and poorly specified theories. Sure, preregistration etc. would stop many of the problems; in particular, there’s no way that Bem would’ve seen 9 out of 9 statistically significant p-values, or whatever that was. But I can’t in good conscience recommend the spending of effort in this way. I think any serious work in this area would have to go beyond the phenomenological approach and perform more direct measurements, as for example here. I’ve not actually read the paper linked there, so this may be a bad example, but the point is that one could possibly study such things scientifically with a physical model of the process. To just keep taking Bem-style measurements, though, I think that’s hopeless: it’s the kangaroo problem. Better to preregister than not, but better still not to waste time on this or similarly hopeless problems (studying sex ratios in samples of size 3000, estimating correlations of monthly cycle on political attitudes using between-person comparisons, power pose, etc.). I recognize that some of these ideas, ESP included, had some legitimate a priori plausibility, but, at this point, a Bem-style experiment seems like a shot in the dark. And, of course, even with preregistration, there’s a 5% chance you’ll see something statistically significant just by chance, leading to further confusion. In summary, preregistration and consensus help with the incentives, but all the incentives in the world are no substitute for good measurements. (See the discussion of “in many cases we are loath to recommend pre-registered replication” here.)

Kekecs wrote back:

Thank you for your feedback. We fully realize the problem posed by small effect sizes. However, this problem in itself can be solved simply by throwing a larger sample at it. In fact, based on our simulations we plan to collect 14,000-60,000 data points (700-3,000 participants) using Bayesian analysis and optional stopping, aiming to reach a Bayes factor threshold of 60 or 1/60. Our simulations show that using these parameters we have only a p = 0.0004 false positive chance, so it is highly unlikely that we would accidentally generate more confusion in the field just by conducting the replication. On the contrary, by doing our study we will effectively more than double the amount of total data accumulated so far by Bem’s and others’ studies using this paradigm, which should bring clarity to the field by introducing good-quality, credible data.

You might be right, though, that the measurement itself is faulty, and that we cannot expect precognition to work in an environmentally invalid situation like this. But in reality we don’t have any information on how precognition should work if it really does exist, so I am not sure what would be a better way of measuring it than seeing how effective people are at predicting future events.

Our main goal here is not really to see whether precognition exists or not. The ultimate aim of our efforts is to do a proof-of-concept study where we will see whether it is possible to come to a consensus on criteria of acceptability and credibility in a field this divided, and to come up with ways in which we can negate all possibilities of questionable research practices. This approach can then be transferred to other fields as well.
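Kekecs's false-positive claim is the kind of thing that can be checked with a quick simulation. The sketch below is mine, not the project's analysis plan: I assume a uniform prior on the hit rate, checks after every batch of 100 trials, a maximum of 4,000 trials, and stopping at a Bayes factor of 60 or 1/60; their actual priors, batch sizes, and stopping rule surely differ.

```python
import math
import random

random.seed(2)
LOG_T = math.log(60.0)  # Bayes factor threshold of 60, as in the quote

def log_bf10(k, n):
    """Log Bayes factor for H1: hit rate ~ uniform(0, 1) vs H0: pure chance.
    With a Beta(1,1) prior, the marginal probability of k hits in n trials
    under H1 is exactly 1/(n+1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return -math.log(n + 1) - (log_choose + n * math.log(0.5))

def run_study(max_n=4000, batch=100):
    """Simulate one study under the null, checking the BF after each batch."""
    k = 0
    for n in range(batch, max_n + 1, batch):
        k += sum(1 for _ in range(batch) if random.random() < 0.5)
        lbf = log_bf10(k, n)
        if lbf > LOG_T:
            return "psi"      # a false positive: the null generated the data
        if lbf < -LOG_T:
            return "no psi"
    return "inconclusive"

runs = [run_study() for _ in range(2000)]
false_positive_rate = runs.count("psi") / len(runs)
print(f"false positives under the null: {false_positive_rate:.4f}")
```

Under the null, the Bayes factor is a martingale, so the chance of ever crossing 60 with optional stopping is bounded by 1/60, and in this finite design it comes out well below that. Note also that with this crude uniform prior the 1/60 threshold in the other direction is barely reachable within 4,000 trials, a reminder that the operating characteristics depend heavily on the prior and the stopping rule.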

I then responded:

I still think it’s hopeless. The problem (which I’ll say using generic units as I’m not familiar with the ESP experiment) is: suppose you have a huge sample size and can detect an effect of 0.003 (on some scale) with standard error 0.001. Statistically significant, preregistered, the whole deal. Fine. But then you could very well see an effect of -0.002 with different people, in a different setting. And -0.003 somewhere else. And 0.001 somewhere else. Etc. You’re talking about effects that are indistinguishable given various sources of leakage in the experiment.

I support your general goal but I recommend you choose a more promising topic than ESP or power pose or various other topics that get talked about so much.

Kekecs replied:

We are already committed to follow through with this particular setting. But I agree with you that our approach can be easily transferred to the research of other effects and we fully intend to do that.

If you put it that way, your question is all about construct validity: whether we can detect the effect that we really want to detect, or whether there are other confounds that bias the measurement. In this particular experimental setting, which is as simple as can be (basically, people are guessing the outcomes of future coin flips), the types of bias that we can expect are more related to questionable research practices (QRPs) than anything else. The only way other types of bias, such as personal differences in ability (sampling bias), participant expectancy, demand characteristics, etc., can have an effect is if there is truly an anomalous effect. For example, if we detected an effect of 0.003 with 0.001 SE only because we accidentally sampled people with high psi abilities, our conclusion that there is a psi effect would still be true (although our effect size estimate would be slightly off).

That is why in this project we are focusing mainly on negating all possibilities of QRPs and on full transparency. I am not sure what other types of leakage we can have in this particular experiment if we have addressed all possible QRPs. Would you care to elaborate?

I responded:

Just in answer to that last question: I’m not sure what other types of leakage might exist—it’s my impression that Bem’s experiments had various problems, so I guess it depends how exact a replication you’re talking about. My real point, though, is if we think ESP exists at all, then an effect that’s +0.003 on Monday and -0.002 on Tuesday and +0.001 on Wednesday probably isn’t so interesting. This becomes clearer if we move the domain away from possible null phenomena such as ESP or homeopathy, to things like social priming, which presumably has some effect, but which varies so much by person and by context to be generally unpredictable and indistinguishable from noise. I don’t think ESP is such a good model for psychology research because it’s one of the few things people study that really could be zero.

And then Kekecs closed out the discussion:

In response: I find doing this work in the field of ESP interesting exactly because the effect could potentially be zero. Positive findings have an overwhelming dominance in both the psi literature and the social sciences literature in general. In the case of most other social science research, it is a theoretical (but unrealistic) possibility that researchers just get lucky all the time and always ask the right questions, which is why they are so effective in finding positive effects. Again, this obviously cannot be true for the entirety of the literature, but for each topic studied individually it can be quite probable that there is an effect, if ever so small, which blurs the picture about publication bias and other types of bias in the literature. However, it may be that there is no ESP effect at all. In that case, we would have a field where the effect of bias in research can be studied in its purest form.

From another perspective, precognition in particular is a perfect research topic exactly because these designs by their nature are very well protected from the usual threats to internal validity, at least in the positive direction. It is hard to see what could make a person perform better at predicting the outcome of a state-of-the-art random number generator if there is no psi effect. Bias can always be introduced by different questionable research practices (QRPs), but if we are able to design a study completely immune to QRPs, there is no real possibility of bias toward type I error. Of course, if the effect really exists, all the usual threats to validity can have an influence (for example, it is possible that people get “psi fatigue” if they perform a lot of trials, or that events, contextual features, or even expectancy can have an effect on performance), but we cannot make a type I error in that case, because the effect exists; we can only make errors in estimating the size of the effect, or a type II error.

So understanding what is underlying the dominance of positive effects in ESP research is very important. If there is no effect, psi literature can serve as a case study for bias in its purest form, which can help us understand it in other research fields. On the other hand, if we find an effect when all QRPs are controlled for, we may need to really rethink our current paradigm.

I continue to think that the study of ESP is irrelevant for psychology, both for substantive reasons—there is no serious underlying theory or clear evidence for ESP, it’s all just hope and intuition—and for methodological reasons, in that zero is a real possibility. In contrast, even silly topics such as power pose and embodied cognition seem to me to have some relevance to psychology and also involve the real challenge that there are no zeroes. Standing in an unusual position for two minutes will have some effect on your thinking and behavior; the debate is what are the consistent effects, if any. That’s my take, anyway; but I wanted to share Kekecs’s view too, given all the effort he’s putting into this project.

Financial anomalies are contingent on being unknown

Jonathan Falk points us to this article by Kewei Hou, Chen Xue, and Lu Zhang, who write:

In retrospect, the anomalies literature is a prime target for p-hacking. First, for decades the literature has been purely empirical in nature, with little theoretical guidance. Second, with trillions of dollars invested in anomalies-based strategies in the U.S. alone, the financial interest is overwhelming. Third, more significant results make a bigger splash and are more likely to lead to publications as well as promotion, tenure, and prestige in academia. As a result, armies of academics and practitioners engage in searching for anomalies, and the anomalies literature is most likely one of the biggest areas in finance and accounting. Finally, as we explain later, empiricists have much flexibility in sample criteria, variable definitions, and empirical methods, which are all tools of p-hacking in chasing statistical significance.

Falk writes:

A weakness in this study is that the use of a common data period obscures the fact that financial anomalies are contingent on being unknown: known (true) anomalies will be arbitraged away so that they no longer exist. Their methodology continues to estimate many of these anomalies after the results of the studies were public knowledge and heavily scrutinized. This should attenuate the results. (It would be interesting to see if the results weakened the earlier the study was published. On a low-hanging fruit theory, it should be just the opposite.) It’s as if Power Pose worked until Amy Cuddy wrote about it, at which point everyone wised up and the effect went away. Effects like that are really hard to replicate.

Falk’s comment, about financial anomalies being contingent on being unknown, reminds me of something: In finance (so I’m told), when someone has a great idea, they keep it secret and try to milk all the advantage out of it that they can. This also happens in some fields of science: we’ve all heard of bio labs that refuse to share their data or their experimental techniques because they want to squeeze out a couple more papers in Nature and Cell. Given all the government funding involved, that’s not cool, but it’s how it goes. But in statistics, when we think we have a good idea, we put it out there for free; we scream about it and get angry that other people aren’t using our wonderful methods and our amazing free software. Funny, that.

P.S. For an image, I went and googled *cat anomaly*. I recommend you don’t do that. The pictures were really disturbing to me.

UK election summary

The Conservative party, led by Theresa May, defeated the Labour party, led by Jeremy Corbyn.

The Conservative party got 42% of the vote, Labour got 40%, and all the other parties received 18% between them. The Conservatives ended up with 51.5% of the two-party vote, just a bit more than Hillary Clinton’s share of the two-party vote last November.

In the previous U.K. general election, two years ago, the Conservatives beat Labour 37%-30%; that’s 55% of the two-party vote.

The time before, the Conservatives received 36%, compared to Labour’s 29%. The Conservatives again had received 55% of the two-party vote.
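Just to show the arithmetic: the two-party share is one party’s vote divided by the two parties’ combined vote. The one-decimal figures below (42.4/40.0 for the UK in 2017, 36.9/30.4 for 2015, 48.2/46.1 for Clinton-Trump) are taken from the official totals, which is also why the rounded 42/40 above gives 51.2 rather than the 51.5 quoted.

```python
def two_party_share(a, b):
    """Share of the two-party vote going to the first party, in percent."""
    return 100.0 * a / (a + b)

print(f"UK 2017: {two_party_share(42.4, 40.0):.1f}% Conservative")  # 51.5
print(f"UK 2015: {two_party_share(36.9, 30.4):.1f}% Conservative")  # 54.8
print(f"US 2016: {two_party_share(48.2, 46.1):.1f}% Clinton")       # 51.1
```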

As with the Clinton-Trump presidential election and the “Brexit” election in the U.K. last year, the estimates from the polls turned out to give pretty good forecasts.

The predictions were not perfect—the 318-262 split in parliament was not quite the 302-269 that was predicted, and the estimated 42-38 vote split didn’t quite anticipate the 43.5-41.0 split that actually happened (those latter figures, for Great Britain only, come from the YouGov post-election summary). And the accuracy of the seat forecast has to be attributed in part to luck, given the wide predictive uncertainty bounds (based on pre-election polls, the Conservatives were forecast to win between 269 and 334 seats). The predictions were done using Mister P and Stan.

The Brexit and Clinton-Trump poll forecasts looked bad at the time because they got the outcome wrong, but as forecasts of public opinion they were solid, only off by a percentage point or two in each case. In general we’d expect polls to do better in two-party races or, more generally, in elections with two clear options, because then there are fewer reasons for prospective voters to change their opinions. In most parts of the U.K., this 2017 election was a two-party affair, hence it should be no surprise that the final polls were accurate (after suitable adjustment for nonresponse), even if, again, there was some luck that they were as close as shown in these graphs by Jack Blumenau.

P.S. I like YouGov and some of our research is supported by YouGov, but I’m kinda baffled cos when I googled I found this page by Anthony Wells, which estimates 42% for the Conservatives, 35% for Labour, and a prediction of “an increased Conservative majority in the Commons,” which seems to contradict the page I linked to above, with its prediction of a hung parliament. That’s the forecast I take seriously because it used MRP, but then it makes me wonder why their “Final call” was different. Once you have a model and a series of polls, why throw all that away when making your final call?

The (Lance) Armstrong Principle

If you push people to promise more than they can deliver, they’re motivated to cheat.

“Bombshell” statistical evidence for research misconduct, and what to do about it?

Someone pointed me to this post by Nick Brown discussing a recent article by John Carlisle regarding scientific misconduct.

Here’s Brown:

[Carlisle] claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance. . . . the implication is that some percentage of these impossible numbers are the result of fraud. . . .

I thought I’d spend some time trying to understand exactly what Carlisle did. This post is a summary of what I’ve found out so far. I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such “post publication analysis” techniques . . .

I agree with Brown that these things are worth studying. The funny thing is, it’s hard for me to get excited about this particular story, even though Brown, who I respect, calls it a “bombshell” that he anticipates will “have quite an impact.”

There are two reasons this new paper doesn’t excite me.

1. Dog bites man. By now, we know there’s lots of research misconduct in published papers. I use “misconduct” rather than “fraud” because, from the user’s perspective, I don’t really care so much whether Brian Wansink, for example, was fabricating data tables, or had students make up raw data, or was counting his carrots in three different ways, or was incompetent in data management, or was actually trying his best all along and just didn’t realize that it can be detrimental to scientific progress to be fast and loose with your data. Or some combination of all of these. Clarke’s Law.

Anyway, the point is, it’s no longer news when someone goes into a literature of p-value-based papers in a field with noisy data, and finds that people have been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” At this point, it’s to be expected.

2. As Stalin may have said, “When one man dies it’s a tragedy. When thousands die it’s statistics.” Similarly, the story of Satoshi Kanazawa or Brian Wansink or Daryl Bem has human interest. And even the stories without direct human interest have some sociological interest, one might say. For example, I can’t even remember who wrote the himmicanes paper or the ages-ending-in-9 paper, but in each case I’m still interested in the interplay between the plausible-but-oh-so-flexible theory, the weak data analysis, the poor performance of the science establishment, and the media hype. This new paper by Carlisle, though, is so general that it’s hard to grab onto the specifics of any single paper or set of papers. Also, for me, medical research is less interesting than social science.

Finally, I want to briefly discuss the current and future reactions to this study. I did a quick google and found it was covered on Retraction Watch, where Ivan Oransky quotes Andrew Klein, editor of Anaesthesia, as saying:

No doubt some of the data issues identified will be due to simple errors that can be easily corrected such as typos or decimal points in the wrong place, or incorrect descriptions of statistical analyses. It is important to clarify and correct these in the first instance. Other data issues will be more complex and will require close inspection/re-analysis of the original data.

This is all fine, and, sure, simple typos should just be corrected. But . . . if a paper has real mistakes I think the entire paper should be flagged as suspect. If the authors have so little control over their data and methods, then we may have no good reason to believe their claims about what their data and methods imply about the external world.

One of the frustrating things about the Richard Tol saga was that we became aware of more and more errors in his published article, but the journal never retracted it. Or, to take a more mild case, Cuddy, Norton, and Fiske published a paper with a bunch of errors. Fiske assures us that correction of the errors doesn’t change the paper’s substantive conclusions, and maybe that’s true and maybe it’s not. But . . . why should we believe her? On what evidence should we believe the claims of a paper where the data are mishandled?

To put it another way, I think it’s unfortunate that retractions and corrections are considered to be such a big deal. If a paper has errors in its representation of data or research procedures, that should be enough for the journal to want to put a big fat WARNING on it. That’s fine, it’s not so horrible. I’ve published mistakes too. Publishing mistakes doesn’t mean you have to be a bad person, nobody’s perfect.

So, if Anaesthesia and other journals want to correct incorrect descriptions of statistical analyses, numbers that don’t add up, etc., that’s fine. But I hope that when making these corrections—and when identifying suspicious patterns in reported data—they also put some watermark on the article so that future readers will know to be suspicious. Maybe something like this:

The authors of the present paper were not careful with their data. Their main claims were supported by t statistics reported as 5.03 and 11.14, but the actual values were 1.8 and 3.3.

Or whatever. The burden of proof should not be on the people who discovered the error to demonstrate that it’s consequential. Rather, the revelation of the error provides information about the quality of the data collection and analysis. And, again, I say this as a person who’s published erroneous claims myself.

Workshop on reproducibility in machine learning

Alex Lamb writes:

My colleagues and I are organizing a workshop on reproducibility and replication for the International Conference on Machine Learning (ICML). I’ve read some of your blog posts on the replication crisis in the social sciences and it seems like this workshop might be something that you’d be interested in.

We have three main goals in holding a workshop on reproducing and replicating results:

1. Provide a venue in Machine Learning for publishing replications, both successful and unsuccessful. This helps to give credit and visibility to researchers who work on replicating results as well as researchers whose results are replicated.

2. A place to share new ideas about software and tools for making research more reproducible.

3. A forum for discussing how reproducing research and replication affect different parts of the machine learning community. For example, what does it mean to reproduce the results of a recommendation engine that interacts with live humans?

I agree that this is a super-important topic because the fields of statistical methodology and machine learning are full of hype. Lots of algorithms that work in the test examples but then fail in new problems. This happens even with my own papers!

No conf intervals? No problem (if you got replication).

This came up in a research discussion the other day. Someone had produced some estimates, and there was a question: where are the conf intervals? I said that if you have replication and you graph the estimates that were produced, then you don’t really need conf intervals (or, for that matter, p-values). The idea is that the display of the different estimates (produced from different years, or different groups, or different scenarios, or whatever) gives you the relevant scale of variation. In general, this is even better than confidence intervals in that the variation is visually clear and less assumption-based.

What I’m saying is, use the secret weapon.
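Here's the secret weapon in miniature, with simulated data: a hypothetical coefficient of 0.30 estimated separately for each of ten years, displayed as a crude text dot plot. The horizontal scatter of the points across years is itself the relevant measure of uncertainty, no interval formulas needed.

```python
import random
import statistics

random.seed(3)
TRUE_EFFECT = 0.30  # hypothetical; unknown in a real analysis

# The same quantity estimated separately from each year's data:
# each year's estimate is the mean of 100 noisy observations (sd = 1),
# so each estimate has a true standard error of 0.10.
estimates = {
    year: statistics.mean(random.gauss(TRUE_EFFECT, 1.0) for _ in range(100))
    for year in range(2000, 2010)
}

# Crude text dot plot of the ten yearly estimates.
for year, est in sorted(estimates.items()):
    col = int(40 + 100 * est)  # map the estimate onto a character column
    print(f"{year} |{' ' * col}o  {est:+.2f}")

spread = statistics.stdev(estimates.values())
print(f"sd of the ten yearly estimates: {spread:.2f} (true se of each: 0.10)")
```

The empirical sd across the ten replications recovers roughly the standard error you would otherwise have computed from a model, which is the whole point: the replication does the work of the interval.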

The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and media exposure

The starting point is that we’ve seen a lot of talk about frivolous science, headline-bait such as the study that said that married women are more likely to vote for Mitt Romney when ovulating, or the study that said that girl-named hurricanes are more deadly than boy-named hurricanes, and at this point some of these studies are almost pre-debunked. Reporters are starting to realize that publication in Science or Nature or PPNAS is not only no guarantee of correctness but also no guarantee that a study is even reasonable.

But what I want to say here is that even serious research is subject to exaggeration and distortion, partly through the public relations machine and partly because of basic statistics. The push to find and publicize so-called statistically significant results leads to overestimation of effect sizes (type M errors), and crude default statistical models lead to broad claims of general effects based on data obtained from poor measurements and nonrepresentative samples.

One example we’ve discussed a lot is that claim of the effectiveness of early childhood intervention, based on a small-sample study from Jamaica. This study is not “junk science.” It’s a serious research project with real-world implications. But the results still got exaggerated. My point here is not to pick on those researchers. No, it’s the opposite: even top researchers exaggerate in this way so we should be concerned in general.

What to do here? I think we need to proceed on three tracks:
1. Think more carefully about data collection when designing these studies. Traditionally, design is all about sample size, not enough about measurement.
2. In the analysis, use Bayesian inference and multilevel modeling to partially pool estimated effect sizes, thus giving more stable and reasonable output.
3. When looking at the published literature, use some sort of Edlin factor to interpret the claims being made based on biased analyses.
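Point 2 can be illustrated with the classic eight-schools example (Rubin, 1981): the same coaching intervention evaluated in eight parallel experiments. The data below are real; the method-of-moments shortcut is just a stand-in for the full Bayesian multilevel model we'd actually recommend.

```python
import statistics

# Eight schools: estimated treatment effects and standard errors.
ests = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
ses  = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]

# Crude method-of-moments estimate of the between-study variance tau^2;
# full Bayesian inference would average over tau rather than plug it in.
tau2 = max(0.0, statistics.variance(ests) - statistics.mean(s * s for s in ses))

# Precision-weighted grand mean.
weights = [1.0 / (s * s + tau2) for s in ses]
mu = sum(w * y for w, y in zip(weights, ests)) / sum(weights)

# Partial pooling: each noisy estimate is pulled toward mu. When the
# between-study variance is indistinguishable from sampling noise
# (tau2 = 0, as happens here), the estimates pool completely.
if tau2 > 0:
    pooled = [(y / (s * s) + mu / tau2) / (1.0 / (s * s) + 1.0 / tau2)
              for y, s in zip(ests, ses)]
else:
    pooled = [mu] * len(ests)

print(f"tau^2 = {tau2:.1f}, pooled estimates: {[round(p, 1) for p in pooled]}")
```

The raw estimates range from -3 to 28; the pooled estimates are far more stable, which is exactly the behavior you want when individual studies are noisy.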

The above remarks are general; indeed, they were inspired by yesterday’s discussion about the design and analysis of psychology experiments, as I think there’s some misunderstanding in which people don’t see where assumptions enter various statistical analyses (see for example this comment).

A big big problem here, I think, is that many people seem to have the impression that, if you have a randomized experiment (or its quasi-randomized equivalent), then comparisons in your data can be given general interpretation in the outside world, with the only concern being “statistical significance.” But that view is not correct. You can have a completely clean randomized experiment, but if your measurements are not good enough, you can’t make general claims at all. Indeed, standard methods yield overestimates of effect sizes.

And, again, this is not just a problem with junk science. Naive overinterpretation of results from randomized comparisons: this is a problem with lots of serious work too in the human sciences.

How has my advice to psychology researchers changed since 2013?

Four years ago, in a post entitled, “How can statisticians help psychologists do their research better?”, I gave the following recommendations to researchers:

– Analyze all your data.

– Present all your comparisons.

– Make your data public.

And, for journal editors, I wrote, “if a paper is nothing special, you don’t have to publish it in your flagship journal.”

Changes since then?

The above advice is fine, but it’s missing something—a big something—regarding design and data collection. So let me add two more tips, arguably the most important pieces of advice of all:

– Put in the effort to take accurate measurements. Without accurate measurements (low bias, low variance, and a large enough sample size), you’re drawing dead. That’s what happened with the beauty-and-sex-ratio study, the ovulation-and-clothing study, the ovulation-and-voting study, the fat arms study, etc etc. All the analysis and shared data and preregistration in the world won’t save you if your data don’t closely address the questions you’re trying to answer.

– Do within-person comparisons where possible; that is, cross-over designs. Don’t worry about poisoning the well; that’s the least of your worries. Generally it’s much more important to get those direct comparisons.
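The value of within-person comparisons is easy to see in a simulation. Everything here is hypothetical (a true effect of 0.2, person-to-person sd of 2.0, measurement noise of 0.5), but the arithmetic is general: the crossover design cancels the stable person-level variation.

```python
import math
import random
import statistics

random.seed(4)
N = 200          # participants per arm; hypothetical
EFFECT = 0.2     # hypothetical small treatment effect
PERSON_SD = 2.0  # large stable differences between people
NOISE_SD = 0.5   # measurement noise within a person

people = [random.gauss(0.0, PERSON_SD) for _ in range(2 * N)]

# Between-person design: N treated people vs N different control people.
treated = [people[i] + EFFECT + random.gauss(0.0, NOISE_SD) for i in range(N)]
control = [people[N + i] + random.gauss(0.0, NOISE_SD) for i in range(N)]
between_est = statistics.mean(treated) - statistics.mean(control)

# Within-person (crossover) design: each person measured in both conditions,
# so the person-level term cancels exactly in the difference.
within_diffs = [
    (people[i] + EFFECT + random.gauss(0.0, NOISE_SD))
    - (people[i] + random.gauss(0.0, NOISE_SD))
    for i in range(N)
]
within_est = statistics.mean(within_diffs)

# Analytic standard errors for the two designs at the same N.
between_se = math.sqrt(2 * (PERSON_SD**2 + NOISE_SD**2) / N)
within_se = math.sqrt(2 * NOISE_SD**2 / N)
print(f"between-person: {between_est:+.2f} (se {between_se:.2f})")
print(f"within-person:  {within_est:+.2f} (se {within_se:.2f})")
```

At the same N, the between-person standard error (about 0.21 here) is as large as the effect itself, while the crossover standard error is 0.05. That’s the direct-comparisons advantage.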

A quote from William James that could’ve come from Robert Benchley or S. J. Perelman or Dorothy Parker

Following up on yesterday’s post, here’s a William James quote that could’ve been plucked right off the Algonquin Round Table:

Is life worth living? It all depends on the liver.

Using external C++ functions with PyStan & radial velocity exoplanets

Dan Foreman-Mackey writes:

I [Mackey] demonstrate how to use a custom C++ function in a Stan model using the Python interface PyStan. This was previously only possible using the R interface RStan (see an example here) so I hacked PyStan to make this possible in Python as well. . . .

I have some existing C++ code that implements my model and its derivatives so I don’t want to have to re-write the model in the Stan language. Furthermore, part of the model requires solving a transcendental equation numerically; it’s not obvious that applying autodiff to an iterative solver is a great idea, but the analytic gradients are trivial to evaluate.

The example that we’ll use is fitting radial velocity observations of an exoplanet. In particular, we’ll fit recent observations of 51 Peg b, the first exoplanet discovered around a main sequence star.

An exoplanet! Cool.

Mackey continues with tons of details. Great stuff.
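The transcendental equation he mentions is Kepler’s equation, M = E − e·sin(E), which relates the mean anomaly M to the eccentric anomaly E in an eccentric orbit. Here is a sketch of the kind of iterative solver involved—this is generic Newton iteration, not Foreman-Mackey’s actual code:

```python
import math

def solve_kepler(M, e, tol=1e-12, max_iter=100):
    """Solve Kepler's equation M = E - e*sin(E) for the eccentric
    anomaly E by Newton's method. M is the mean anomaly (radians),
    e the orbital eccentricity (0 <= e < 1)."""
    E = M if e < 0.8 else math.pi  # standard starting guess
    for _ in range(max_iter):
        f = E - e * math.sin(E) - M     # residual of Kepler's equation
        fp = 1.0 - e * math.cos(E)      # derivative with respect to E
        dE = f / fp
        E -= dE
        if abs(dE) < tol:
            break
    return E

E = solve_kepler(M=1.5, e=0.3)
print(E, E - 0.3 * math.sin(E))  # second value recovers M = 1.5
```

This also shows why his point about gradients makes sense: rather than autodiffing through the iteration, the implicit function theorem gives the derivative analytically, e.g. dE/de = sin(E) / (1 − e·cos(E)).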

I have some issues with his Stan model; in particular he uses priors with hard constraints, which I think is generally a bad idea. For example, he has a parameter with a uniform(-10, 5) prior. I can’t imagine this is the right thing to do. Following basic recommendations, it would be better to do something like normal(-2.5, 7.5), but really I have a feeling he could do a lot more regularization here. The uniform prior might work in the particular example that he was using, but in general it would be safer to control the inference a bit more. Mackey’s got a bunch of these uniform priors in his code and I think he should look carefully at all of them.
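The difference between the two priors is easy to see numerically (using the two distributions named above; note scipy parameterizes the uniform by location and width, and the normal by mean and standard deviation):

```python
from scipy.stats import norm, uniform

# uniform(loc=-10, scale=15) spans the interval (-10, 5) from the post;
# norm(-2.5, 7.5) is the weakly informative alternative suggested there.
hard = uniform(loc=-10, scale=15)
soft = norm(loc=-2.5, scale=7.5)

# Inside the interval the two priors are comparably flat...
print(hard.pdf(0.0), soft.pdf(0.0))

# ...but outside it the hard constraint assigns exactly zero density,
# so no amount of data can pull the posterior past the boundary:
print(hard.pdf(6.0))  # exactly 0.0
print(soft.pdf(6.0))  # small but positive
```

That zero density at the boundary is the problem: a hard constraint is an infinitely strong claim, while the normal prior gently regularizes without ruling anything out.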

The real point of Mackey’s post, though, is that he’s hacking Stan to solve his problem. And that’s great.

A collection of quotes from William James that all could’ve come from . . . Bill James!

From a few years ago, some quotes from the classic psychologist that fit within the worldview of the classic sabermetrician:

Faith means belief in something concerning which doubt is theoretically possible.

A chain is no stronger than its weakest link, and life is after all a chain.

A great many people think they are thinking when they are merely rearranging their prejudices.

A man has as many social selves as there are individuals who recognize him.

Acceptance of what has happened is the first step to overcoming the consequences of any misfortune.

An act has no ethical quality whatever unless it be chosen out of several all equally possible.

Be willing to have it so. Acceptance of what has happened is the first step to overcoming the consequences of any misfortune.

Belief creates the actual fact.

Compared to what we ought to be, we are half awake.

Do something everyday for no other reason than you would rather not do it, so that when the hour of dire need draws nigh, it may find you not unnerved and untrained to stand the test.

Great emergencies and crises show us how much greater our vital resources are than we had supposed.

Human beings can alter their lives by altering their attitudes of mind.

If any organism fails to fulfill its potentialities, it becomes sick.

If you want a quality, act as if you already had it.

In the dim background of mind we know what we ought to be doing but somehow we cannot start.

Individuality is founded in feeling; and the recesses of feeling, the darker, blinder strata of character, are the only places in the world in which we catch real fact in the making, and directly perceive how events happen, and how work is actually done.

It is only by risking our persons from one hour to another that we live at all. And often enough our faith beforehand in an uncertified result is the only thing that makes the result come true.

It is our attitude at the beginning of a difficult task which, more than anything else, will affect its successful outcome.

It is wrong always, everywhere, and for everyone, to believe anything upon insufficient evidence.

No matter how full a reservoir of maxims one may possess, and no matter how good one’s sentiments may be, if one has not taken advantage of every concrete opportunity to act, one’s character may remain entirely unaffected for the better.

Nothing is so fatiguing as the eternal hanging on of an uncompleted task.

Some are more Bill-James-like than others, but, as a whole, this list is kind of amazing.

Hey—here are some tips on communicating data and statistics!

This fall I’ll be again teaching the course, Communicating Data and Statistics.

Here’s the rough course plan. I’ll tinker with it between now and September but this is the basic idea. (The course listing is here, but that online description is out of date; the course plan linked above is more accurate.)

Here are the topics for the 13 weeks of the course:

1. Introducing yourself and telling a story
2. Principles of statistical graphics
3. Teaching
4. Making effective graphs
5. Communicating variation and uncertainty
6. Displaying fitted models
7. Giving a presentation
8. Dynamic graphics
9. Writing
10. Collaboration and the scientific community
11. Data processing and programming
12. Student projects
13. Student projects

Communication is central to your job as a quantitative researcher. Our goal in this course is for you to improve at all aspects of statistical communication, including writing, public speaking, teaching, informal conversation and collaboration, programming, and graphics.

Always keep in mind your goals and your audience.

Never forget, as one of our blog commenters reminds us: your closest collaborator is you six months ago . . . and she doesn’t reply to email!

The course is intended for Ph.D. students from all departments on campus; it is also open to some masters and undergraduate students who have particular interest in the topic.

This is my favorite course ever. As a student, you’ll get practice in all sorts of useful skills that are central to data and statistics, and you’ll participate in fast-moving conversations with fellow students with different backgrounds and experiences. In the interstices, you’ll learn all sorts of important ideas and methods in statistical design and analysis that you’d never learn anywhere else. It’s the course where we first introduced statistics diaries. It’s the course where (a prototype version of) ShinyStan was one of the final projects!

You don’t want to miss this one.

The class will meet twice a week.

You’ll never guess this one quick trick to diagnose problems with your graphs and then make improvements

The trick is to consider graphs as comparisons.

Here’s the story. This post from several years ago shows a confusing and misleading pair of pie charts from a Kenyan election:

The quick reaction would be to say, ha ha, pie charts. But that’s not my point here. Sure, pie charts have problems and I think they’re almost never the right way to share data. But sometimes they’re not so bad.

To see the problem with the above display, we have to go back to first principles. And, with graphs, the first principle is always to consider what comparisons you would like to display, and what comparisons does the graph facilitate.

In this example, the main goal seems to be to compare the official results to the new exit poll. Thus, there are four comparisons to be made in parallel. A second goal is to compare the four categories within each dataset.

What about the pie charts? What comparisons do they easily allow?

The most salient point of any pie chart is that the percentages add up to 100%. So the first thing the graphs do is allow each of the proportions to be visually compared to the reference points of 100%, 50%, and 0%, and also 25% and 75%. We can see pretty clearly that in each dataset there are no candidates who received more than 50%, and there are two candidates who received between 25% and 50%.

But the pair of pie charts do not make it at all easy to compare each candidate from official results to the new poll. Part of that is perverse design choices, as the locations of the four “slices” have been permuted from one pie to another. But even if that had not been done—even if the four categories had been kept in pretty much the same position in each graph, and even if the coloring had been kept consistent—it would still be very difficult to compare angles of the different pies.

Basically the only way you can do it is to compare the numbers. And in that case a table would be clearer, as the relevant pairs could be written side by side. I’d prefer a dot plot.
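A dot plot puts each candidate’s two numbers on the same row, so the official-versus-poll comparison is a short horizontal glance. Here is a minimal matplotlib sketch of the layout; the percentages are made-up placeholders, not the actual Kenyan figures:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical percentages standing in for the real Kenyan figures,
# just to show the layout: one row per candidate, two dots per row.
candidates = ["Candidate A", "Candidate B", "Candidate C", "Other"]
official = [46, 43, 8, 3]
exit_poll = [41, 47, 9, 3]

fig, ax = plt.subplots()
y = list(range(len(candidates)))
ax.plot(official, y, "o", label="Official results")
ax.plot(exit_poll, y, "s", label="Exit poll")
ax.set_yticks(y)
ax.set_yticklabels(candidates)
ax.set_xlabel("Vote share (%)")
ax.legend()
fig.savefig("dotplot.png")
```

The design choice: both comparisons of interest—across datasets within a candidate, and across candidates within a dataset—become position-along-a-common-scale judgments, which readers do far more accurately than comparing angles.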

Why, then, the pie? Part of this must be sheer convenience: whoever made these plots happened to know how to make pies. But I think it’s more than that. I think the real appeal of the pie is that it graphically shows the percentages adding up to 100. And, sure, that’s something, but in this case it doesn’t really help with what I think is the key comparison of interest.

I had the same problem with Florence Nightingale’s clock plot. People think that graph is the coolest thing in the world—the wheel of time and all that—but it doesn’t facilitate the time-series comparisons that are the real goal in that example.

So, again, the point is not simply Don’t do pie charts (although I would endorse that message).

Rather, the point is: When making graphs, think about the comparisons you want readers to be able to visualize, and then evaluate possible displays based on their performance in making such comparisons clear.

U.K. news article congratulates YouGov on using modern methods in polling inference

Mike Betancourt pointed me to this news article by Alan Travis that is refreshingly positive regarding the use of sophisticated statistical methods in analyzing opinion polls.

Here’s Travis:

Leading pollsters have described YouGov’s “shock poll” predicting a hung parliament on 8 June as “brave” and the decision by the Times to splash it on its front page as “even braver”.

It is certainly rare for a polling company to produce a seats prediction. They usually leave that to psephologists and political scientists.

Good catch: YouGov’s chief scientist is Doug Rivers, who’s a political scientist. And the new method they’re using is multilevel regression and poststratification (MRP or Mister P), which was developed by . . . me! And I’m a political scientist too.

Travis continues:

But it is even more unusual for a company to suddenly employ a new polling model 10 days before a British general election.

Good point! That really would be an unusual practice. But, in fact, MRP is not so new—our first paper on the method (“Poststratification into many categories using hierarchical logistic regression,” with Tom Little) was published 20 years ago.
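For readers unfamiliar with the method: the multilevel regression step produces an estimate for each demographic-by-geography cell, and the poststratification step reweights those cell estimates by census counts. The second step is just a weighted average; here is a minimal sketch with made-up numbers (in real MRP the cell estimates come from the fitted multilevel model, not from thin air):

```python
import numpy as np

# Hypothetical poststratification table: one row per demographic cell.
cell_support = np.array([0.62, 0.48, 0.55, 0.39])    # modeled P(support) per cell
cell_pop = np.array([12000, 30000, 8000, 25000])     # census count per cell

# Poststratified estimate: population-weighted average of cell estimates.
estimate = np.sum(cell_support * cell_pop) / np.sum(cell_pop)
print(f"Poststratified support: {estimate:.3f}")
```

The point of the weighting is that the estimate reflects the population’s demographic composition rather than the (typically unrepresentative) sample’s, which is what makes the method usable with opt-in panels like YouGov’s.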

Travis then supplies some details of YouGov’s forecast:

The Tories are in line to lose 20 seats, giving them 310, and Labour is set to gain 30, giving it 257 . . . But as Stephan Shakespeare, YouGov’s chief executive, notes in an accompanying analysis, that is only a central projection that “allows for a wide range of error” and he concedes: “However, these are just the midpoints and, as with any model, there is leeway either side. The Tories could end up with as many as 345 and Labour as few as 223, leading to an increased Conservative majority.” . . . The Times says the projection means the Tories “could get as many as 345 seats on a good night . . . and as few as 274 on a bad night”. That is a pretty wide range.

It’s good to see Travis explaining that realistic forecasts have wide ranges of uncertainty.

He continues with some details:

The methodology involved is described as “multi-level regression and post-stratification” analysis and is based on a substitute for traditional constituency polling, which it regards as “prohibitively expensive”. Shakespeare claimed YouGov tested it during last year’s EU referendum campaign and it produced leave leads every time. What a shame YouGov did not feel like sharing it with voters while their own published referendum polls showed a remain lead right up to polling day.

Travis missed a beat on this one! YouGov did share their MRP estimates during the EU referendum campaign. Ummm . . . let me google that . . . here it is: “YouGov uses Mister P for Brexit poll”: it’s an article by Ben Lauderdale and Doug Rivers which includes this graph showing Leave in the lead:


OK, it’s good we got that settled.

Travis concludes:

In an industry already suffering from an existential crisis of public confidence it is indeed a “brave” decision to come up with this one 10 days before a general election and promise to publish its results several times more before polling day.

“Brave” is perhaps overstating things, but, yes, I agree with Travis that presenting MRP estimates is the honorable way to go. Using the best available methods based on the best available knowledge, and giving appropriately wide uncertainty intervals: that’s how to do it.

It’s indeed a shame that Travis was not aware that YouGov had released a report with MRP in the lead-up to the Brexit vote. Also Travis, coming from the U.K., might not have realized that MRP was used in U.S. polls too. For example there was this Florida poll from September, 2016, which was analyzed by four different groups:

Charles Franklin, who estimated Clinton +3
Patrick Ruffini, who estimated Clinton +1
New York Times, who estimated Clinton +1
Sam Corbett-Davies, David Rothschild, and me, who estimated Trump +1. We used MRP.

That said, I don’t have any crystal ball. Shortly before the election I was promoting a forecast that gave Clinton a 92% chance of winning the election. That forecast didn’t use MRP; it aggregated information from state polls without appropriately accounting for expected levels of systematic error. I should’ve used MRP myself—I would’ve, but I wasn’t working with the raw data. In this upcoming U.K. election, YouGov does have the raw polling data so of course they are using MRP; it would at this point be a strange choice not to.
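The systematic-error point can be made concrete with a back-of-the-envelope calculation (all numbers below are illustrative, not the actual 2016 inputs): averaging many polls shrinks the sampling error toward zero, but a shared bias term does not average away, so ignoring it yields overconfident win probabilities.

```python
from math import erf, sqrt

def win_prob(poll_avg, sampling_se, n_polls, bias_sd):
    """P(true margin > 0) given a poll average, treating per-poll sampling
    error as independent and adding a shared systematic-error term."""
    se = sqrt(sampling_se**2 / n_polls + bias_sd**2)
    z = poll_avg / se
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

# Illustrative numbers: 30 polls, 2-point average lead, 3-point sampling se.
naive = win_prob(0.02, 0.03, 30, bias_sd=0.0)       # no systematic error
realistic = win_prob(0.02, 0.03, 30, bias_sd=0.02)  # 2-point shared-error sd
print(f"naive: {naive:.2f}, with shared error: {realistic:.2f}")
```

With independent errors the poll average looks nearly decisive; once a plausible shared error is included, the same 2-point lead corresponds to a much more modest win probability.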

Anyway, it’s cool that Alan Travis conveyed some subtleties of polling to the British newspaper audience, even if he did miss one or two things.

Full disclosure: YouGov gives some financial support to the Stan project.

P.S. Thanks to Steven Johnson for the photo at the top of this post.

P.P.S. More on YouGov’s use of MRP here.

Another serious error in my published work!

Uh oh, I’m starting to feel like that pizzagate guy . . .

Here’s the background. When I talk about my serious published errors, I talk about my false theorem, I talk about my empirical analysis that was invalidated by miscoded data, I talk about my election maps whose flaws were pointed out by an angry blogger, and I talk about my near-miss regarding Portugal.

OK, fine. But then I was going through old blog posts and found a published error of mine that I’d completely forgotten about! It was a statement in one of my most influential papers—just a small part of the paper, it didn’t change our main results, but we really were wrong, and arguably our mistake misled people. I’m glad that later researchers were suspicious of our statement, checked it, and pointed out how we’d been wrong.

More graphs of mortality trends

Corinne Riddell writes:

In late March you released a series of plots visualizing mortality rates over time by race and gender. For almost a year now, we’ve been working on a similar project and have compiled all of our findings into an R shiny web app here, with a preprint of our first manuscript here. I’m also attaching our preprint’s web appendix as it fails to load on bioRxiv at the moment. The web app takes about 30 seconds to load, but if you can stand the wait, you’ll be able to interact with the app to investigate changes to life expectancy over time by state, race, and cause of death.
In the spirit of eliciting peer review, I would be happy to have you and your readers have a look at the app and/or the preprint.

In the spirit of providing feedback, I started clicking through and first came across this graph:

Uh oh! Something went terribly wrong here. I reloaded the page and got this:

I like certain aspects of the visual layout but to my eye there’s just too much going on with the colors, the dotted lines, the multiple columns, all happening at once. Also I don’t like how the top and bottom graphs blur together: visually it doesn’t work at all for me. I think they could have a much cleaner display here.

The next one I also think is just way too tricky:

I find myself doing lots of mental arithmetic trying to add and subtract and compare these different numbers.

In any case, I think Riddell and her colleagues are doing it exactly right: they’re putting their graphs out there and getting comments so they can do better. I certainly don’t think my own graphs are perfect and I’m glad to see various people taking their cracks at making these data accessible in different ways.