
ProPublica Surgeon Scorecard Update

Adan Becerra writes:

In light of your previous discussions on the ProPublica surgeon scorecard, I was hoping to hear your thoughts about this article recently published in Annals of Surgery titled, “Evaluation of the ProPublica Surgeon Scorecard ‘Adjusted Complication Rate’ Measure Specifications.”

The article is by K. Ban, M. Cohen, C. Ko, M. Friedberg, J. Stulberg, L. Zhou, B. Hall, D. Hoyt, and K. Bilimoria and begins:

The ProPublica Surgeon Scorecard is the first nationwide, multispecialty public reporting of individual surgeon outcomes. However, ProPublica’s use of a previously undescribed outcome measure (composite of in-hospital mortality or 30-day related readmission) and inclusion of only inpatients have been questioned. Our objectives were to (1) determine the proportion of cases excluded by ProPublica’s specifications, (2) assess the proportion of inpatient complications excluded from ProPublica’s measure, and (3) examine the validity of ProPublica’s outcome measure by comparing performance on the measure to well-established postoperative outcome measures.

They find:

ProPublica’s inclusion criteria resulted in elimination of 82% of all operations from assessment (range: 42% for total knee arthroplasty to 96% for laparoscopic cholecystectomy). For all ProPublica operations combined, 84% of complications occur during inpatient hospitalization (range: 61% for TURP to 88% for total hip arthroplasty), and are thus missed by the ProPublica measure. Hospital-level performance on the ProPublica measure correlated weakly with established complication measures, but correlated strongly with readmission.

And they conclude:

ProPublica’s outcome measure specifications exclude 82% of cases, miss 84% of postoperative complications, and correlate poorly with well-established postoperative outcomes. Thus, the validity of the ProPublica Surgeon Scorecard is questionable.

When this came up before, I wrote, “The more important criticisms involved data quality, and that’s something I can’t really comment on, at least without reading the report in more detail.”

And that’s still the case. I still haven’t put in any effort to follow this story. So I’ll repeat what I wrote before:

You fit a model, do the best you can, be open about your methods, then invite criticism. You can then take account of the criticisms, include more information, and do better.

So go for it, ProPublica. Don’t stop now! Consider your published estimates as a first step in a process of continual quality improvement.

At this point, I’d like ProPublica not to try to refute these published data-quality criticisms (unless they truly are off the mark) but rather to thank the critics and take this as an opportunity to do better. Let this be the next step in an ongoing process.



I’ve spent a lot of time mocking Marc Hauser on this blog, and I still find it annoying that, according to the accounts I’ve seen, he behaved unethically toward his graduate students and lab assistants, he never apologized for manipulating data, and, perhaps most unconscionably, he wasted the lives of who knows how many monkeys in discredited experiments.

But, fine, nobody’s perfect. On the plus side—and I’m serious about this—it seems that Hauser was not kidding when, upon getting kicked out of his position as a Harvard professor, he said he’d be working with at-risk kids.

Here’s the website for his organization, Risk Eraser. I’m in no position to evaluate this work, and I’d personally think twice before hiring someone with a track record of faking data, but I looked at the people listed on his advisory board and they include Steven Pinker and Susan Carey, two professors in the Harvard Psychology Department. This is notable because Hauser dragged that department into the muck, and Pinker and Carey, as much as almost anybody, would have reason to be angry with him. But they are endorsing his new project. This suggests that these two well-respected researchers, who know Hauser much better than I do (that is, more than zero!), have some trust in him: they think his project is real and that it’s doing some good.

That’s wonderful news. Hauser is clearly a man of many talents, even if experimental psychology is not his strong point, and, no joke, no sarcasm at all, I’m so happy to see that he is using these talents productively. He’s off the academic hamster wheel and doing something real with his life. An encouraging example for Michael Lacour, Diederik Stapel, Jonah Lehrer, Team Power Pose, Team Himmicanes, and all the rest. (I think Weggy and Dr. Anil Potti are beyond recovery, though.)

We all make mistakes in various aspects of life, and let’s hope we can all turn things around in the manner of Marc Hauser: forget about trying to be a “winner” and just move forward and help others. I imagine Hauser is feeling good about this life shift as well, redirecting the effort he was spending trying to be a public intellectual and to stay ahead of the curve in science toward simply helping others.

P.S. You might feel that I’m being too hard on Hauser here: even while celebrating his redemption I can’t resist getting in some digs on his data manipulation. But that’s kinda the point: even if Hauser has not fully reformed his ways, maybe especially if that is the case, his redemption is an inspiring story. In reporting this I’m deliberately not oversimplifying; I’m not trying to claim that Hauser has resolved all his issues. He still, to my knowledge, has shown neither contrition nor even understanding of why it’s scientifically and professionally wrong to lie about and hide your data. But he’s still moved on to something useful (at least, I’ll take the implicit word of Carey and Pinker on this one, I don’t know any of these people personally). And that’s inspiring. It’s good to know that redemption can be achieved even without a full accounting for one’s earlier actions.

P.P.S. Some commenters share snippets from Hauser’s Risk Eraser program and some of it sounds kinda hype-y. So perhaps I was too charitable in my quick assessment above.

Hauser may be “trying to help others” (as I wrote above), but that doesn’t mean his methods will actually help.

For example, they offer “the iWill game kit, a package of work-out routines that can help boost your students’ willpower through training.” And then they give the following reference:

Baumeister, R.F. (2012). Self-control: the moral muscle. The Psychologist.

We’ve heard of this guy before; he’s in the “can’t replicate and won’t admit it” club. Also this, which looks like another PPNAS-style mind-over-matter bit of p-hacking:

Job, V., Walton, G.M., Bernecker, K. & Dweck, C.S. (2013). Beliefs about willpower determine the impact of glucose on self-control. Proceedings of the National Academy of Sciences.

OK, so what’s going on here? I was giving Hauser the benefit of the doubt because of the endorsements from Pinker and Carey, who are some serious no-bullshit types.

But maybe that inference was a mistake on my part. One possibility is that Pinker and Carey fell for the TED-talking Baumeister PPNAS hype, just like so many others in their profession. Another possibility is that they feel sorry for Hauser and agreed to be part of his program as a way to help out their old friend. This one seems hard to imagine—given what Hauser did to the reputation of their department, I’d think they’d be furious at the guy, but who knows?

Too bad. I really liked the story of redemption, but this Risk Eraser thing seems not so different from what Hauser was doing before: shaky science and fast talking, with the main difference being that this time he’s only involved in the “fast talking” part. Baumeister indeed. Why not go whole hog and start offering power pose training and ESP?

An auto-mechanic-style sign for data sharing

Yesterday’s story reminds me of that sign you used to see at the car repair shop:


Maybe we need something similar for data access rules:


If you want to write a press release for us        $   50.00
If you want to write a new paper using our data    $   90.00
If you might be questioning our results            $  450.00*
If you're calling from Retraction Watch            $30000.00

* Default rate unless you can convince us otherwise

Whaddya think? Anyone interested in adding a couple more jokes and formatting it poster style with some cute graphic?

Sharing data: Here’s how you do it, and here’s how you don’t

I received the following email today:

Professor Gelman,

My name is **, I am a senior at the University of ** studying **, and recently came across your paper, “What is the Probability That Your Vote Will Make a Difference?” in my Public Choice class. I am wondering if you are able to send me the actual probabilities that you calculated for all of the states, as some are mentioned in the paper, but I can’t find the actual data anywhere online.

The reason I ask is that I am trying to do some analysis on rational voter absenteeism. Specifically I want to see if there is any correlation between the probability that someone’s vote will make a difference (From your paper) and the voter turnout in each state in the 2008 election.


Hmmm, where are the data? I went to the page of my published papers, searched on “What is the probability” and found the file, which was called probdecisive2.pdf, then searched on my computer for that file name, found the directory, came across two seemingly relevant files, electionnight.R and nate.R, and sent this student a quick email with those two R files and all the data files that were referenced there. No big deal, it took about 5 minutes.

And then I was reminded of this item that Malte Elson pointed me to the other day, a GoFundMe website that begins:

My name is Chris Ferguson, I am a psychology professor at Stetson University in DeLand, FL. In my research, I’m studying how media affect children and young adults.

Earlier this year, another researcher from Brigham Young University published a three-year longitudinal study between viewing relational aggression on TV and aggressive behavior in the journal Developmental Psychology. Longitudinal studies are rare in my field, so I was very excited to see this study, and eager to take a look at the data myself to check up on some of the analyses reported by the authors.

So I spoke with the Flourishing Families project staff who manage the dataset from which the study was published and which was authored by one of their scholars. They agreed to send the data file, but require I cover the expenses for the data file preparation ($300/hour, $450 in total; you can see the invoice here). Because I consider data sharing a courtesy among researchers, I contacted BYU’s Office of Research and Creative Activities and they confirmed that charging a fee for a scholarly data request is consistent with their policy.

Given I have no outside funding, I might not be able to afford the dataset of Dr. [Sarah] Coyne’s study, although it is very important for my own research. Although somewhat unconventional, I am hoping that this fundraising site will help me cover parts of the cost!

The paper in question was published in the journal Developmental Psychology. On the plus side, no public funding seems to have been involved, so I guess I can’t say that these data were collected with your tax dollars. If BYU wants to charge $300/hr for a service that I provide for free, they can go for it.

Here’s the invoice:


In the future, perhaps journals will require all data to be posted as a condition of publication, and then this sort of thing won’t happen anymore.

P.S. Related silliness here.

“Evaluating Online Nonprobability Surveys”

Courtney Kennedy, Andrew Mercer, Scott Keeter, Nick Hatley, Kyley McGeeney and Alejandra Gimenez wrote this very reasonable report for Pew Research. Someone should send a copy to Michael W. Link or whoever’s running the buggy-whip show nowadays.

Let’s play Twister, let’s play Risk


Alex Terenin, Dan Simpson, and David Draper write:

Some months ago we shared with you an arxiv draft of our paper, Asynchronous Distributed Gibbs Sampling.

Through comments we’ve received, for which we’re highly grateful, we came to understand that

(a) our convergence proof was wrong, and

(b) we actually have two algorithms, one exact and one approximate.

We’ve now remedied (a) and thoroughly explored (b).

The proof is constructive and interesting—it uses ideas from parallel tempering and from the Banach fixed-point contraction-mapping theorem.

The approximate algorithm runs much faster (than the exact version) at big-data scale, so we provide a diagnostic to see whether the approximation is good enough for its advantage in speed to be enjoyed in the problem currently being solved.

Wow, Gibbs sampling, that’s so 90’s. Do people still do this anymore? And parallel tempering, that’s what people used to do before adiabatic Monte Carlo, right?

All jokes aside, here’s the paper, and here’s their story:

The e-commerce company eBay, Inc. is interested in using the Bayesian paradigm for purposes of inference and decision-making, and employs a number of Bayesian models as part of its operations. One of the main practical challenges in using the Bayesian paradigm on a modern industrial scale is that the standard approach to Bayesian computation for the past 25 years – Markov Chain Monte Carlo (MCMC) [22, 33] – does not scale well, either with data set size or with model complexity. This is especially of concern in e-commerce applications, where typical data set sizes range from n = 1, 000, 000 to n = 10, 000, 000, 000.

In this paper we [Terenin, Simpson, and Draper] offer a new algorithm – Asynchronous Gibbs sampling – which removes synchronicity barriers that hamper the efficient implementation of most MCMC methods in a distributed environment. This is a crucial element in improving the behavior of MCMC samplers for complex models that fall within the current “Big Data/Big Models” paradigm: Asynchronous Gibbs is well-suited for models where the dimensionality of the parameter space increases with sample size. We show that Asynchronous Gibbs performs well on both illustrative test cases and a real large-scale Bayesian application.

Just as a minor point, I think they’re kneecapping themselves by running only one chain. Check out this, for example:

[screenshot from the paper]

This sort of thing really cries out for multiple chains. No big deal—it would be trivial to implement their algorithm in parallel, it’s just a good idea to do so, otherwise you’re likely to be running your simulations way too long out of paranoia.
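To make the multiple-chains point concrete, here is a minimal sketch of the basic Gelman-Rubin R-hat diagnostic, which is what running several chains buys you: a cheap check that the chains agree, so you can stop without paranoia. The simulated “chains” below are made up for illustration, not output from their algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def rhat(chains):
    """Basic Gelman-Rubin potential scale reduction factor.
    chains has shape (m, n): m chains of n draws each."""
    n = chains.shape[1]
    between = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    within = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * within + between / n
    return float(np.sqrt(var_plus / within))

# Four well-mixed "chains" targeting the same distribution...
good = rng.normal(size=(4, 1000))
# ...versus four chains stuck around different values (poor mixing):
bad = good + np.array([[0.0], [1.0], [2.0], [3.0]])

print(round(rhat(good), 2), round(rhat(bad), 2))  # first near 1, second well above 1
```

With a single chain there is nothing to compare against, which is exactly why people end up running “way too long out of paranoia.”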

No guarantee

From a public relations article by Karen Weintraub:

An anti-aging startup hopes to elude the U.S. Food and Drug Administration and death at the same time. The company, Elysium Health, says it will be turning chemicals that lengthen the lives of mice and worms in the laboratory into over-the-counter vitamin pills that people can take to combat aging. . . . The problem, [startup founder] Guarente says, is that it’s nearly impossible to prove, in any reasonable time frame, that drugs that extend the lifespan of animals can do the same in people; such an experiment could take decades. That’s why Guarente says he decided to take the unconventional route of packaging cutting-edge lab research as so-called nutraceuticals, which don’t require clinical trials or approval by the FDA.

So far so good. But then this:

This means there’s no guarantee that Elysium’s first product, a blue pill called Basis that is going on sale this week, will actually keep you young.

Ummm . . . do you think that if the product did have a clinical trial and was approved by the FDA, that there would be a “guarantee” it would actually keep you young?

P.S. As an MIT graduate, I’m disappointed to see this sort of press release published in Technology Review. Hype hype hype hype hype.

Solving Statistics Problems Using Stan (my talk at the NYC chapter of the American Statistical Association)

Here’s the announcement:

Solving Statistics Problems Using Stan

Stan is a free and open-source probabilistic programming language and Bayesian inference engine. In this talk, we demonstrate the use of Stan for some small fun problems and then discuss some open problems in Stan and in Bayesian computation and Bayesian inference more generally.

It’s next Tues, 20 Sept, 2pm, at the CUNY Graduate Center, 365 Fifth Avenue (at 34th Street), room to be announced.

You can register for it here.

Bayesian Statistics Then and Now

I happened to recently reread this article of mine from 2010, and I absolutely love it. I don’t think it’s been read by many people—it was published as one of three discussions of an article by Brad Efron in Statistical Science—so I wanted to share it with you again here.

This is the article where I introduce three meta-principles of statistics:

The information principle: the key to a good statistical method is not its underlying philosophy or mathematical reasoning, but rather what information the method allows us to use. Good methods make use of more information.

The methodological attribution problem: the many useful contributions of a good statistical consultant, or collaborator, will often be attributed to the statistician’s methods or philosophy rather than to the artful efforts of the statistician himself or herself. I give as examples Rubin, Efron, and Pearl.

Different applications demand different philosophies, demonstrated with a discussion of the assumptions underlying the so-called false discovery rate.

Also this:

I also think that Efron is doing parametric Bayesian inference a disservice by focusing on a fun little baseball example that he and Morris worked on 35 years ago. If he would look at what is being done now, he would see all the good statistical practice that he naively (I think) attributes to “frequentism.”

As I get older, I too rely on increasingly outdated examples. So I can see how this can happen but we should be aware of it.

Also this:

I also completely disagree with Efron’s claim that frequentism (whatever that is) is “fundamentally conservative.” One thing that “frequentism” absolutely encourages is for people to use horrible, noisy estimates out of a fear of “bias.”

That’s a point I’ve been banging on a lot recently, the idea that people make all kinds of sacrifices and put their data into all sorts of contortions in order to feel that they are using a rigorous, unbiased procedure.


as discussed by Gelman and Jakulin (2007) [here], Bayesian inference is conservative in that it goes with what is already known, unless the new data force a change. In contrast, unbiased estimates and other unregularized classical procedures are noisy and get jerked around by whatever data happen to come by—not really a conservative thing at all.

Back in 2010 I didn’t know about the garden of forking paths, but since then I’ve seen, over and over, how classical inference based on p-values and confidence intervals (or, equivalently, noninformative priors) leads to all sorts of jumping to conclusions. Power pose, embodied cognition, himmicanes: these are all poster boys for the anticonservatism of classical statistics and the potential of Bayesian inference to bring us to some sort of sanity.

And this comment on “frequentism” more generally:

Of course, frequentism is a big tent and can be interpreted to include all sorts of estimates, up to and including whatever Bayesian thing I happen to be doing this week—to make any estimate “frequentist,” one just needs to do whatever combination of theory and simulation is necessary to get a sense of my method’s performance under repeated sampling. So maybe Efron and I are in agreement in practice, that any method is worth considering if it works, but it might take some work to see if something really does indeed work.

“It might take some work to see if something really does indeed work”: I like that. Clever and also true, and also it leaves an opening for statistical theory and, for that matter, frequentism: we want to see if something really does indeed work. Or, to put it more precisely, to understand the domains in which a method will work, and to also understand where and how it will fail.
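As a toy illustration of checking whether something “really does indeed work” under repeated sampling, here is a small simulation of my own devising (a made-up normal-normal example, not anything from Efron’s article or my discussion): the frequentist coverage of a Bayesian posterior interval, evaluated by simulating from the prior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: y_1..y_n ~ Normal(theta, 1), prior theta ~ Normal(0, 1).
# The posterior is then Normal(n * ybar / (n + 1), 1 / (n + 1)).
# Question: does the central 95% posterior interval cover the true
# theta about 95% of the time, when theta is drawn from the prior?
n, sims = 10, 5000
covered = 0
for _ in range(sims):
    theta = rng.normal(0, 1)
    y = rng.normal(theta, 1, size=n)
    post_mean = n * y.mean() / (n + 1)
    post_sd = (1 / (n + 1)) ** 0.5
    covered += post_mean - 1.96 * post_sd <= theta <= post_mean + 1.96 * post_sd

print(covered / sims)  # about 0.95
```

When the true parameter really is drawn from the prior, the coverage is calibrated; the interesting (and harder) exercise is repeating this with theta fixed at values the prior considers unlikely.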

Lots more good stuff in this 4-page article. As the saying goes, read the whole thing.

Stan wins again!

See here.

Genius is not enough: The sad story of Peter Hagelstein, living monument to the sunk-cost fallacy


I sometimes pick up various old collections that will be suitable for bathroom reading, and so it was that the other day I was sitting on the throne reading the summer 1985 issue of Granta, entitled Science.

Lots of great stuff here, including Oliver Sacks on Tourette’s syndrome, Thomas McMahon on Alexander Graham Bell, and articles by Darryl Pinckney, Italo Calvino, Stephen Jay Gould, and several other celebrated authors. Reading it gave me a pleasant sense of dislocation because the pieces were just slapped down, juxtaposed with captionless photos, requiring me to situate myself anew when reading each piece.

Anyway, I next turned to an article by William Broad, the long-time New York Times science writer, on The Scientists of Star Wars. He wrote about Edward Teller and Lowell Wood, of course, and then . . . Peter Hagelstein.

Where had I heard that name before?

Back in 1989, when that “cold fusion” thing hit all the newspapers, I had a friend in the physics department who told me about some young theorist at MIT, Peter Hagelstein, who, in a burst of effort, had come up with a physical model for cold fusion. Apparently the guy was gunning for the Nobel Prize. At the time we were not laughing: cold fusion was a big deal. Sure there was some skepticism, and it was considered to be unproven, but we didn’t think it was bogus either.

Interesting that this one physicist was in on two of the biggest scientific debacles of the late twentieth century. Broad’s portrayal of Hagelstein in that 1985 article seemed consistent with what my friend had heard. Brilliant guy, willing and able to work hard, lots of ambition, chose the wrong stars to hitch his wagon to.

A cautionary tale for all of us. It doesn’t matter how smart you are: if you work on the wrong project you won’t get anywhere. None of us has perfect scientific taste, of course: Isaac Newton worked on alchemy, and so forth. Hagelstein just had the bad luck to devote his career to two such projects.

Or, I should say, bad luck to enter these projects, bad judgment to stick with them. He’s a living monument to the sunk-cost fallacy.

Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2pm Thurs 15 Sept)

This is the conference, and here’s my talk (will do Google hangout, just as with my recent talks in Bern, Strasbourg, etc):

Hypothesis Testing is a Bad Idea

Through a series of examples, we consider problems with classical hypothesis testing, whether performed using classical p-values or confidence intervals, Bayes factors, or Bayesian inference using noninformative priors. We locate the problem not in the use of any particular statistical method but rather with larger problems of deterministic thinking and a misguided version of Popperianism in which the rejection of a straw-man null hypothesis is taken as confirmation of a preferred alternative. We suggest solutions involving multilevel modeling and informative Bayesian inference.

P.S. More here from 2014.

You may not be interested in peer review, but peer review is interested in you

Here’s an ironic juxtaposition from Tyler Cowen’s blog.

On 28 Apr he discusses a paper with a market system for improving peer review and concludes, “Interesting, but the main problem with the idea is simply that no one cares.”

The day before, his assorted links featured “Frequency of Sex Shapes Automatic, but Not Explicit, Partner Evaluations,” which pointed to a Psychological Science paper that was just full of forking paths, arbitrary data selection and analysis decisions, comparisons of statistically significant to non-significant, characterization of “p less than .1” as “marginally significant” . . . the works. The full Psych Sci monty.

The point is that, as long as generally savvy opinion leaders such as Tyler Cowen trust peer-reviewed publications on sight, we can’t afford not to care about peer review.

Q: “Is A 50-State Poll As Good As 50 State Polls?” A: Use Mister P.

Jeff Lax points to this post from Nate Silver and asks for my thoughts.

In his post, Nate talks about data quality issues of national and state polls. It’s a good discussion, but the one thing he unfortunately doesn’t talk about is multilevel regression and poststratification (or see here for more). What you want to do is fit a multilevel regression to your raw data so as to estimate your outcome of interest (for example, support for Hillary Clinton) in demographic/geographic slices of the population, characterized by age, sex, ethnicity, education, state of residence, maybe some other variables, maybe party identification as well. Then you poststratify using some combination of census and poll data that give you the number of people in each category within each state.

That’s the way to go.
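The poststratification step itself is just a population-weighted average of the cell-level model estimates. Here is a minimal sketch for a single state, with entirely made-up numbers and a deliberately tiny set of demographic cells (a real application would cross age, sex, ethnicity, education, and more):

```python
# Hypothetical cell-level estimates for one state, as if produced by a
# fitted multilevel model; counts as if taken from the census.
cells = [
    # (modeled Clinton support in cell, population count for cell)
    (0.65, 120_000),  # e.g., women under 45
    (0.55, 150_000),  # women 45 and over
    (0.48, 110_000),  # men under 45
    (0.42, 140_000),  # men 45 and over
]

total = sum(count for _, count in cells)
state_estimate = sum(p * count for p, count in cells) / total
print(round(state_estimate, 3))  # → 0.523
```

The multilevel regression is what makes the cell estimates stable despite small samples; the weighting above is what maps those estimates back to the actual population of each state.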

See here for further discussion, particularly on the subject of state-level opinion.

And, hey! We wrote this paper in 2005 on state-level opinion from national surveys.

And here’s the original paper, “Poststratification into many categories using hierarchical logistic regression.” It was published in 1997 in the journal Survey Methodology.

It takes a long long time for research to make its way from academia to journalism. Slowly but surely, though, it will happen.

Stan users group hits 2000 registrations


Of course, there are bound to be duplicate emails, dead emails, and people who picked up Stan, joined the list, and never came back. But still, that’s a lot of people who’ve expressed interest! It’s been an amazing ride that’s only going to get better as we learn more and continue to improve Stan’s speed and functionality.

What’s it good for?

If I do say so myself, our users list is a great source of practical applied data analysis advice both general and Stan-specific; topics range from installation help to advanced statistical computing, with visualization and Bayesian statistical theory and practice in between. I can’t even begin to enumerate how much I’ve learned reading it.

Will I get yelled at?

No, you won’t. We respect all of our users and try to be very friendly and helpful by answering all relevant statistical and computational questions that get posted. So don’t worry that you’re going to ask a “stupid” question and get scolded; I’m gently corrected on our list all the time.

As with all lists, we appreciate users putting in some work ahead of time in trying to formulate a question in a form we can answer (the biggest problem we face is not getting enough clues to answer a question). I often find the process of carefully formulating a query leads me to the answer myself—the list makes an excellent rubber ducky.

It’s a community

We’re happy to see that more users are answering each other’s questions these days, because the developers only have so much time in the day (though we are adding developers at the same rate as users—we’re over 20 now and always happy to take code contributions).

Check it out yourself

The full record is available online and you don’t even need to join to read it. If you’d like to check it out, see:

There’s also the developers’ list, which is where we discuss the software side of Stan internals; that one’s read-only unless you request permission to join the development discussion.

Exploration vs. exploitation tradeoff


Alon Levy (link from Palko) looks into “Hyperloop, a loopy intercity rail transit idea proposed by Tesla Motors’ Elon Musk, an entrepreneur who hopes to make a living some day building cars,” and writes:

There is a belief within American media that a successful person can succeed at anything. He (and it’s invariably he) is omnicompetent, and people who question him and laugh at his outlandish ideas will invariably fail and end up working for him. If he cares about something, it’s important; if he says something can be done, it can. The people who are already doing the same thing are peons and their opinions are to be discounted, since they are biased and he never is. . . .

And this:

In the US, people will treat any crank seriously if he has enough money or enough prowess in another field. A sufficiently rich person is surrounded by sycophants and stenographers who won’t check his numbers against anything.

No kidding.

The basic idea is that financial resources, political resources, and respect in the news media are somewhat fungible. $ can buy you political influence; $ and political influence can get you credibility with journalists, and if you are famous and respected you can use this to raise money.

Every time a rich guy like Musk lends his reputation to something like Hyperloop that doesn’t work, he loses a bit of credibility. Fair enough. And, without the billions behind it, journalists wouldn’t treat the project so respectfully. Which is logical reasoning too, as we’d expect just about any idea to be more likely to succeed if it’s backed with big money. Plus the reasonable inference that if this guy has made all this money in business, he must have some sense of what works and what doesn’t.

Each step in the reasoning makes sense. The trouble is that each step is merely probabilistic. Rich people can have bad ideas, certain plans can’t be implemented even with big financial resources, and so on. Even if Stephen Wolfram were to devote the entire resources of his company to the problem of trisecting the angle with compass and straightedge, he would not succeed (unless you’re allowed to slide the straightedge along the circle, but that’s cheating).

It’s a tough problem for journalists. Statistical reasoning would suggest giving the benefit of the doubt to business proposals coming from rich people, but this creates a perverse incentive where richies can float all sorts of ridiculous ideas, secure in the knowledge that most outlets will report their plans uncritically. Get enough of these, and on the margin these plans could even be worse than what you’d see from institutions with fewer financial resources.

Game theory eats decision theory once again.

P.S. Levy writes of his “specific problems are that Hyperloop a) made up the cost projections, b) has awful passenger comfort, c) has very little capacity, and d) lies about energy consumption of conventional HSR [high speed rail].” I won’t get into the details on that—you can follow the link and the comment thread there.

Hokey mas, indeed


Paul Alper writes:

The pictures which often accompany your blog are really “inside baseball” and I frequently fail to see the connection to the accompanying text. For example, when I click on today’s picture, I get:

[screenshot]

This happens to interest me because our granddaughter, who is now two years old and attends a Spanish-speaking daycare, was at our house the other week and after dinner we all did the hokey pokey. When the adults finished, she decided that she wasn’t, so she said, “Hokey mas.” But what is the link to statistical graphics?

A few days ago on your blog you had a picture of Alan Reed, aka Teddy Bergman:

[screenshot]

What is the (obvious?) connection to neanderthal genes, yours in particular or generally speaking?

My reply:

1. Today’s post, like the Hokey Pokey, concluded with the line, “that’s what it’s all about.”

2. Alan Reed was the voice of Fred Flintstone, who was somewhat Neanderthal, no?

Several postdoc positions in probabilistic modeling and machine learning in Aalto, Helsinki

This post is by Aki

In addition to the postdoc position I advertised recently, Aalto University and the University of Helsinki now have 20 more open postdoc and research fellow positions. Many of the positions are in probabilistic models and machine learning. You could work with me (I’m also part of HIIT), but I can also recommend all the other professors listed in the call.

For the full list of topics, call text and information about the application process, see here. You can also ask me for more information.

It’s not about normality, it’s all about reality

[Image: Screen Shot 2016-04-27 at 10.03.10 AM]

This is just a repost, with a snazzy and appropriate title, of our discussion from a few years ago on the assumptions of linear regression, from section 3.6 of my book with Jennifer.

In decreasing order of importance, these assumptions are:

1. Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .

2. Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .

3. Independence of errors. . . .

4. Equal variance of errors. . . .

5. Normality of errors. . . .

Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .

Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.
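That last point can be seen in a quick simulation. The sketch below (my own illustration, not from the book) fits least squares to data with heavily skewed, non-normal errors: the coefficient estimates are still accurate, which is why normality sits at the bottom of the list. Normality would matter if you wanted calibrated predictive intervals for individual data points.

```python
import numpy as np

# Simulate a linear relationship y = 1 + 2x with centered but heavily
# skewed (exponential) errors, then fit by ordinary least squares.
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 10, n)
errors = rng.exponential(scale=2.0, size=n) - 2.0  # mean zero, skewed
y = 1.0 + 2.0 * x + errors

# Despite the non-normal errors, the fitted coefficients land very
# close to the true values (slope 2, intercept 1).
slope, intercept = np.polyfit(x, y, 1)
```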

Polling in the 21st century: There ain’t no urn


David Rothschild writes:

The Washington Post (WaPo) utilized Survey Monkey (SM) to survey 74,886 registered voters in all 50 states on who they would vote for in the upcoming election. I am very excited about the work, because I am a huge proponent of advancing polling methodology, but the methodological explanation and data detail bring up some serious questions.

The WaPo explanation conflates method and mode: the mode was online (versus telephone) and the method was a random sample of SM users with raked demographics (versus probability-based sampling with raked demographics). So the first key difference is online versus telephone. The second key difference is that the sample is drawn from SM users rather than from all telephone users. A third possible difference is not employed: despite different modes and selection criteria, the WaPo/SM team used traditional analytics. This poll is more like traditional polls than the WaPo admits, in both its strengths and weaknesses.

Both online and telephone have limitations in who can be reached, but they have very similar coverage. As of September 2015, 89% of US adults were online in some context. Between cell phones and landlines, most people can be reached by telephone as well: about 90% have cell phones, and about half are cell-phone only. But that is actually the problem: a confusing number of people have both cell phones and landlines, and many cell phone owners no longer live near the area code of their phone. So while telephones may be able to reach slightly more American adults, that advantage is rapidly diminishing, and US adults without any internet access are very unlikely voters.

The bigger limitation is that the survey only reaches SM users, rather than all possible online users. I do not know the limitations of SM’s reach, but the WaPo article notes that they are drawing from about three million daily active users. Over the course of the 24-day study, a non-trivial cross-section of US adults may have interacted with SM. While I have no way of knowing how it is biased relative to the general population of online users, I assume it is relatively good at covering a cross-section of genders, ages, races, incomes, education levels, and geography.

So while the WaPo is right to call this a non-probability sample, in that we do not know the probability of any voter answering the survey, the same is true of the traditional phone method. We do not know the probability of non-telephone users being excluded from being called, especially with shifting cell-phone and landline coverage. On a personal note, I do not get called by traditional polls because my cell phone area code is from where my parents lived when I got my first cell phone 14 years ago. And we do not know all of the dimensions which drive the nonresponse of people called (somewhere between 1% and 10% of people answer the phone). In short, both methods are non-probability.

What is disappointing to me is that the WaPo/SM team then employed an analytical method that is optimized for probability-based telephone surveys: raking. Raking means matching the marginal demographics of the respondents to the Census on age, race, sex, education, and region. With 74,886 respondents and a goal of providing state-level results, the team should have used multilevel regression and post-stratification (MRP). MRP employs all of the respondents to create an estimate for any subgroup. It draws on the idea that white men from Kansas can help estimate how white men from Arkansas, or white people in general from New York, may vote. It is a powerful tool for non-probability surveys (regardless of the mode or method).
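For readers unfamiliar with raking: it is iterative proportional fitting, where cell weights are rescaled until the weighted margins match known population targets. Here is a minimal sketch with a made-up two-way (say, sex by age group) respondent table and invented margin targets, purely for illustration:

```python
import numpy as np

def rake(table, row_targets, col_targets, iters=100):
    """Iterative proportional fitting: rescale cell weights until the
    weighted row and column margins match the population targets."""
    w = table.astype(float).copy()
    for _ in range(iters):
        w *= (row_targets / w.sum(axis=1))[:, None]  # match row margins
        w *= (col_targets / w.sum(axis=0))[None, :]  # match column margins
    return w

# Hypothetical respondent counts by sex x age group, with known
# population margins that the sample fails to match.
sample = np.array([[30., 20.],
                   [10., 40.]])
raked = rake(sample,
             row_targets=np.array([50., 50.]),
             col_targets=np.array([45., 55.]))
```

Note that raking only matches the margins; unlike MRP, it cannot borrow strength across small subgroups such as individual states.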

The team did break from tradition and weighted on party identification in five states: Colorado, Florida, Georgia, Ohio, and Texas. Partisan nonresponse is a big problem, and party identification should be used both to stabilize the ups and downs of any given poll and to create a more accurate forecast of actual voting. But it should never be employed selectively within a single survey!

Finally, the WaPo/SM team notes that “The Post-SurveyMonkey poll employed a “non-probability” sample of respondents. While standard Washington Post surveys draw random samples of cellular and landline users to ensure every voter has a chance of being selected, the probability of any given voter being invited to a SurveyMonkey is unknown, and those who do not use the platform do not have a chance of being selected. A margin of sampling error is not calculated for SurveyMonkey results, since this is a statistical property only applicable to randomly sampled surveys.” As noted above, the true probability is never known, for the new SM survey or for any of the WaPo’s traditional polls. Empirically, the true margin of error for traditional polls is about twice as large as the stated margin of error, because the stated margin ignores coverage, nonresponse, measurement, and specification error.

I just want to add three things.

1. I appreciate the details given by Washington Post polling manager Scott Clement in his news article, “How The Washington Post-SurveyMonkey 50-state poll was conducted,” and I know that David Rothschild appreciates it too. Transparent discussion is great. David and I disagree with Clement on some things, and the way we can all move forward is to address this, which is facilitated by Clement’s step of posting that article.

2. Clement talks about “why The Post chose to use a non-probability sample . . .” But with nonresponse rates at 90%, every poll is a non-probability sample. There ain’t no urn.

3. Clement also writes, “A margin of sampling error is not calculated for SurveyMonkey results, since this is a statistical property only applicable to randomly sampled surveys.” I disagree. Again, no modern survey is even close to a random sample. Sampling error calculations are always based on assumptions which are actually false. That doesn’t mean it’s a bad idea to do such a calculation. Better still would be to give a margin of error that includes non-sampling error, for reasons discussed in the last paragraph of David’s note above. But, again, this is the case for random-digit-dial telephone surveys as well.
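To make the margin-of-error point concrete: the stated margin is the textbook formula 1.96·sqrt(p(1−p)/n), derived under the (false) assumption of a simple random sample, and the factor-of-2 inflation below is the rough empirical correction for total survey error mentioned in David’s note:

```python
import math

def sampling_moe(p, n):
    """Textbook 95% margin of sampling error for a proportion,
    assuming (counterfactually) a simple random sample."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

p, n = 0.5, 1000
stated = sampling_moe(p, n)  # about 0.031, the familiar "+/- 3 points"
realistic = 2 * stated       # rough total-error margin, about 6 points
```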

And one more thing: I agree with David that (a) they should be adjusting for party ID in all the states, not just 5 of them, and (b) if they want separate state estimates, they should use MRP.

Overall it looks like the analysis plan wasn’t fully thought through. Too bad they didn’t ask us for advice!