Skip to content

The Psychological Science stereotype paradox

teacups

Lee Jussim, Jarret Crawford, and Rachel Rubinstein just published a paper in Psychological Science that begins,

Are stereotypes accurate or inaccurate? We summarize evidence that stereotype accuracy is one of the largest and most replicable findings in social psychology. We address controversies in this literature, including the long-standing and continuing but unjustified emphasis on stereotype inaccuracy . . .

I haven’t read the paper in detail but I imagine that a claim that stereotypes are accurate will depend strongly on the definition of “accuracy.”

But what I really want to talk about is this paradox:

My stereotype about a Psychological Science article is that it is an exercise in noise mining, followed by hype. But this Psychological Science paper says that stereotypes are accurate. So if the article is true, then my stereotype is accurate, and the article is just hype, in which case stereotypes are not accurate, in which case the paper might actually be correct, in which case stereotypes might actually be accurate . . . now I’m getting dizzy!

Webinar: Introduction to Bayesian Data Analysis and Stan

This post is by Eric.

We are starting a series of free webinars about Stan, Bayesian inference, decision theory, and model building. The first webinar will be held on Tuesday, October 25 at 11:00 AM EDT. You can register here.

Stan is a free and open-source probabilistic programming language and Bayesian inference engine. In this talk, we will demonstrate the use of Stan for some small problems in sports ranking, nonlinear regression, mixture modeling, and decision analysis, to illustrate the general idea that Bayesian data analysis involves model building, model fitting, and model checking. One of our major motivations in building Stan is to efficiently fit complex models to data, and Stan has indeed been used for this purpose in social, biological, and physical sciences, engineering, and business. The purpose of the present webinar is to demonstrate using simple examples how one can directly specify and fit models in Stan and make logical decisions under uncertainty.

Advice on setting up audio for your podcast

Jennifer and I were getting ready to do our podcast, and in preparation we got some advice from Enrico Bertini and the Data Stories team:

1) Multitracking. The best way is to multitrack and have each person record locally (note: this is easier if you are in different rooms/locations). Multitracking gives you a lot of freedom in the postediting phase. You can fix when voice overlaps, remove various noises and utterances, adjust volume levels etc. If you are in the same room you can still multitrack but it’s more complex.

2) Microphone. Having good (even high-end) mics makes a huge difference. When you hear the difference between a good mic and your average iPhone earbuds it’s stunning! With good mics you sound like a pro, without you sound meh … Here you have many many options.
You can use a usb mic made for podcasting and plug it in your computer (Rode podcaster is great: http://www.rode.com/microphones/podcaster. I have Yeti also but it’s not as good).
You can buy a standalone recorder (we have a Zoom and we love it: https://www.zoom-na.com/products/field-video-recording/field-recording/zoom-h2n-handy-recorder).
You can buy high-end condenser mics and plug them in a mixer.

3) Recording device. Recording in your computer is fine. We record most of our sound using our mac and quicktime. Very easy and straightforward. When I use the Zoom I record directly in the mic since it is also a recorder. Recording with an iPhone is not good enough.

4) Remote communication. If you are located remotely and/or have a remote guest, you can (should) keep recording locally but you still have to communicate. We have used Skype or Hangout with mixed results. When there are too many people or someone has a slow network it’s a real pain. We are still struggling with this ourselves. Hangout seems to be a bit more reliable. One good thing with Skype is that you can record within it and make sure you always have a backup. Backups and redundancy are crucial. Things do go wrong sometime in very unexpected ways!

5) Noise. It’s important to reduce noise in your environment. Especially, turn phones down or in airplane mode, avoid interruptions and ambient noise (even birds can be a problem!). Sometime the sound coming from your headphones can also be picked up by your mic so you need to be careful.

6) Synchronization. When you have multiple tracks you have to find a way to sync them. We have a very low-fi trick. We ask our guest to count backward 3, 2, 1 and clap and put our headphones close to our pics (not sure how others have solved this problem).

7) Audio postproduction. There are tons of things that can be done after the recording. We have a fantastic person working for us who is a pro. I don’t know all the details of the filter he uses. But he does cut things down when we are too verbose or make mistakes. This is priceless.

I [Bertini] think the most important thing to know is if you are planning to be in the same room or not and if you are going to have guests. The set up can change considerably according to what kind of combination you have.

Should Jonah Lehrer be a junior Gladwell? Does he have any other options?

Remember Jonah Lehrer—that science writer from a few years back whose reputation was tarnished after some plagiarism and fabrication scandals? He’s been blogging—on science!

And he’s on to some of the usual suspects: Ellen Langer’s mindfulness (see here for the skeptical take) and—hey—“an important new paper [by] Kyla Haimovitz and Carol Dweck” (see here for background).

Also a post, “Can Tetris Help Us Cope With Traumatic Memories?” and another on the recently-questioned “ego-depletion effect.”

And lots more along these lines. It’s This Week in Psychological Science in blog form. Nothing yet on himmicanes or power pose, but just give him time.

Say what you want about Malcolm Gladwell, he at least tries to weigh the evidence and he’s willing to be skeptical. Lehrer goes all-in, every time.

It’s funny: they say you can’t scam a scammer, but it looks like Lehrer’s getting scammed, over and over again. This guy seems to be the perfect patsy for every Psychological Science press release that’s ever existed.

But how could he be so clueless? Perhaps he’s auditioning for the role of press agent: he’d like to write an inspirational “business book” with someone who does one of these experiments, so he’s getting into practice and promoting himself by writing these little posts. He’d rather write them as New York Times magazine articles or whatever but that path is (currently) blocked for him so he’s putting himself out there as best he can. From this perspective, Lehrer has every incentive to believe all these Psychological Science-style claims. It’s not that he’s made the affirmative decision to believe, it’s more that he gently turns off the critical part of his brain, the same way that a sports reporter might only focus on the good things about the home team.

Lehrer’s in a tough spot, though, as he doesn’t that much going for him. He’s a smooth writer, but there are other smooth writers out there. He can talk science, but he can’t get to any depth. He used to have excellent media contacts, but I suspect he’s burned most of those bridges. And there’s a lot of competition out there, lots of great science writers. So he’s in a tough spot. He might have to go out and get a real job at some point.

Some people are so easy to contact and some people aren’t.

I was reading Cowboys Full, James McManus’s entertaining history of poker (but way too much on the so-called World Series of Poker), and I skimmed the index to look up some of my favorite poker writers. Frank Wallace and David Spanier were both there but only got brief mentions in the text, I was disappointed to see. I guess McManus and I have different taste. Fair enough. I also looked up Patrick Marber, author of the wonderful poker-themed play, Dealer’s Choice. Marber was not in the index.

And this brings be to the subject of today’s post. Anyone who wants can reach me by email or even call me on the phone. That’s how it is with college teachers: we’re accessible, that’s part of our job. But authors, not so much. Even authors much more obscure than James McManus typically don’t make themselves easy to contact. Maybe they don’t want to be bothered, maybe it’s just tradition, I dunno. But I think they’re missing out. McManus does seem to have a twitter account, but that doesn’t work for me. I just want to send the guy an email.

People can, of course, duck emails. I tried a couple times to contact Paul Gertler about the effect of the statistical significance filter on his claimed effects of early childhood intervention, and I have it on good authority that he received my email but just chose not to respond, I assume feeling that his life would simpler if he were not to have to worry about that particular statistical bias. And of course famous people have to guard their time, so I usually don’t get responses from the likes of Paul Krugman, Malcolm Gladwell, David Brooks, or Nate Silver. (That last one is particularly ironic given that people are always asking me for Nate’s email. I typically give them the email but warn them that Nate might not respond.)

Anyway, I have no problem at all with famous people not returning my emails—if they responded to all the emails they received from statistics professors, they’d probably have no time for anything else, and they’d be reduced to a Stallman-esque existence.

And, while I disapprove of the likes of Gertler not responding to emails of mine making critical comments on their work, hey, that’s his choice: if he doesn’t want to improve his statistics, there’s nothing much I can do about it.

But it’s too bad it’s not so easy to directly reach people like James McManus, or Thomas Mallon, or George Pelacanos. I think they’d be interested in the stories I would share with them.

P.S. In his book, McManus does go overboard in a few places, including his idealization of Barack Obama (all too consistent with the publication date of 2009) and this bit of sub-Nicholas-Wade theorizing:

Screen Shot 2015-08-15 at 10.42.42 PM

Screen Shot 2015-08-15 at 10.42.53 PM

Aahhhh, so that’s what it was like back in the old days! Good that we have an old-timer like James McManus to remember it for us.

But that’s just a minor issue. Overall, I like the book. All of us are products of our times, so it’s no big deal if a book has a few false notes like this.

Opportunity for publishing preregistered analyses of the 2016 American National Election Study

Brendan Nyhan writes:

Have you heard about the Election Research Preacceptance Competition that Skip Lupia and I are organizing to promote preaccepted articles? Details here: http://www.erpc2016.com. A number of top journals have agreed to consider preaccepted articles that include data from the ANES. Authors who publish qualifying entries can win a $2,000 prize. We’re eager to let people know about the opportunity and to promote better scientific publishing practices

The page in question is titled, “November 8, 2016: what really happened,” which reminded me of this election-night post of mine from eight years ago entitled, “Election 2008: what really happened.”

I could be wrong, but I’m guessing that a post such as mine would not have much of a chance in this competition, which is designed to reward “an article in which the hypotheses and design were registered before the data were publicly available.” The idea is that the proposed analyses would be performed on the 2016 American National Election Study, data from which will be released in Apr 2017. I suppose it would be possible to take a post such as mine and come up with hypotheses that could be tested using ANES data but it wouldn’t be so natural.

So, I think this project of Nyhan and Lupia has some of the strengths and weaknesses of aspects of the replication movement in science.

The strengths are that the competition’s rules are transparent and roughly equally open to all, in a way that, for example, publication in PPNAS does not seem to be. Also, of course, preregistration minimizes researcher degrees of freedom which allows p-values to be more interpretable.

The minuses are the connection to the existing system of journals; the framing as a competition; the restriction to a single relatively small dataset; and a hypothesis-testing framework which (a) points toward confirmation rather than discovery, and (b) would seem to favor narrow inquiries rather than broader investigations. Again, I’m concerned that my own “Election 2008: what really happened” story wouldn’t fit into this framework.

Overall I think this project of Nyhan and Lupia is a good idea and I’m not complaining about it at all. Sure, it’s limited, but it’s only one of may opportunities out there. Researchers who want to test specific hypotheses can enter this competition. Hypothesis testing isn’t my thing, but nothing’s stopping me or others from posting whatever we do on blogs, Arxiv, SSRN, etc. There’s room for lots of approaches, and, at the very least, this effort should encourage some researchers to use ANES more intensively than they otherwise would have.

“Marginally Significant Effects as Evidence for Hypotheses: Changing Attitudes Over Four Decades”

Kevin Lewis sends along this article by Laura Pritschet, Derek Powell, and Zachary Horne, who write:

Some effects are statistically significant. Other effects do not reach the threshold of statistical significance and are sometimes described as “marginally significant” or as “approaching significance.” Although the concept of marginal significance is widely deployed in academic psychology, there has been very little systematic examination of psychologists’ attitudes toward these effects. Here, we report an observational study in which we investigated psychologists’ attitudes concerning marginal significance by examining their language in over 1,500 articles published in top-tier cognitive, developmental, and social psychology journals. We observed a large change over the course of four decades in psychologists’ tendency to describe a p value as marginally significant, and overall rates of use appear to differ across subfields. We discuss possible explanations for these findings, as well as their implications for psychological research.

The common practice of dividing data comparisons into categories based on significance levels is terrible, but it happens all the time (as discussed, for example, in this recent comment thread about a 2016 Psychological Science paper by Haimowitz and Dweck), so it’s worth examining the prevalence of this error, as Pritschet et al. do.

Let me first briefly explain why categorizing based on p-values is is such a bad idea. Consider, for example, this division: “really significant” for p less than .01, “significant” for p less than .05, “marginally significant” for p less than .1, and “not at all significant” otherwise. And consider some typical p-values in these ranges: say, p=.005, p=.03, p=.08, and p=.2. Now translate these two-sided p-values back into z-scores, which we can do in R via 1 – qnorm(c(.005, .03, .08, .2)/2), yielding the z-scores 2.8, 2.2, 1.8, 1.3. The seemingly yawning gap in p-values comparing the “not at all significant” p-value of .2 to the “really significant” p-value of .005, is only 1.5. Indeed, if you had two independent experiments with these z-scores and with equal standard errors and you wanted to compare them, you’d get a difference of 1.5 with a standard error of 1.4—completely consistent with noise. This is the point that Hal Stern and I made in our paper from a few years back.

From a statistical point of view, the trouble with using the p-value as a data summary is that the p-value is only interpretable in the context of the null hypothesis of zero effect—and in psychology studies, nobody’s interested in the null hypothesis. Indeed, once you see comparisons between large, marginal, and small effects, the null hypothesis is irrelevant, as you want to be comparing effect sizes.

From a psychological point of view, the trouble with using the p-value as a data summary is that this is a kind of deterministic thinking, an attempt to convert real uncertainty into firm statements that are just not possible (or, as we would say now, just not replicable).

P.S. Related is this paper from a few years ago, “Erroneous analyses of interactions in neuroscience: a problem of significance,” by Sander Nieuwenhuis, Birte Forstmann, and E. J. Wagenmakers, who wrote:

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience. We discuss scenarios in which the erroneous procedure is particularly beguiling.

It’s a problem.

P.S. Amusingly enough, just a couple days ago we discussed an abstract that had a “marginal significant” in it.

Is it fair to use Bayesian reasoning to convict someone of a crime?

Ethan Bolker sends along this news article from the Boston Globe:

If it doesn’t acquit, it must fit

Judges and juries are only human, and as such, their brains tend to see patterns, even if the evidence isn’t all there. In a new study, researchers first presented people with pieces of evidence (a confession, an eyewitness identification, an alibi, a motive) in separate contexts. Then, similar pieces of evidence were presented together in the context of a single criminal case. Although judgments of the probative value of each piece of evidence were uncorrelated when considered separately, their probative value became significantly correlated when considered together. In other words, perceiving one piece of evidence as confirming guilt caused other pieces of evidence to become more confirming of guilt too. For example, among people who ended up reaching a guilty verdict, the same kind of confession was considered more voluntary when considered alongside other evidence than when it had been considered in isolation.

Greenspan, R. & Scurich, The Interdependence of Perceived Confession Voluntariness and Case Evidence, N. Law and Human Behavior (forthcoming).

Bolker writes:

The tone suggests that this observation—“perceiving one piece of evidence as confirming guilt caused other pieces of evidence to become more confirming of guilt too”—reflects an inability to weigh evidence, but to me it makes Bayesian sense: each piece influences the priors for the others.

I agree. It seems like a judicial example of a familiar tension from statistical analysis: when do we want to simply be summarizing the data at hand, and when do we want to “collapse the wave function,” as it were, and perform inference for underlying parameters.

Tenure Track Professor in Machine Learning, Aalto University, Finland

Posted by Aki.

I promise that next time I’ll post something else than a job advertisement, but before that here’s another great opportunity to join Aalto Univeristy where I work, too.

“We are looking for a professor to either further strengthen our strong research fields, with keywords including statistical machine learning, probabilistic modelling, Bayesian inference, kernel methods, computational statistics, or complementing them with deep learning. Collaboration with other fields is welcome, with local opportunities both at Aalto and University of Helsinki. A joint appointment with the Helsinki Institute for Information Technology HIIT a joint research centre with University of Helsinki, can be negotiated.”

Naturally, I would hope we could someone who is also interested in probabilistic modelling and Bayesian inference :)

See more details here

Applying the “If there’s no report you can read, there’s no study” principle in real time

catwash

So, I was on the website of the New York Times and came across this story by Donna de la Cruz:

Opioids May Interfere With Parenting Instincts, Study Finds . . .

Researchers at the Perelman School of Medicine at the University of Pennsylvania scanned the brains of 47 men and women before and after they underwent treatment for opioid dependence. While in the scanner, the study subjects looked at various images of babies, and the researchers measured the brain’s response. . . . Sometimes the babies’ features were exaggerated to make them even more adorable; in others, the chubby cheeks and big eyes were reduced, making the faces less appealing. . . .

Compared with the brains of healthy people, the brains of people with opioid dependence didn’t produce strong responses to the cute baby pictures. But once the opioid-dependent people received a drug called naltrexone, which blocks the effects of opioids, their brains produced a more normal response. . . .

Interesting, and full credit to the Times for inserting the qualifier “may” into the headline.

Anyway, the article continues:

The study, among the first to look at the effects of opioid dependence and how its treatment affects social cognition, was presented last month at the European College of Neuropsychopharmacology Congress in Vienna.

OK, where’s the research paper? Where are the data? The Times article provides a link which leads to a short abstract but no paper. Here’s the key paragraph from the abstract:

Forty-seven opioid dependent patients and 25 controls underwent two functional magnetic resonance imaging sessions, approximately ten days apart, while viewing infant portraits with high and low baby schema content and rating them for cuteness. Immediately after the first session, patients received an injection of extended-release naltrexone, while controls received no treatment. The repeated-measures ANOVA revealed a marginal significant main effect of Group [F(1,42) = 3.90, p = 0.06] with greater ΔRating (i.e. RatingHigh – RatingLow) in the patient group, but no main effects of Session [F(1,42) = 0.00, p = 0.99] or Gender [F(1,42) = 2.00, p = 0.17]. Among patients, self-reported craving was significantly reduced [F(1,24) = 45.76, p < 0.001] after the injection (i.e. On-XRNTX), but there was no gender difference [F(1,24) = 3.15, p = 0.09]. Whole brain analysis of variance showed Gender by Group by Session interaction in the ventral striatum. Brain responses increased in female patients and decreased in male patients across sessions, while the pattern was reversed in the controls. We found that the behavioral effect of baby schema, indexed by “cuteness” ratings, was present across all participant categories, without significant differences between patients and controls, genders or sessions. The pattern of the brain response to baby schema includes insulae, inferior frontal gyri, MPFC and the parietal cortex, in addition to the ventral striatum, caudate and fusiform gurus reported by Glocker et al. (2009) [4] in healthy nulliparous women.

I can’t quite follow what they did, but of course the phrase “marginal significant” sets off an alarm—not because I’m a “p less than .05” purist but because it makes me think of forking paths. There’s also the issue of assigned cuteness or rated cuteness, interactions with sex and maybe other patient-level characteristics, the difference between significant and non-significant, not to mention options in the outcome measure (what was referred to in the news article as “strong responses” and whatever zillion degrees of freedom were available from the MRI data.

This is not to say that the conclusions of the study are wrong, just that I have no idea. Just as I have no idea about the gay-gene study, which was one of our original inspirations for the principle, “If there’s no report you can read, there’s no study.”

I googled the title of the paper, *Sustained opioid antagonism increases striatal sensitivity to baby schema in opioid dependent women*, but all I could find was that abstract and an unsigned press release from 19 Sept on a website called MedicalXpress.

Again, I have no idea if this study’s claims are correct, if they have good evidence for their claims, or if the study is useful in some way. (These are three different questions!) In any case, the topic is important and I have no problem with the Times writing about research in this area. But . . . if there’s no report you can read, there’s no study. It’s not about whether it’s peer-reviewed. In this case, there’s nothing to review. An abstract and a press release just don’t cut it.

I know nothing about this research area, and the people who did this project may be doing wonderful work. I’m sure that at some point they’ll write a paper that people can read, and at that point, there’s something to report on.

Stan case studies!

screen-shot-2016-10-13-at-11-18-47-am

In the spirit of reproducible research, we (that is, Bob*) set up this beautiful page of Stan case studies.

Check it out.


* Bob here. Michael set the site up, I set this page up, and lots of people have contributed case studies and we’re always looking for more to publish.

Transparency, replications, and publication

Bob Reed responded to my recent Retraction Watch article (where I argued that corrections and retractions are not a feasible solution to the problem of flawed science, because there are so many fatally flawed papers out there and retraction or correction is such a long, drawn-out process) with a post on openness, data transparency, and replication. He writes:

A recent survey article by Duvendack et al. report that, of 333 journals categorized as “economics journals” by Thompson Reuter’s Journal Citation Reports, 27, or a little more than 8 percent, regularly published data and code to accompany empirical research studies. As some of these journals are exclusively theory journals, the effective rate is somewhat higher.

Noteworthy is that many of these journals only recently instituted a policy of publishing data and code. So while one can argue whether the glass is, say, 20 percent full or 80 percent empty, the fact is that the glass used to contain virtually nothing. That is progress.

But making data more “open” does not, by itself, address the problem of scientific unreliability. Researchers have to be motivated to go through these data, examine them carefully, and determine if they are sufficient to support the claims of the original study. Further, they need to have an avenue to publicize their findings in a way that informs the literature. . . .

Without an outlet to publish their findings, researchers will be unmotivated to spend substantial effort re-analysing other researchers’ data. Or to put it differently, the open science/data sharing movement only addresses the supply side of the scientific market. Unless the demand side is addressed, these efforts are unlikely to be successful in providing a solution to the problem of scientific unreliability.

Reed concludes:

The irony is this: The problem has been identified. There is a solution. The pieces are all there. But in the end, the gatekeepers of scientific findings, the journals, need to open up space to allow science to be self-correcting.

Could be. It could also be that the journals become less important. In statistics and political science, I see journals as still being very important in determining academic careers, but not so important any more for the dissemination of knowledge. There’s just so much being published and so many places to look for things. Econ might be different because there are still a few journals that are recognized to be the top. From the other direction, the CS publication system is completely broken because all the conferences are flooded with papers, it’s just endless layers of hype. CS, on the other hand, is doing just fine. At this point the publication process just seems to be going in parallel with the research process, not really doing anything very useful.

Mister P can solve problems with survey weighting

It’s tough being a blogger who’s expected to respond immediately to topics in his area of expertise.

For example, here’s Scott “fraac” Adams posting on 8 Oct 2016, post titled “Why Does This Happen on My Vacation? (The Trump Tapes).” After some careful reflection, Adams wrote, “My prediction of a 98% chance of Trump winning stays the same.” And that was all before the second presidential debate, which “Trump won bigly. This one wasn’t close.” I don’t know what Trump’s chance of winning is now. Maybe 99%. Or 108%.

That’s fine. When Gregg Easterbrook made silly political prognostications, I was annoyed, because he purported to be a legitimate political writer. Adams has never claimed to be anything but an entertainer and, by golly, he continues to perform well in that regard. So, yes, I laugh at Adams, but I don’t see why he’d mind that. He is a humorist.

What interested me about Adams’s post of 8 Oct was not so much his particular opinions—Adams’s judgments on electoral politics are about as well-founded as my takes on cartooning—but rather his apparent attitude that he had a duty to his readers to share his thoughts, right away. The whole thing had a pleasantly retro feeling; it brought me back to the golden age of blogging, back around 2002 and the “warbloggers” who, whatever their qualifications, expressed such feelings of urgency about each political and military issue as it arose.

Anyway, that’s all background, and I thought of it all only because a similar thing happened to me today.

The real post starts here

Regular readers know that I’ve been taking a break from blogging—wow, it’s been over two months now—except for the occasional topical item that just can’t wait. And today something came that just couldn’t wait.

Several people pointed me to this news article by Nate Cohn with the delightful title, “How One 19-Year-Old Illinois Man Is Distorting National Polling Averages”:

There is a 19-year-old black man in Illinois who has no idea of the role he is playing in this election. . . .

He’s a panelist on the U.S.C. Dornsife/Los Angeles Times Daybreak poll, which has emerged as the biggest polling outlier of the presidential campaign. Despite falling behind by double digits in some national surveys, Mr. Trump has generally led in the U.S.C./LAT poll. . . .

Our Trump-supporting friend in Illinois is a surprisingly big part of the reason. In some polls, he’s weighted as much as 30 times more than the average respondent . . . Alone, he has been enough to put Mr. Trump in double digits of support among black voters. . . .

Cohn gives a solid exposition of how this happens: When you do a survey, the sample won’t quite match the population, and survey researchers use adjustments to adjust for known differences between sample and population. In particular, young black men tend to be underrepresented in surveys, compared to the general population, hence the few respondents in this demographic group need to be correspondingly upweighted. If there’s just one guy in the cell, he might have to get a really big weight, and Cohn identifies this as a key problem in the adjustment, that the survey is using weighting cells that are too small, hence they get very noisy adjustments. In this case, the noise manifests itself as big swings in this USC/LAT poll depending on whether or not this one man is in the sample.

There’s also an issue of adjusting for recalled vote in the previous presidential election but I’ll set that aside for now.

Here’s Cohn on the problems with the big survey weights:

In general, the choice in “trimming” weights [or using coarser weighting cells] is between bias and variance in the results of the poll. If you trim the weights [or use coarser weighting cells], your sample will be biased — it might not include enough of the voters who tend to be underrepresented. If you don’t trim the weights, a few heavily weighted respondents could have the power to sway the survey. . . .

By design, the U.S.C./LAT poll is stuck with the respondents it has. If it had a slightly too Republican sample from the start — and it seems it did, regardless of weighting — there was little it could do about it.

This is fine for what it is, conditional on the assumption that survey researchers are required to only use classical weighting methods. But there is no such requirement! We can now use Mister P.

Here’s a recent article in the International Journal of Forecasting describing how we used MRP for the Xbox poll. Here’s a longer article in the American Journal of Political Science with more technical details. Here’s MRP in the New York Times back in 2009! And here’s MRP in a Nate Cohn article last month in the Times.

Mister P is not magic; of course if your survey has too many Clinton supporters or too many Trump supporters, compared to what you’d expect based on their demographics, then you’ll get the wrong answer. No way around that. But MRP will automatically give the appropriate weight to single observations.

Two issues arise. First, there’s setting up the regression model. The usual plan would be logistic regression with predictors for sex*ethnicity and age*education. We don’t usually see sex*ethnicity*age. This one guy in the survey would influence all these coefficients—but, again, it’s just one survey respondent so the influence shouldn’t be large, especially assuming you use some sort of informative prior to avoid the blow-up you’d get if you had zero African-American Trump supporters in your sample. Second, poststratification. There you’ll need some estimate of the demographic composition of the electorate. But you’d need such an estimate to do weighting, too. I assume the survey organization’s already on top of this one.

So, yeah, we pretty much already know how to handle these problems. That said, there’s some research to be done in easing the transition from classical survey weighting to a modern MRP approach. I addressed some of these challenges in my 2007 paper, Struggles with Survey Weighting and Regression Modeling, but I think a clearer roadmap is needed. We’re working on it.

P.S. Someone forwarded me some comments on a listserv, posted by Arie Kapteyn, Director, USC Dornsife Center for Economic and Social Research:

When designing our USC/LAT poll we have strived for maximal transparency so that indeed anyone who has registered to use our data can verify every step we have taken.

The weights we use to make sure our sample is representative of the U.S. population do result in underrepresented individuals in the sample with a higher weight than those who are in overrepresented groups. In general, one has to make a decision whether to trim weights so that the factor for any individual will not exceed a certain value. However, trimming weights comes with a trade-off, in that it may not be possible to adequately balance the overall sample after trimming. In this poll, we made the decision that we would not trim the weights to ensure that our overall sample would be representative of, for example, young people and African Americans. The result is that a few individuals from groups such as those who are less represented in polling samples and thus have higher weighting factors, can shift the subgroup graphs when they participate. However, they contribute to an unbiased (but possibly noisier) estimate of the outcomes for the overall population.

Our confidence intervals (the grey zones) take into account the effect of weights. So if someone with a big weight participates the confidence interval tends to go up. One can see this very clearly in the graph for African Americans. Essentially, whenever the line for Trump improved, the grey band widened substantially. More generally, the grey band implies a confidence interval of some 30 percentage points so we really should not base any firm conclusion on the changes in the graphs. Admittedly, the weight given to this one individual is very large, nevertheless excluding this individual would move the estimate of the popular vote by less than one percent. Admittedly a lot, but not something that fundamentally changes our forecast. And indeed a movement that falls well within the estimated confidence interval.

So the bottom line is: one should not over-interpret movements if confidence bands are wide.

OK, sure, don’t overinterpret movements if confidence bands are wide, but . . . (1) One concern expressed by Cohn was not just movements but also the estimate itself being consistently too high for the Republican candidate, and (2) With MRP, you can do better! No need to take these horrible noisy estimates and just throw up your hands. Using basic principles of statistics you can get better estimates.

It’s not about trimming the weights or not trimming the weights, it’s about getting a better estimate of national public opinion. The weights—or, more generally, the statistical adjustment—is a means to an end. And you have to keep that end in mind. Don’t get fixated on weighting.

P.P.S. Also, I guess I should also clarify this one point: The classical weighting estimate is not actually unbiased. Kapteyn was incorrect in that claim of unbiasedness.

Should you abandon that low-salt diet? (uh oh, it’s the Lancet!)

Bos_gaurus

Russ Lyons sends along this news article by Ian Johnston, who writes:

The prestigious medical journal The Lancet has been attacked for publishing an academic paper that claimed eating too little salt could increase the chance of dying from a heart attack or stroke.

Johnston summarizes the study:

Researchers from the Population Health Research Institute in Canada, studied more than 130,000 people from 49 different countries on six continents and concluded people should consume salt “in moderation”, rather than trying to reduce it in accordance with government guidelines across the world. . . .

The paper compared the health of people who tests showed had consumed low levels of sodium (up to three grams a day), medium amounts (four or five grams) and high levels (seven grams or more).

“Those participants with four to five grams of sodium excretion had the lowest risk” of death or suffering a “major cardiovascular disease event”, the researchers reported.

Among people who had high blood pressure, eating high and low levels of salt “were both associated with increased risk”. And for people without high blood pressure, consuming less than three grams a day was “associated with a significantly increased risk” – 11 per cent – of death or a serious cardiovascular event.

But there are critics:

Professor Francesco Cappuccio, head of the World Health Organization’s Collaborating Centre for Nutrition, attacked both the methods used in the study and the journal for agreeing to publish it.

“It is with disbelief that we should read such bad science published in The Lancet,” he said.

Professor Cappuccio said the article contained “a re-publication of data” used in another paper.

“The flaws that were extensively noted in their previous accounts are maintained and criticisms ignored,” he said.

The measurement of salt intake used by the researchers, he said, was “flawed” because it was done by testing urine samples given in the morning and then “extrapolated to 24-hour excretion” using an “inadequate” equation.

Professor Cappuccio also said the participants were “almost exclusively from clinical trials of sick people that have a very high risk of dying and are taking several medications”.

Now I don’t know what to think. I really don’t. I haven’t looked at the paper or the criticisms. If the study really is so flawed, though, I can’t say I’m surprised that it was published in the Lancet, a journal that’s famous for producing headline-grabbing papers that are later refuted, such as that Iraq survey, or that article claiming that gun laws could reduce firearm deaths by 200% or whatever, or, most notoriously, that paper by Andrew Wakefield [no link needed]. The Lancet may well publish some high-quality work but they do seem to have a weakness for bold claims and publicity.

I suppose the Lancet will publish a letter by Cappuccio or someone else with the criticisms? Perhaps a reader can keep us up to date here.

P.S. I’m sure I eat too much salt for my own good. I have a big jar of pretzels just sitting here in my office!

Gray graphs look pretty

Swupnil made this graph for a research meeting we had today:

9-18

It looks so cool. I think it’s the gray colors.

So here’s my advice to you: If you want to make your graphs look cool, use lots of gray.

My online talk this Friday noon for the Political Methods Colloquium: The Statistical Crisis in Science

Justin Esarey writes:

This Friday, October 14th at noon Eastern time, the International Methods Colloquium will inaugurate its Fall 2016 series of talks with a presentation by Andrew Gelman of Columbia University. Professor Gelman’s presentation is titled “The Statistical Crisis in Science.” The presentation will draw on these two papers:

“Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors”
http://www.stat.columbia.edu/~gelman/research/published/retropower_final.pdf

“Disagreements about the Strength of Evidence”
http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics12.pdf

To tune in to the presentation and participate in the discussion after the talk, visit http://www.methods-colloquium.com/ and click “Watch Now!” on the day of the talk. To register for the talk in advance, click here:

https://riceuniversity.zoom.us/webinar/register/6272124085322b4fdc2040ba88984b7b

The IMC uses Zoom, which is free to use for listeners and works on PCs, Macs, and iOS and Android tablets and phones. You can be a part of the talk from anywhere around the world with access to the Internet. The presentation and Q&A will last for a total of one hour.

A webinar isn’t as much fun as a live talk, but you can feel free to check this one out. According to Justin, it’s open to all and there will be lots of time for questions.

No, I don’t think the Super Bowl is lowering birth weights

Hoaxed_photo_of_the_Loch_Ness_monster

In a news article entitled, “Inequality might start before we’re even born,” Carolyn Johnson reports:

Another study, forthcoming in the Journal of Human Resources, analyzed birth outcomes in counties where the home team goes to the Super Bowl. . . . The researchers found that women in their first trimester whose home team played in the Super Bowl had measurably different birth outcomes than pregnant women whose teams did not go to the championship. There was a small, 4 percent increase in the probability of having a baby with low birth weight when the team won.

Garden. Of. Forking. Paths.

And this:

The magnitude of the change was tiny, but what was striking to Mansour [one of the authors of the study] was that it was detectable at all, in studying Super Bowl history from 1969 to 2004.

On the contrary, I’m not surprised at all. Given that researchers have detected ESP, and the effects of himmicanes, and power pose, and beauty and sex ratio, etc etc etc., I’m not surprised they can detect the effect of the Super Bowl. That’s the point of researcher degrees of freedom: you can detect anything.

As a special bonus, we get the difference between significant and non-significant:

The chances of having a low birth weight baby were a bit higher when the team won in an upset, suggesting that surprise may have helped fuel the effect. There was little effect when the team lost.

Really no end to the paths in this garden.

To her credit, Johnson does express some skepticism:

There’s a huge caveat to interpreting these studies. . . . That means researchers have to use natural experiments and existing data sets to explore their hypothesis. That leads to imaginative studies — like the Super Bowl one — but also means that they can’t be certain that it’s the prenatal experiences and not some other factor that explains the result.

But not nearly enough skepticism, as far as I’m concerned. To say “they can’t be certain that . . .” is to way overstate the evidence. If someone shows a blurry photo that purports to show the Loch Ness Monster, the appropriate response is not “they can’t be certain that it’s Nessie and not some other factor that explains the result.”

Sure, you can come up with a story in which the Super Bowl adds stress that increases the risk of low birth weight. Or a story in which the Super Bowl adds positive feelings that decrease that risk. Or a story about the relevance of any other sporting event, or any other publicized event, maybe a major TV special or an election or the report of shark attacks or prisoners on the loose or whatever else is happening to you this trimester. Talk is cheap, and so is “p less than .05.”

P.S. One more thing. I just noticed that the news headline was “Inequality might start before we’re even born.” What do they mean, “might”? Of course inequality starts before we’re even born. You don’t have to be George W. Bush or Edward M. Kennedy to know that! It’s fine to be concerned about inequality; no need to try to use shaky science to back up a claim that’s evident from lots of direct observations and a huge literature on social mobility.

Heimlich

Paul Alper writes:

Heimlich who is 96, was in the news lately, saving a woman, 87 years old, using the technique he invented.

So, off to Wikipedia:

Henry Judah Heimlich (born February 3, 1920) is an American thoracic surgeon widely credited as the inventor of the Heimlich maneuver, a technique of abdominal …

where I [Alper] discovered some strange things about his wife:

Heimlich’s wife coauthored a book on homeopathy with Maesimund B. Panos called “Homeopathic Medicine at Home.”[6] She also wrote What Your Doctor Won’t Tell You, which advocates chelation therapy and other alternative therapies.

and kids:

Heimlich and his wife had four children: Phil Heimlich, a former Cincinnati elected official turned conservative Christian radio talk-show host; Peter Heimlich, whose website describes what he alleges to be his father’s “wide-ranging, unseen 50-year history of fraud”

and this:

From the early 1980s, Heimlich advocated malariotherapy, the deliberate infection of a person with benign malaria in order to treat ailments such as cancer, Lyme disease and (more recently) HIV. As of 2009 the treatments were unsuccessful, and attracted criticism as both scientifically unsound and dangerous.[24] The United States Food and Drug Administration and Centers for Disease Control and Prevention have rejected malariotherapy and, along with health professionals and advocates for human rights, consider the practice “atrocious”

and this:

Joseph P. Ornato MD, Medical College of Virginia:
“Dr. Heimlich continues to distort, misquote, fabricate, and mislead his peers and the public regarding the scientific ‘evidence’ supporting the safety and efficacy of his (drowning) theory. Dr. Heimlich’s ‘evidence’ consists of unsubstantiated, poorly documented anecdotes. He cites letters to the editor (published in the Journal of the American Medical Association) as though they represented rigorous scientific study.” (August 1992 letter to the American Red Cross as quoted in the Cincinnati Enquirer, May 10, 1993)

Searching to see if you have ever expounded on Heimlich and his research, I found this:

Maksim Gelman, butcher of Brighton Beach: ‘I killed 6 more that no …
www.dailymail.co.uk/…/Maksim-Gelman-butcher-Brighton-B…
Jan 16, 2012 – Maksim Gelman, 24, who faces sentencing on Wednesday for murdering … communications and intelligence devices; Dr. Henry Heimlich, 96, …

but for the life of me, Heimlich is missing on that page. Is there any Gelman-Heimlich connection?

My brother Alan is a doctor and he saved a kid from choking once using the Heimlich maneuver. It really happened—I was there! The kid was choking on a piece of meat.

That’s all I’ve got for ya. But the material here is pretty amazing! It doesn’t seem like Heimlich ever let the truth get in the way of self-promotion!

Note to journalists: If there’s no report you can read, there’s no study

Blogger Echidne caught a study by the British organization Demos which was reported in Newsweek as “Half of Misogyny on Twitter Comes From Women.” But, as Echidne points out, there’s no study to report:

I [Echidne] then e-mailed Demos to ask for the url of the study. The rapid response I received (thanks, Demos!) told me — and here comes the fun part! — that there IS NO WRITTEN REPORT THAT PEOPLE COULD ANALYZE.

That is bullshit. Absolute bullshit. . . .

That there is no report does not imply that the results are incorrect, only that we cannot tell if they are correct or incorrect. But a written report is very important. The reason that researchers write their studies up is so that others can see what they did, how they did it, and also so that others can judge whether the study was done right or not.

I agree. I’m reminded of the gay gene tabloid hype, where results presented in a 10-minute conference talk were promoted all around the world, without there being any paper describing the data and methods.

Or the Wall Street Journal article that reported on a claimed survey of the super-rich for which no documentation was provided and which we have no reason to trust.

Hey, journalists: Don’t get fooled. Demand to see the study! I think it would work.

Next time someone sends you a press release and you’re thinking of running the story, first contact the organization and ask to see the written report. If they say they don’t have a report, it’s simple: Either don’t run the study, or run a report that is appropriately dripping with skepticism, including the phrase “for which the organization refused to supply a written report” as many times as possible.

“The Prose Factory: Literary Life in England Since 1918” and “The Windsor Faction”

It’s been D. J. Taylor week here. I expect that something like 0% of you (rounding to the nearest percentage point) have heard of D. J. Taylor, and that’s ok. He’s an English literary critic. Several years ago I picked up a copy of his book, A Vain Conceit: British Fiction in the 1980s, and I’ve been a fan ever since. Then last week I bought The Prose Factory: Literary Life in England Since 1918, and read most of it on the plane back from London. It’s been a long time since I’ve put my work aside and just relaxed and read like this, and I hadn’t remembered how long it can take to read a book. 7 hours on the plane and I still wasn’t quite done. I’d also picked up a copy of Taylor’s most recent novel, The Windsor Faction, and reading it has been a revealing experience too. I don’t always want to read a novel by a literary critic, but the novel was ok and it also gave me insight into the criticism.

All of this is a pretty obscure hobby so most of you can just stop reading at this point. I have no statistical insights coming. And, just to be clear, I don’t there’s anything particularly special or admirable about enjoying literary journalism; it just happens to be an interest of mine. If you don’t care about the topic, again, you can skip reading, just as Shravan skips the baseball posts and other people skip anything on politics.

I hate to even have to say all this but there’s so much reverse snobbery out there that I feel like I have to apologize for writing about literary journalism.

The Prose Factory

Anyway, I enjoyed Taylor’s book of literary history, but there was something a bit, umm, off about it. I wasn’t quite sure what it was about. I understood that it’s not primarily a work of literary criticism, so there’s not so much discussion of individual stories or novels or nonfiction prose. It’s more about what it was like to be a writer or critic during this period. But there’s next to nothing about large classes of professional prose writers, including writers of genre fiction (no Agatha Christie or John Le Carre, also nothing on the many less-successful writers in their fields), plain old newspaper writers, playwrights, etc. Nothing on Michael Frayn, for example, who did various of these things. Next to nothing on writers of popular or scholarly books on history. That’s fine—the topics Taylor does write on are mostly interesting—I’d just’ve appreciated some discussion of what he felt the book was really about, along with some consideration of all the prose he’d decided not to write about.

I did a quick search and found this review by Stefan Collini that covers Taylor’s book well. So if you’re interested I suggest you start with Collini’s review.

Also Terry Eagleton wrote this review which I really hated. Actually much of Eagleton’s review is excellent: he knows a lot and has all sorts of interesting thoughts and reflections. Actually, I recommend you read it. But it still irritated me because it seemed to be all about taking sides. Eagleton kept picking fights from nowhere. For example:

The Bloomsbury group, he [Taylor] admits, were a jealously exclusive elite, but so what? ‘It was their club: why should they be expected to let non-members through the door?’ he protests. Does this extend to banning Jews from golf clubs?

Where did that come from?

Or this:

Lord David Cecil’s ‘gentlemanly and rather old-fashioned scholarship’ is duly noted, but ‘this is not to disparage Lord David’s accomplishments, either as critic or biographer.’ Why not?

Why not? It’s right here, dude! Taylor’s very next phrase: “his life of Max Beerbohm (1964) is still the standard account.”

That’s good enough for me: if you write a book of biography and criticism that is still the standard account—over fifty years later!—that’s an accomplishment. At the end of this sentence, Taylor characterizes Cecil as “at best backward-looing and at worst deeply reactionary.”

As I said, Eagleton has a lot of interesting things to say. I just am so sick of his attitude: he hates this David Cecil so much that he can’t accept that maybe the guy wrote a good book once!

The Windsor Faction

The Windsor Faction is Taylor’s most recent novel. I enjoyed it. The plot was fine, the characters were . . . well, they were ok. They didn’t really come alive, I don’t think they had any agency. Where the book really stood out was in its atmosphere. It took place in London in 1939-1940 and it seemed so real, much more so than in many other historical novels I’ve read. Not just the scenery and decor, also the way the characters wrote and talked, how they used their language. Here I can really see the influence of Taylor’s immersion in the English literature and journalism of that period.

(Just to interject: I recently read Expo 58 by Jonathan Coe. Now he’s a real writer. It’s fine for craftsmen such as Taylor to write fiction—if nothing else, it’s gotta make him a better critic—but it’s good to be reminded by reading Coe what a real novelist can do. I assume (hope) Taylor would agree with me on this one.)

Anyway, to return to The Windsor Faction . . . Lots of other influences too: There was a certain clever trick that Taylor described in his nonfiction book, something that this novelist from the 1920s or 30s did—I can’t remember the name of the author or his books—this author would describe certain actions occurring in the background of the scene which would subtly and humorously advance the plot, a sort of cinematic trick. Anyway, Taylor does this in his own novel. It works fine but it was a bit disconcerting to know exactly where it came from. Also the book had lots of observation of social class that would fit in with Orwell’s writing. Taylor doesn’t quite have the bucket-of-water-falling-on-the-hapless-hero’s-head style of Orwell or Jonathan Coe, but the early scene of the guy working in the junk shop and ripping off his boss had a bit of that Aspidistra feeling. There were also some scenes that mix broad comedy with deep discomfort—in particular I’m thinking of a scene of an awkward party where one of the guests simultaneously attempts suicide and floods the bathroom—that remind me very much of Kingsley Amis, another favorite subject of Taylor. And, finally, in its general air of foreboding, the entire book could be taken as a gloss on the unforgettable final three paragraphs of Homage to Catalonia.

And I think Taylor would have no problem at all with us closing out this review with those paragraphs from Orwell’s classic:

I think we stayed three days in Banyuls. It was a strangely restless time. In this quiet fishing-town, remote from bombs, machine-guns, food-queues, propaganda, and intrigue, we ought to have felt profoundly relieved and thankful. We felt nothing of the kind. The things we had seen in Spain did not recede and fall into proportion now that we were away from them; instead they rushed back upon us and were far more vivid than before. We thought, talked, dreamed incessantly of Spain. For months past we had been telling ourselves that ‘when we get out of Spain’ we would go somewhere beside the Mediterranean and be quiet for a little while and perhaps do a little fishing, but now that we were here it was merely a bore and a disappointment. It was chilly weather, a persistent wind blew off the sea, the water was dull and choppy, round the harbour’s edge a scum of ashes, corks, and fish-guts bobbed against the stones. It sounds like lunacy, but the thing that both of us wanted was to be back in Spain. Though it could have done no good to anybody, might indeed have done serious harm, both of us wished that we had stayed to be imprisoned along with the others. I suppose I have failed to convey more than a little of what those months in Spain meant to me. I have recorded some of the outward events, but I cannot record the feeling they have left me with. It is all mixed up with sights, smells, and sounds that cannot be conveyed in writing: the smell of the trenches, the mountain dawns stretching away into inconceivable distances, the frosty crackle of bullets, the roar and glare of bombs; the clear cold light of the Barcelona mornings, and the stamp of boots in the barrack yard, back in December when people still believed in the revolution; and the food-queues and the red and black flags and the faces of Spanish militiamen; above all the faces of militiamen—men whom I knew in the line and who are now scattered Lord knows where, some killed in battle, some maimed, some in prison—most of them, I hope, still safe and sound. Good luck to them all; I hope they win their war and drive all the foreigners out of Spain, Germans, Russians, and Italians alike. This war, in which I played so ineffectual a part, has left me with memories that are mostly evil, and yet I do not wish that I had missed it. When you have had a glimpse of such a disaster as this—and however it ends the Spanish war will turn out to have been an appalling disaster, quite apart from the slaughter and physical suffering—the result is not necessarily disillusionment and cynicism. Curiously enough the whole experience has left me with not less but more belief in the decency of human beings. And I hope the account I have given is not too misleading. I believe that on such an issue as this no one is or can be completely truthful. It is difficult to be certain about anything except what you have seen with your own eyes, and consciously or unconsciously everyone writes as a partisan. In case I have not said this somewhere earlier in the book I will say it now: beware of my partisanship, my mistakes of fact, and the distortion inevitably caused by my having seen only one corner of events. And beware of exactly the same things when you read any other book on this period of the Spanish war.

Because of the feeling that we ought to be doing something, though actually there was nothing we could do, we left Banyuls earlier than we had intended. With every mile that you went northward France grew greener and softer. Away from the mountain and the vine, back to the meadow and the elm. When I had passed through Paris on my way to Spain it had seemed to me decayed and gloomy, very different from the Paris I had known eight years earlier, when living was cheap and Hitler was not heard of. Half the cafés I used to know were shut for lack of custom, and everyone was obsessed with the high cost of living and the fear of war. Now, after poor Spain, even Paris seemed gay and prosperous. And the Exhibition was in full swing, though we managed to avoid visiting it.

And then England—southern England, probably the sleekest landscape in the world. It is difficult when you pass that way, especially when you are peacefully recovering from sea-sickness with the plush cushions of a boat-train carriage under your bum, to believe that anything is really happening anywhere. Earthquakes in Japan, famines in China, revolutions in Mexico? Don’t worry, the milk will be on the doorstep tomorrow morning, the New Statesman will come out on Friday. The industrial towns were far away, a smudge of smoke and misery hidden by the curve of the earth’s surface. Down here it was still the England I had known in my childhood: the railway-cuttings smothered in wild flowers, the deep meadows where the great shining horses browse and meditate, the slow-moving streams bordered by willows, the green bosoms of the elms, the larkspurs in the cottage gardens; and then the huge peaceful wilderness of outer London, the barges on the miry river, the familiar streets, the posters telling of cricket matches and Royal weddings, the men in bowler hats, the pigeons in Trafalgar Square, the red buses, the blue policemen—all sleeping the deep, deep sleep of England, from which I sometimes fear that we shall never wake till we are jerked out of it by the roar of bombs.

THE END