
Modeling correlation of issue attitudes and partisanship within states

John Kuk writes:

I have taught myself multilevel modeling using your book and read your work with Delia Baldassarri about partisanship and issue alignment. I have a question related to these two works.

I want to find the level of correlation between partisanship and issues at the state level. Your work with Professor Baldassarri estimated the correlation at the national level, but I want to estimate it at the state level. The problem is that ANES is designed to be a nationally representative sample, so without using multilevel modeling, the estimated state level correlation is useless.

If I run a varying-intercept, varying-slope model with states as the group variable, I can use these estimates as somewhat comparable to correlation coefficients, though they are not the same. If I run a linear regression, I know the coefficient differs from the correlation coefficient by the ratio of the standard deviations of y and x. However, even though I understand that multilevel model coefficients are a weighted average of the group-specific and overall coefficients, I don’t know how to compare multilevel coefficients with correlation coefficients.

Given my situation, I have two questions.

1) Is it OK to use estimates from a varying-intercept, varying-slope model to compare the state level correlation of partisanship and issue positions?

2) If no, how can I derive a correlation coefficient to compare state level correlations?

My reply: Yes, I think that rather than modeling correlations, if you’re interested in partisanship and issue attitudes, it would make sense to simply regress issue attitudes on partisanship, with varying intercepts and slopes for states. The varying slopes are what you’re interested in. That said, I’m guessing that ANES won’t have nearly enough data to estimate varying slopes with any level of accuracy. You should pool several years of ANES or else use larger surveys such as Annenberg and Pew as we did for most of our Red State Blue State analysis.
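For concreteness, here’s a minimal sketch in R of the model I’m suggesting, using rstanarm. The data frame and variable names are my own placeholders, not the actual ANES coding:

library(rstanarm)

# Regress issue attitudes on partisanship with varying intercepts and slopes by
# state; the state-specific slopes are the quantity of interest. "anes" is a
# hypothetical data frame with several survey years pooled together.
fit <- stan_lmer(
  issue ~ partisanship + (1 + partisanship | state),
  data = anes, chains = 4, iter = 2000
)

print(fit, digits = 2)
# State-level slope deviations from the common slope; add fixef(fit)["partisanship"]
# to get each state's total slope.
state_slopes <- ranef(fit)$state[, "partisanship"]

With ANES-sized state samples, expect these slope estimates to be pooled heavily toward the national slope, which is the point of the multilevel model.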

Science reporters are getting the picture

Enrico Schaar points me to two news articles: What psychology’s crisis means for the future of science by Brian Resnick and These doctors want to fix a huge problem with drug trials. Why isn’t anyone listening? by Julia Belluz.

I don’t really have anything to add here beyond what I’ve blogged on these topics before. (I mean, sure, I could laugh at this quote, “The average person cannot evaluate a scientific finding for themselves any more easily than they can represent themselves in court or perform surgery on their own appendix,” which came from a psychology professor who is notorious for claiming that the replication rate in psychology is “statistically indistinguishable from 100%”—but I won’t go there.)

No, I just wanted to express pleasure that journalists are seeing the big picture here. At this point there’s a large cohort of science writers who’ve moved beyond the “Malcolm Gladwell” or “Freakonomics” model of scientist-as-hero, or the “David Brooks” model of believing anything that confirms your political views, or even the controversy-in-the-lab model, to a clearer view of science as a collective enterprise. We really do seem to be moving forward, even in the past five or ten years. Science reporters are no longer stenographers; they are active citizens of the scientific community.

Will youths who swill Red Bull become adult cocaine addicts?


The above is the question asked of me by Michael Stutzer, who writes:

I have attached an increasingly influential paper [“Effects of Adolescent Caffeine Consumption on Cocaine Sensitivity,” by Casey O’Neill, Sophia Levis, Drew Schreiner, Jose Amat, Steven Maier, and Ryan Bachtell] purporting to show the effects of caffeine use in adolescents (well, lab rats anyway) on biomarkers of rewards to cocaine use later in life. I’d like to see a Gelmanized analysis of the statistics. Note, for example, Figure 2, panels A and B. Figure 2, Panel A contrasts the later (adult) response to cocaine between 16 rats given caffeine as adolescents, vs. 15 rats who weren’t given caffeine as adolescents. Panel B contrasts the adult response to cocaine between 8 rats given caffeine only as adults, vs. 8 rats who weren’t given caffeine. The authors make much of the statistically significant difference in means in Panel A, and the apparent lack of statistical significance in Panel B, although the sign of the effect appears to still be there. But N=8 likely resulted in a much larger calculated standard error in Panel B than the N=16 did in Panel A. I wonder if the results would have held in a balanced design with N=16 rats used in both experiments, or with larger N in both. In addition, perhaps a Bonferroni correction should be made, because they could have just lumped the 24 caffeine-swilling (16 adolescent + 8 adult) rats together and tested the difference in the mean response between them and the 23 adolescent and adult rats who weren’t given caffeine. The authors may have done that correction when they contrasted the separate panels’ differences in means (they purport to do that in the other panels), but the legend doesn’t indicate it.

Because the paper is getting a lot of citations, some lab should try to replicate all this with larger sample sizes, and perturbations of the experimental procedures.

My reply:

I don’t have the equipment to replicate this one myself, so I’ll post your request here.

It’s hard for me to form any judgment about the paper because all these biology details are so technical, I just don’t have the energy to track everything that’s going on.

Just to look at some details, though: It’s funny how hard it is to find the total number of rats in the experiment, just by reading the paper. N is not in the abstract or in the Materials and Methods section. In the Results section I see that one cohort had 32 adolescent and 20 adult rats. So there must be other cohorts in the study?

I also find frustrating the convention that everything is expressed as a hypothesis test. The big big trouble with hypothesis tests is that the p-values are basically uninterpretable if the null hypothesis is false. Just for example:

[Screenshot from the paper reporting an F statistic and a p-value of .0001]

What’s the point of that sort of thing? If there’s a possibility there is no change, then, sure, I can see the merit of including the p-value. But when the p-value is .0001 . . . c’mon, who cares about the damn F statistic, just give me the numbers: what was the average fluid consumption per day for the different animals? Also I have a horrible feeling their F-test was not appropriate, cos I’m not clear on what those 377 and 429 are.
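And to make Stutzer’s sample-size point concrete: with the same observed difference and the same spread, the comparison with 8 rats per group has a standard error sqrt(2) times as large as the comparison with 16 per group, so it can easily miss the significance threshold even if the underlying effect is identical. A toy calculation (made-up numbers, not the paper’s data):

diff <- 10; s <- 12                      # same observed difference and sd in both panels
se16 <- s * sqrt(2 / 16)                 # two groups of 16
se8  <- s * sqrt(2 / 8)                  # two groups of 8
c(z16 = diff / se16, z8 = diff / se8)    # about 2.4 vs 1.7
c(p16 = 2 * pnorm(-diff / se16),         # about 0.02: "significant"
  p8  = 2 * pnorm(-diff / se8))          # about 0.10: "not significant" (normal approximation)

Which is just the difference-between-significant-and-not-significant problem all over again.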

I’d like to conclude by saying two things that may at first seem contradictory, but which are not:

1. This paper looks like it has lots of statistical errors.

2. I’m not trying to pick on the authors of this paper. And, despite the errors, they may be studying something real.

The lack of contradiction comes because, as I wrote last month, statistics is like basketball, or knitting. It’s hard. There’s no reason we should expect a paper written by some statistical amateurs to not have mistakes, any more than we’d expect the local high school team to play flawless basketball or some recreational knitter to be making flawless sweaters.

It does not give me joy to poke holes in the statistical analysis of a random journal article, any more than I’d want to complain about Aunt Edna’s sweaters or laugh at the antics of whoever is the gawky kid who’s playing center for the Hyenas this year. Everyone’s trying their best, and I respect that. To point out statistical errors in a published paper is not an exercise in “debunking,” it’s just something that I’ll notice, and it’s relevant to the extent that the paper’s conclusions lean on its statistical analysis.

And one reason we sometimes want brute-force preregistered replications is because then we don’t have to worry about so many statistical issues.

A little story of the Folk Theorem of Statistical Computing

I know I promised I wouldn’t blog, but this one is so clean and simple. And I already wrote it for the stan-users list anyway so it’s almost no effort to post it here too:

A colleague and I were working on a data analysis problem: a very simple overdispersed Poisson regression with a hierarchical, varying-intercept component. Ran it and it was super slow and not close to converging after 2000 iterations. Took a look and we found the problem: The predictor matrix of our regression lacked a constant term. The constant term comes in by default if you do glmer or rstanarm, but if you write it as X*beta in a Stan model, you have to remember to put in a column of 1’s in the X matrix (or to add a “mu” to the regression model), and we’d forgotten to do that.

Once we added in that constant term, the model (a) ran much faster (cos adaptation was smoother) and (b) converged fine after just 100 iterations.
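For anyone who hasn’t hit this one yet, here’s the shape of the fix in R (toy code, not our actual analysis; d, x1, and x2 are hypothetical):

# rstanarm and glmer() add the intercept through the formula interface, but a raw
# design matrix passed to a Stan program that computes X * beta needs the column
# of 1's added by hand (or a separate intercept parameter inside the Stan model).
X <- model.matrix(~ x1 + x2 - 1, data = d)   # oops: no constant column
X <- cbind(`(Intercept)` = 1, X)             # the one-line fix
stan_data <- list(N = nrow(X), K = ncol(X), X = X, y = d$y)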

Yet another instance of the folk theorem.

NBA in NYC


Jason Rosenfeld writes:

We’re holding the first ever NBA Basketball Analytics Hackathon on Saturday, September 24 at Terminal 23 in midtown Manhattan.

I can’t guarantee that Bugs will be there, but ya never know!

Are stereotypes statistically accurate?

Apparently there’s a debate in psychology about the accuracy of stereotypes.

Lin Bian and Andrei Cimpian write:

In his book Social Perception and Social Reality, Lee Jussim suggests that people’s beliefs about various groups (i.e., their stereotypes) are largely accurate. We unpack this claim using the distinction between generic and statistical beliefs—a distinction supported by extensive evidence in cognitive psychology, linguistics, and philosophy. Regardless of whether one understands stereotypes as generic or statistical beliefs about groups, skepticism remains about the rationality of social judgments.

Bian and Cimpian start by distinguishing what cognitive psychologists call “statistical” and “generic” beliefs about categories. This is pretty cool. Here they go:

Consider the statements below:

(1a) Fewer than 1% of mosquitoes carry the West Nile virus.
(1b) Mosquitoes carry the West Nile virus.
(2a) The majority of books are paperbacks.
(2b) Books are paperbacks.

Statements (1a) and (2a) are statistical: They express a belief about a certain number or proportion of the members of a category. Statements (1b) and (2b) are generic: They express a belief about the category as a whole rather than a specific number, quantity, or proportion. . . .

The fact that generic claims – and the beliefs they express – are not about numbers or quantities has a crucial consequence: It severs their truth conditions from the sort of statistical data that one could objectively measure in the world. . . .

This point is illustrated by the examples above. Both (1a) and (1b) are considered true: Although very few mosquitoes actually carry the West Nile virus, participants judge the generic claim (that mosquitoes, as a category, carry the West Nile virus) to be true as well. . . .

In contrast, even though (2a) is true – paperbacks are indeed very common – few believe that books, as a category, are paperbacks (i.e., [2b] is false). . . .

Bian and Cimpian continue:

These are not isolated examples. The literature is replete with instances of generic claims that either are judged true despite unimpressive statistical evidence or judged false despite overwhelming numbers . . . the rules that govern which generic beliefs are deemed true and which are deemed false are so baroque and so divorced from the statistical facts that many linguists and philosophers have spent the better part of 40 years debating them. . . .

And to return to stereotyping:

All of the foregoing applies to beliefs about social groups as well. . . . The distinction between statistical and generic beliefs is operative regardless of whether these beliefs concern mosquitoes, books, and other categories of non-human entities, or women, African Americans, Muslims, and other categories of humans.

And, the punch line:

Generic beliefs about social groups, just like other generic beliefs, are typically removed from the underlying statistics.

Statistics vs. stereotypes

Bian and Cimpian follow up with two examples:

More people hold the generic belief that Muslims are terrorists than hold the generic belief that Muslims are female. However, there are vastly more Muslims who are female than there are Muslims who are terrorists. . . .

Compare, for instance, “Asians are really good at math” and “Asians are right-handed.” Many more people would agree with the former generic claim than with the latter, while simultaneously being aware that the statistics go the opposite way.

OK, let’s unpack these. Here the statistics are so obviously counter to the stereotype that there has to be something else going on. In this case, I’d say the relevant statistical probabilities are not that Muslims are likely to be terrorists, or that Asians are more likely to be math whizzes, but that Muslims are more likely than other groups to be terrorists, or that Asians are more likely than other groups to be math whizzes. Maybe these statements aren’t correct either (I guess it would all depend on how all these things are defined), but that would seem to be the statistics to look at.

The stereotypes of a group, that is, would seem to be defined relative to other groups.

This does not tell the whole story either, though, as I’m sure that lots of stereotyping is muddled by what Kahneman and Tversky called availability bias.

Bian and Cimpian continue—you can read the whole thing—by discussing whether stereotypes should be considered as “generic beliefs” or “statistical beliefs.” As a statistician I’m not so comfortable with this distinction—I’m inclined to feel that generic beliefs are also a form of statistical belief, if the statistical question is framed the right way—but I do think they’re on to something in trying to pin down what people are thinking when they use stereotypes in their reasoning.

P.S. I sent the above to Susan, who added:

The issues you’re raising are ones that have been discussed a fair amount in the literature. Some of these ideas have been studied with experiments, but others have not (i.e., they’ve been discussed but not formally tested).

I agree that statistical info goes beyond just P(feature|category) (e.g., P(West Nile Virus|mosquito)). As I think you’re saying, one could also ask: what about distinctiveness, which is the opposite, P(category|feature) (e.g., P(mosquito|WNV))? Although distinctiveness can make a generic more acceptable, generics need not be distinctive (e.g., “Lions eat meat”; “Dogs are 4-legged”; “Cats have good hearing” are all non-distinctive but good generics). There are even properties that are relatively infrequent (i.e., true of less than half the category) and are non-distinctive, but make good generics (e.g., “Ducks lay eggs”; “Goats produce milk”; “Peacocks are colorful”). Finally, there are features that are frequent and distinctive but don’t (ordinarily) make good generics (e.g., “People are right-handed”; “Bees are sterile”; “Turtles die in infancy”).

I think that people are doing some assessment of how conceptually central a feature is, where centrality could be cued by any of a number of factors, including: prevalence, distinctiveness, danger/harm/threat (we have data on this as well — dangerous features make for better generics than benign features), and biological folk theories (e.g., features that only adults have are more likely to be in generics than features that only babies have — e.g., we say “Swans are beautiful”, not “Swans are ugly”).

This in turn gives me two thoughts:

1. Why we think swans are beautiful . . . that’s an interesting one, I’m sure there’s been lots written about that!

2. “People are right-handed” . . . that’s a great example. We are much more likely to be right-handed, compared to other animals (which generally have weak or no hand preference). And the vast majority of people are righties. Yet, saying “people are right-handed” does seem wrong. On the other hand, if 80% of cats, say, were right-handed, maybe we’d be ok saying “cats are right-handed.” I guess there must be some kind of Grice going on here too.

P.P.S. In comments, Chris Martin points to further responses by Jussim and others.

George Orwell on the Olympics


From 1945:

If you wanted to add to the vast fund of ill-will existing in the world at this moment, you could hardly do it better than by a series of football matches between Jews and Arabs, Germans and Czechs, Indians and British, Russians and Poles, and Italians and Jugoslavs, each match to be watched by a mixed audience of 100,000 spectators. I do not, of course, suggest that sport is one of the main causes of international rivalry; big-scale sport is itself, I think, merely another effect of the causes that have produced nationalism. Still, you do make things worse by sending forth a team of eleven men, labelled as national champions, to do battle against some rival team, and allowing it to be felt on all sides that whichever nation is defeated will “lose face”.

I’m a sports fan myself but I can see his point.

You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study


Nick Menzies writes:

I thought you might be interested in this case in our local news in Boston.

This is a case of alleged data manipulation as part of a grant proposal, with the (former) lead statistician as the whistleblower. It is a very large grant, so the stakes are high both in terms of reputation and money.

It seems that the courts have sided with the alleged data manipulators over the whistleblower.

I assume this does not look good for the statistician. Separate from the issue of whether data manipulation (or analytic searching) occurred, it makes clear that calling out malfeasance comes with huge professional risk.

The original link no longer works but a search yields the record of the case here. In a news report, Michael Macagnone writes:

Kenneth Jones has alleged that the lead author of the study used to justify the grants, Dr. Ronald Killiany, falsified data by remeasuring certain MRI scans. . . .

It was Jones’s responsibility as chief statistician in the 2001 study—an examination of whether brain scans could determine who would contract Alzheimer’s disease—to verify the reliability of data . . . he alleges he was fired from his position on the study for questioning the use of the information. . . .

I know nothing at all about this case.

Bootstrapping your posterior

Demetri Spanos writes:

I bumped into your paper with John Carlin, Beyond Power Calculations, and encountered your concept of the hypothetical replication of the point estimate. In my own work I have used a similarly structured (but for technical reasons, differently motivated) concept which I have informally been calling the “consensus posterior.”

Specifically, supposing a prior distribution for the true effect size D, I observe the experimental data and compute the posterior belief over values of D. Then I consider the “replication thought experiment” as follows:

1. Assuming my *posterior* as the distribution for true value of D …

2. Consider the posterior distribution that another researcher would obtain if they performed the identical experiment, assuming they hold the same prior but that the true distribution of D is as my posterior distribution

3. I then get a distribution over posterior distributions that I think of as the set of beliefs other researchers might hold, given they start from the same assumptions and have the same experimental capability that I do. Then I can calculate various point values from this “consensus” distribution. As I’m sure is clear to you, the consensus distribution is always much wider and less spiky than even the semi-smoothed distributions arising from a weakly-informative prior.

4. In my field (automated control systems, distributed computing, sensor fusion, and some elements of machine learning), I have found this provides a much more stable and well-conditioned signal for triggering automated sensor-driven behaviors than conventional techniques (even conventional Bayesian techniques). We are especially sensitive in our work to multimodal distributions and especially relative magnitudes of local peaks (because we often want to trigger a complex response or set of responses, not just get one average-case point estimate).

5. perhaps the most important

Now, having just read your paper, the variant I describe above seems “obvious,” so I wonder if you have thought on this subject before and could point me to any additional resources. Or, perhaps, if you see some fatal flaw that I have missed I would appreciate being told about that as well. “Works in the field” doesn’t necessarily mean “sound in principle,” and I would like to know if I am treading somewhere dangerous.

My reply: This could work, but much depends on how precise is your posterior distribution. If your data and your prior are weak, you could still have problems with your distribution being under-regularized (that is, being too far from zero).
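For readers who want to play with Spanos’s idea, here is a toy normal-normal simulation of the procedure he describes (my own setup; the prior, data scale, and sample size are arbitrary):

set.seed(1)
tau <- 1       # prior sd for the true effect D (prior mean 0)
sigma <- 2     # known data sd
n <- 20
y <- rnorm(n, 0.5, sigma)                      # the "original" experiment

posterior <- function(y) {                     # conjugate posterior for D
  prec <- 1 / tau^2 + length(y) / sigma^2
  c(mean = (sum(y) / sigma^2) / prec, sd = sqrt(1 / prec))
}
post <- posterior(y)

# Treat my posterior as the truth, simulate another researcher's data, and pool
# draws from each replicated posterior: the "distribution over posteriors."
consensus <- replicate(2000, {
  D_star <- rnorm(1, post["mean"], post["sd"])
  rep_post <- posterior(rnorm(n, D_star, sigma))
  rnorm(1, rep_post["mean"], rep_post["sd"])
})

post["sd"]     # sd of my posterior
sd(consensus)  # noticeably wider, as Spanos notes

If the data and prior are weak, post["sd"] is already large and the consensus distribution will be wider still, which is the under-regularization worry above.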

I know I said I wouldn’t blog for awhile, but this one was just too good to resist

Scott Adams endorsing the power pose:

Have you heard of the “victory pose.” It’s a way to change your body chemistry almost instantly by putting your hands above your head like you won something. That’s a striking example of how easy it is to manipulate your mood and thoughts by changing your body’s condition.

So easy to do, yet so hard to replicate . . .

Adams is a bit of a Gregg Easterbrook in that he alternates science speculation with savvy political commentary:

I’ve been watching the Democratic National Convention and wondering if this will be the first time in history that we see a candidate’s poll numbers plunge after a convention.

But even better is when he mixes the two together:

Based on what I know about the human body, and the way our thoughts regulate our hormones, the Democratic National Convention is probably lowering testosterone levels all over the country. Literally, not figuratively. And since testosterone is a feel-good chemical for men, I think the Democratic convention is making men feel less happy. They might not know why they feel less happy, but they will start to associate the low feeling with whatever they are looking at when it happens, i.e. Clinton.

Keep this up, Scott, and a Ted talk’s in your future!

You’re talking about Scott Adams. He’s not talking about you.

DrawMyData

Robert Grant writes:

This web page is something I constructed recently. You might find it useful for making artificial datasets that demonstrate a particular point for students. At any rate, if you have any feedback on it I’d be interested to hear it. I’ve tried to keep it as simple as possible but in due course, I’d like to add more descriptive stats, an optional third variable and maybe an option to make one categorical. It might also be interesting to get students to use it themselves to construct ‘pathological’ datasets where the stats would lead them astray, to help them understand how to spot problems and the importance of diagnostic plotting.

Shameless little bullies claim that published triathlon times don’t replicate


Paul Alper sends along this inspiring story of Julie Miller, a heroic triathlete who just wants to triathle in peace, but she keeps getting hassled by the replication police. Those shameless little bullies won’t let her just do her thing, instead they harp on technicalities like missing timing clips and crap like that. Who cares about missing timing clips? Her winning times were statistically significant, that’s what matters to me. And her recorded victories were peer reviewed. But, no, those second stringers can’t stop with their sniping.

I for one don’t think this running star should resist any calls for her to replicate her winning triathlon times. The replication rate of those things is statistically indistinguishable from 100%, after all! Track and field has become preoccupied with prevention and error detection—negative psychology—at the expense of exploration and discovery.

In fact, I’m thinking the American Statistical Association could give this lady the Founders Award, which hasn’t really had a worthy recipient since 2002.

On deck this week

Mon: Shameless little bullies claim that published triathlon times don’t replicate

Tues: Bootstrapping your posterior

Wed: You won’t be able to forget this one: Alleged data manipulation in NIH-funded Alzheimer’s study

Thurs: Are stereotypes statistically accurate?

Fri: Will youths who swill Red Bull become adult cocaine addicts?

Sat: Science reporters are getting the picture

Sun: Modeling correlation of issue attitudes and partisanship within states

Documented forking paths in the Competitive Reaction Time Task


Baruch Eitan writes:

This is some luscious garden of forking paths.

Indeed. Here’s what Malte Elson writes at the linked website:

The Competitive Reaction Time Task, sometimes also called the Taylor Aggression Paradigm (TAP), is one of the most commonly used tests to purportedly measure aggressive behavior in a laboratory environment. . . .

While the CRTT ostensibly measures how much unpleasant, or even harmful, noise a participant is willing to administer to a nonexistent confederate, that amount of noise can be extracted as a measure in myriad different ways using various combinations of volume and duration over one or more trials. There are currently 120 publications in which results are based on the CRTT, and they reported 147 different quantification strategies in total!

Elson continues:

This archive does not contain all variations of the CRTT, as some procedural differences are so substantial that their quantification strategies would be impossible to compare. . . . Given the number of different versions of the CRTT measure that can be extracted from its use in a study, it is very easy for a researcher to analyze several (or several dozen) versions of the CRTT outcome measures in a study, running hypothesis tests with one version of the measure after another until a version is found that produces the desired pattern of results. . . .
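To see how easy that is, here is a toy simulation in R (my own, not Elson’s data): two groups with no true difference, a made-up trial structure, and just six of the many possible quantification strategies, keeping whichever gives the smallest p-value.

set.seed(123)
one_study <- function(n = 30, trials = 25) {
  # Volume and duration settings on a 1-10 scale; no true group difference.
  vol <- matrix(sample(1:10, 2 * n * trials, replace = TRUE), nrow = 2 * n)
  dur <- matrix(sample(1:10, 2 * n * trials, replace = TRUE), nrow = 2 * n)
  grp <- rep(1:2, each = n)
  measures <- cbind(
    mean_vol   = rowMeans(vol),
    mean_dur   = rowMeans(dur),
    first_vol  = vol[, 1],
    vol_x_dur  = rowMeans(vol * dur),
    log_vd     = rowMeans(log(vol * dur)),
    high_count = rowSums(vol >= 8)
  )
  pvals <- apply(measures, 2, function(m) t.test(m[grp == 1], m[grp == 2])$p.value)
  min(pvals)   # report the "best" quantification
}
mean(replicate(2000, one_study()) < 0.05)   # comfortably above the nominal 0.05

And that is with six strategies chosen in advance; the archive documents 147, and the paths fork even more when the choice is made after seeing the data.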

Smooth poll aggregation using state-space modeling in Stan, from Jim Savage

Jim Savage writes:

I just saw your post on poll bounces; have been thinking the same myself. Why are the poll aggregators so jumpy about new polls?

Annoyed, I put together a poll aggregator that took a state-space approach to the unobserved preferences; nothing more than the 8 schools (14 polls?) example with a time-varying mean process and very small updates to the state.

One of the things I was thinking of was to use aggregated demographic polling data (from the polling companies’ cross-tabs) as a basis for estimating individual states for each demographic cell, and then performing post-stratification on those. Two benefits: a) having a time-varying state deals nicely with the decaying importance of old polls, and b) getting hold of unit-level polling data for MRP is tough if you’re me (perhaps tough if you’re you?).

Here’s the plot:

[Plot: the smoothed poll aggregate from the state-space model]

A full writeup, automated data scraping, model etc. is below.

Here’s the zipfile with everything.

My only comment is that you should be able to do even better—much better—by also including party ID among the predictors in the model, then fitting a state-space model to the underlying party ID proportions and poststratifying on it as well. That would fix some of the differential nonresponse stuff we’ve been talking about.
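To give a feel for the general approach, here is a toy version using rstan (my own simplification for illustration, not Jim’s actual model): a latent preference that follows a random walk with small daily updates, observed noisily by each poll.

library(rstan)

ss_code <- "
data {
  int<lower=1> T;                    // days
  int<lower=1> P;                    // polls
  int<lower=1, upper=T> day[P];      // day each poll was taken
  vector[P] y;                       // poll estimate (e.g., Clinton share)
  vector<lower=0>[P] sigma;          // poll sampling error
}
parameters {
  vector[T] mu;                      // latent preference by day
  real<lower=0> tau;                 // size of day-to-day updates
}
model {
  tau ~ normal(0, 0.01);             // 'very small updates to the state'
  mu[1] ~ normal(0.5, 0.1);
  mu[2:T] ~ normal(mu[1:(T - 1)], tau);
  y ~ normal(mu[day], sigma);        // each poll observes the state on its day
}
"

fit <- stan(model_code = ss_code, data = poll_data)  # poll_data is a hypothetical list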

And here’s Jim’s full writeup, which is included in the zipfile linked above.

“What can recent replication failures tell us about the theoretical commitments of psychology?”


Psychology/philosophy professor Stan Klein was motivated by our power pose discussion to send along this article which seems to me to be a worthy entry in what I’ve lately been calling “the literature of exasperation,” following in the tradition of Meehl etc.

I offer one minor correction. Klein writes, “I have no doubt that the complete reinstatement of experimental conditions will ensure a successful replication of a task’s outcome.” I think this statement is too optimistic. Redo the same experiment on the same people but re-randomize, and anything can happen. If the underlying effect is near zero (as I’d guess is the case, for example, in the power pose example), then there’s no reason to expect success even in an exact replication.

More to the point is Klein’s discussion of the nature of theorizing in psychology research. Near the end of his article he discusses the materialist doctrine “that reality, in its entirety, must be composed of quantifiable, material substances.”

That reminds me of one of the most ridiculous of many ridiculous hyped studies in the past few decades, a randomized experiment purporting to demonstrate the effectiveness of intercessory prayer (p=.04 after performing 3 hypothesis tests, not that anyone’s counting; Deb and I mention it in our Teaching Statistics book). What amazed me about this study—beyond the philosophically untenable (to me) idea that God is unable to interfere with the randomization but will go to the trouble of improving the health of the prayed-for people by some small amount, just enough to assure publication—was the effort the researchers put in to diminish any possible treatment effect.

It’s reasonable to think that prayer could help people in many ways, for example it is comforting to know that your friends and family care enough about your health to pray for it. But in this study they chose people to pray who had no connection to the people prayed for—and the latter group were not even told of the intervention. The experiment was explicitly designed to remove all but supernatural effects, somewhat in the manner that a magician elaborately demonstrates that there are no hidden wires, nothing hidden in the sleeves, etc. Similarly with Bargh’s embodied cognition study: the elderly words were slipped into the study so unobtrusively as to almost remove any chance they could have an effect.

I suppose if you tell participants to think about elderly people and then they walk slower, this is boring; it only reaches the status of noteworthy research if the treatment is imperceptible. Similarly for other bank-shot ideas such as the correlation between menstrual cycle and political attitudes. There seems to be something that pushes researchers to attenuate their treatments to zero, at which point they pull out the usual bag of tricks to attain statistical significance. It’s as if they were taking ESP research as a model. See discussion here on “piss-poor omnicausal social science.”

Klein’s paper, “The Unplanned Obsolescence of Psychological Science and an Argument for Its Revival”, is relevant to this discussion.

Don’t believe the bounce

[Graph: Clinton’s margin over Trump plotted against the Democratic advantage in party identification, across recent national polls]

Alan Abramowitz sent us the above graph, which shows the results from a series of recent national polls, for each plotting Hillary Clinton’s margin in support (that is, Clinton minus Trump in the vote-intention question) vs. the Democratic Party’s advantage in party identification (that is, percentage Democrat minus percentage Republican).

This is about as clear a pattern as you’ll ever see in social science: Swings in the polls are driven by swings in differential nonresponse. After the Republican convention, Trump supporters were stoked, and they were more likely to respond to surveys. After the Democratic convention, the reverse: Democrats are more likely to respond, driving Clinton up in the polls.

David Rothschild and I have the full story up at Slate:

You sort of know there is a convention bounce that you should sort of ignore, but why? What’s actually in a polling bump? The recent Republican National Convention featured conflict and controversy and one very dark acceptance speech—enlivened by some D-list celebrities (welcome back Chachi!)—but it was still enough to give nominee Donald Trump a big, if temporary, boost in many polls. This swing, which occurs predictably in election after election, is typically attributed to the persuasive power of the convention, with displays of party unity persuading partisans to vote for their candidate and cross-party appeals coaxing over independents and voters of the other party.

Recent research, however, suggests that swings in the polls can often be attributed not to changes in voter intention but in changing patterns of survey nonresponse: What seems like a big change in public opinion turns out to be little more than changes in the inclinations of Democrats and Republicans to respond to polls. We learned this from a study we performed [with Sharad Goel and Wei Wang] during the 2012 election campaign using surveys conducted on the Microsoft Xbox. . . .

Our Xbox study showed that very few respondents were changing their vote preferences—less than 2 percent during the final month of the campaign—and that most, fully two-thirds, of the apparent swings in the polls (for example, a big surge for Mitt Romney after the first debate) were explainable by swings in the percentages of Democrats and Republicans responding to the poll. This nonresponse is very loosely correlated with likeliness to vote but mainly reflects passing inclinations to participate in polling. . . . large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention. . . .

[See this paper, also with Doug Rivers, with more, including supporting information from other polls.]

We can apply these insights to the 2016 convention bounces. For example, Reuters/Ipsos showed a swing from a 15-point Clinton lead on July 14 to a 2-point Trump lead on July 27. Who was responding in these polls? The pre-convention survey saw 53 percent Democrats, 38 percent Republican, and the rest independent or supporters of other parties. The post-convention respondents looked much different, at 46 percent Democrat, 43 percent Republican. The 17-point swing in the horse-race gap came with a 12-point swing in party identification. Party identification is very stable, and there is no reason to expect any real swings during that period; thus, it seems that about two-thirds of the Clinton-Trump swing in the polls comes from changes in response rates. . . .
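To spell out the arithmetic in that last excerpt (a rough calculation, on the crude assumption that partisans vote their party, so a shift in the sample’s party-ID mix moves the reported margin roughly point for point):

margin_swing   <- 15 - (-2)              # Clinton +15 to Trump +2: 17 points
party_id_swing <- (53 - 38) - (46 - 43)  # D +15 to D +3: 12 points
party_id_swing / margin_swing            # about 0.7, i.e., roughly two-thirds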

Read the whole thing.

The political junkies among you have probably been seeing all sorts of graphs online showing polls and forecasts jumping up and down. These calculations typically don’t adjust for party identification (an idea we wrote about back in 2001, but without realizing the political implications that come from systematic, rather than random, variation in nonresponse) and thus can vastly overestimate swings in preferences.

The p-value is a random variable

Sam Behseta sends along this paper by Laura Lazzeroni, Ying Lu, and Ilana Belitskaya-Lévy, who write:

P values from identical experiments can differ greatly in a way that is surprising to many. The failure to appreciate this wide variability can lead researchers to expect, without adequate justification, that statistically significant findings will be replicated, only to be disappointed later.

I agree that the randomness of the p-value—the fact that it is a function of data and thus has a sampling distribution—is an important point that is not well understood. Indeed, I think that the z-transformation (the normal cdf, which takes a z-score and transforms it into a p-value) is in many ways a horrible thing, in that it takes small noisy differences in z-scores and elevates them into the apparently huge differences between p=.1, p=.01, p=.001. This is the point of the paper with Hal Stern, “The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” The p-value, like any data summary, is a random variable with a sampling distribution.

Incidentally, I have the same feeling about cross-validation-based estimates and even posterior distributions: all of these are functions of the data and thus have sampling distributions, but theoreticians and practitioners alike tend to forget this and instead treat them as truths.

One problem with this particular article is that it takes p-values at face value, whereas in real life p-values typically are the product of selection, as discussed by Uri Simonsohn et al. a few years ago in their “p-hacking” article and as discussed by Eric Loken and myself a couple years ago in our “garden of forking paths” article. I think real-world p-values are much more optimistic than the nominal p-values discussed by Lazzeroni et al. But in any case I think they’re raising an important point that’s been under-emphasized in textbooks and in the statistics literature.
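A quick simulation makes the Lazzeroni et al. point vivid (a toy example of my own): even with a real effect and about 80 percent power, p-values from exact replications of the same experiment are all over the map.

set.seed(2016)
n <- 64; delta <- 0.5     # two-sample t-test with roughly 80% power
pvals <- replicate(1e4, t.test(rnorm(n, delta), rnorm(n, 0))$p.value)
round(quantile(pvals, c(0.1, 0.25, 0.5, 0.75, 0.9)), 4)
mean(pvals < 0.05)        # close to the nominal power of about 0.80

And these are clean, selection-free p-values; the p-hacked variety discussed above behave even worse.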

Guy Fieri wants your help! For a TV show on statistical models for real estate


I got the following email from David Mulholland:

I’m a producer at Citizen Pictures where we produce Food Network’s “Diners, Dives and Drive-Ins” and Bravo’s digital series, “Going Off The Menu,” among others. A major network is working with us to develop a show that pits “data” against a traditional real estate agent to see who can find a home buyer the best house for them. In this show, both the real estate agent and the data team each choose two properties using their very different methods. The show will ask the question: “Who will do a better job of figuring out what the client wants, ‘data’ or the traditional real estate agent?”

TV and real estate are two topics I know nothing about, so I pointed Mulholland to some Finnish dudes who do sophisticated statistical modeling of the housing market. They didn’t think it was such a good fit for them, with Janne Sinkkonen remarking that “Models are good at finding trends and averages from large, geographically or temporally sparse data. The richness of a single case, seen on the spot, is much better evaluated by a human.”

That makes sense, but it is also possible that a computer-assisted human can do better than a human alone. Say you have a model that gives quick price estimates for every house. Those estimates are sitting on the computer. A human then goes to house X and assesses its value at, say, $350,000. The human then looks up and sees that the computer gave an assessment, based on some fitted algorithm, of $420,000. What does the human conclude? Not necessarily that the computer is wrong; rather, at this point the human can introspect and consider why the computer estimate is so far off. What features of the house make it so much less valuable than the computer “thinks”? Perhaps some features not incorporated into the computer’s model, for example the state of the interior of the house, or a bad paint job and unkempt property, or something about the location that had not been in the model. This sort of juxtaposition can be valuable.
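A crude sketch of that juxtaposition (toy numbers and made-up column names, just to show the workflow):

# Flag listings where the model and the human assessor disagree badly; those are
# the prompts for the human to ask what the model isn't seeing (interior condition,
# paint job, location quirks, and so on).
houses <- data.frame(
  id             = 1:3,
  human_estimate = c(350000, 610000, 275000),
  model_estimate = c(420000, 595000, 240000)
)
houses$gap_pct <- with(houses, (model_estimate - human_estimate) / human_estimate)
subset(houses, abs(gap_pct) > 0.10)   # cases worth a closer look on site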

That said, I still know nothing about real estate or about what makes good TV, so I offered to post Mulholland’s question here. He said sure, and added:

I’m particularly delighted to hear your analysis of a “computer-assisted human” as that is a direction we have been investigating. Simply put, we do not have the resources to implement any sort of fully computerized solution. I think the computer-assisted human is definitely a direction we would take.

I’d love to hear the thoughts of blog readers. At the moment, the big question we are considering is, “Assuming that we have full access to a user’s data (with the user’s cooperation of course . . . data examples include Facebook, web browser history, online shopping history, geotracking, etc), how can we use human and computer to best sort through this data to find the house the user will like the most?”

Ball’s in your court now, TV-savvy blog readers!

Amazon NYC decision analysis jobs

Dean Foster writes:

Amazon is having a hiring event (Sept 8/9) here in NYC. If you are interested in working on demand forecasting either here in NYC or in Seattle send your resume to rcariapp@amazon.com by September 1st, 2016.

Here’s the longer blurb:

Amazon Supply Chain Optimization Technologies (SCOT) builds systems that automate decisions in Amazon’s supply chain. These systems are responsible for predicting customer demand; optimization of sourcing, buying and placement of all inventory, and ensuring optimal customer experience from an order fulfillment promise perspective. In other words, our systems automatically decide how much to buy of which items, from which vendors/suppliers, which fulfilment centers to put them in, how to get it there, how to get it to the customer and what to promise customers – all while maximizing customer satisfaction and minimizing cost.

Could be interesting, and it’s always fun to work on real decision problems!