
Exploring model fit by looking at a histogram of a posterior simulation draw of a set of parameters in a hierarchical model

Opher Donchin writes in with a question:

We’ve been finding it useful in the lab recently to look at the histogram of samples from the parameter combined across all subjects. We think, but we’re not sure, that this reflects the distribution of that parameter when marginalized across subjects and can be a useful visualization. It can be easier to interpret than the hyperparameters from which subjects are sampled and it is available in situations where the hyperparameters are not explicitly represented in the model.

I haven’t seen this being used much, and so I’m not confident that it is a reasonable thing to consider. I’m also not sure of my interpretation.

My reply:

Yes, I think this can make a lot of sense! We discuss an example of this technique on pages 155-157 of BDA3; see the following figures.

First we display a histogram of a draw from the posterior distribution of two sets of parameters in our model. Each histogram is a melange of parameters from 30 participants in the study.

The histograms do not look right; there is a conflict between these inferences and the prior distributions.

So we altered the model. Here are the corresponding histograms of the parameters under the new model:

These histograms seem like a good fit to the assumed prior (population) distributions in the new model.
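
If you want to make this kind of plot from your own fit, here is a minimal sketch in Python (not the BDA3 example itself); the arrays theta, mu, and tau below are synthetic stand-ins for the posterior draws you would extract from your own fitted hierarchical model:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)

# Synthetic stand-ins for posterior draws from a fitted hierarchical model:
# theta has shape (n_draws, n_subjects); mu and tau are the hyperparameter draws.
n_draws, n_subjects = 1000, 30
mu = rng.normal(0.0, 0.2, n_draws)
tau = np.abs(rng.normal(1.0, 0.1, n_draws))
theta = rng.normal(mu[:, None], tau[:, None], (n_draws, n_subjects))

s = 123  # pick a single posterior draw
plt.hist(theta[s], bins=15, density=True, alpha=0.5,
         label="subject-level parameters, draw s")
grid = np.linspace(theta[s].min() - 1, theta[s].max() + 1, 200)
plt.plot(grid, norm.pdf(grid, mu[s], tau[s]),
         label="population distribution implied by draw s")
plt.legend()
plt.show()
```

The point of overlaying the population density implied by the same draw is that the check is internally consistent: each draw of the subject-level parameters is compared to its own hyperparameters, which is exactly the kind of mismatch that showed up under the first model above.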

The example comes from this 1998 article with Michel Meulders, Iven Van Mechelen, and Paul De Boeck.

When “nudge” doesn’t work: Medication Reminders to Outcomes After Myocardial Infarction

Gur Huberman points to this news article by Aaron Carroll, “Don’t Nudge Me: The Limits of Behavioral Economics in Medicine,” which reports on a recent study by Kevin Volpp et al. that set out “to determine whether a system of medication reminders using financial incentives and social support delays subsequent vascular events in patients following AMI compared with usual care”—and found no effect:

A compound intervention integrating wireless pill bottles, lottery-based incentives, and social support did not significantly improve medication adherence or vascular readmission outcomes for AMI survivors.

That said, there were some observed differences between the two groups, most notably:

Mean (SD) medication adherence did not differ between control (0.42 [0.39]) and intervention (0.46 [0.39]) (difference, 0.04; 95% CI, −0.01 to 0.09; P = .10).

An increase in adherence from 42% to 46% ain’t nothing, but, yes, a null effect is also within the margin of error. And, in any case, 46% adherence is not so impressive.

Here’s Carroll:

A thorough review published in The New England Journal of Medicine about a decade ago estimated that up to two-thirds of medication-related hospital admissions in the United States were because of noncompliance . . . To address the issue, researchers have been trying various strategies . . . So far, there hasn’t been much progress. . . . A more recent Cochrane review concluded that “current methods of improving medication adherence for chronic health problems are mostly complex and not very effective.” . . .

He then describes the Volpp et al. study quoted above:

Researchers randomly assigned more than 1,500 people to one of two groups. All had recently had heart attacks. One group received the usual care. The other received special electronic pill bottles that monitored patients’ use of medication. . . .

Also:

Those patients who took their drugs were entered into a lottery in which they had a 20 percent chance to receive $5 and a 1 percent chance to win $50 every day for a year.

That’s not all. The lottery group members could also sign up to have a friend or family member automatically be notified if they didn’t take their pills so that they could receive social support. They were given access to special social work resources. There was even a staff engagement adviser whose specific duty was providing close monitoring and feedback, and who would remind patients about the importance of adherence.

But, Carroll writes:

The time to first hospitalization for a cardiovascular problem or death was the same between the two groups. The time to any hospitalization and the total number of hospitalizations were the same. So were the medical costs. Even medication adherence — the process measure that might influence these outcomes — was no different between the two groups.

This is not correct. There were, in fact, differences. But, yes, the differences were not statistically significant and it looks like differences of that size could’ve occurred by chance alone. So we can say that the treatment had no clear or large apparent effects.

Carroll also writes:

Maybe financial incentives, and behavioral economics in general, work better in public health than in more direct health care.

I have no idea why he is saying this. Also it’s not clear to me how he distinguishes “public health” from “direct health care.” He mentions weight loss and smoking cessation but these seem to blur the boundary, as they’re public health issues that are often addressed by health care providers.

Anyway, my point here is not to criticize Carroll. It’s an interesting topic. My quick thought on why nudges seem so ineffective here is that people must have good reasons for not complying—or they must think they have good reasons. After all, complying would seem to be a good idea, and it’s close to effortless, no? So if the baseline rate of compliance is really only 40%, maybe it would take a lot to convince those other 60% to change their behaviors.

It’s similar to the difficulty of losing weight or quitting smoking. It’s not that it’s so inherently hard to lose weight or to quit smoking; it’s that people who can easily lose weight or quit smoking have already done so, and it’s the tough cases that remain. Similarly, the people who are easy to convince to comply . . . they’re already complying with the treatment. The noncompliers are a tougher nut to crack.

Comparing racism from different eras: If only Tucker Carlson had been around in the 1950s he could’ve been a New York Intellectual.

TV commentator Tucker Carlson raised a stir in 2018 by saying that immigration makes the United States “poorer, and dirtier, and more divided,” which reminded me of this rant from literary critic Alfred Kazin in 1957:

[Screenshot of Alfred Kazin’s 1957 diary entry]

Kazin put it in his diary and Carlson broadcast it on TV, so not quite the same thing.

But this juxtaposition made me think of Keith Ellis’s comment that “there’s much less difference between conservatives and progressives than most people think. Maybe one or two generations of majority opinion, at most.”

When people situate themselves on political issues, I wonder how much of this is on the absolute scale and how much is relative to current policies or the sense of the prevailing opinion. Is Tucker Carlson more racist than Alfred Kazin? Does this question even make sense? Maybe it’s like comparing baseball players from different eras, e.g. Mike Trout vs. Babe Ruth as hitters. Or, since we’re on the topic of racism, Ty Cobb vs. John Rocker.

Classifying yin and yang using MRI

Zad Chow writes:

I wanted to pass along this study I found a while back that aimed to see whether there was any possible signal in an ancient Chinese theory of depression that classifies major depressive disorder into “yin” and “yang” subtypes. The authors write the following,

The “Yin and Yang” theory is a fundamental concept of traditional Chinese Medicine (TCM). The theory differentiates MDD patients into two subtypes, Yin and Yang, based on their somatic symptoms, which had empirically been used for the delivery of effective treatment in East Asia. Nonetheless, neural processes underlying Yin and Yang types in MDD are poorly understood. In this study, we aim to provide physiological evidence using functional magnetic resonance imaging (fMRI) to identify altered resting-state brain activity associated with Yin and Yang types in drug-naïve MDD patients.

They didn’t really have much prior evidence to go on with this study, so a lot of the analyses seemed exploratory,

The aim of this exploratory study is to provide physiological evidence, using functional magnetic resonance imaging (fMRI), to identify altered resting-state brain activity associated with Yin and Yang types in drug-naïve MDD patients. Previous studies using the functional connectivity (FC) method of resting-state fMRI demonstrated altered inter- and intra-regional brain connectivity, including local functional connectivity in the medial prefrontal cortex and frontoparietal hypoconnectivity in MDD brains (14, 15). As proposed in Drysdale’s work (8), differential brain function at resting-state may be a useful physiological marker to identify specific subpopulations of MDD patients. Thus, we hypothesize that resting-state brain activity and FC in MDD patients with Yin type are altered when compared to those with Yang type. To test this hypothesis, we examined resting-state functional activities across the entire brain in MDD patients in both Yin and Yang groups as well as matched healthy controls.

The authors ended up finding a few differences that were corrected for using the AlphaSim approach (a method to correct for multiple comparisons in fMRI studies), plus a few comparisons that weren’t corrected because they were considered exploratory. The authors state,

To the best of our knowledge, this is the first study demonstrating [emphasis added] biological differences in brain function associated with Yin and Yang types characterized by somatic symptoms.

I think the conclusions the authors draw here are fairly interesting because it seems there wasn’t that much evidence to go on with this theory besides ancient Chinese traditions. They acknowledge that a lot of the study is exploratory, yet they’re able to say quite confidently that they’ve demonstrated biological differences between participants classified as “yin” and those classified as “yang”.

I personally believe that subtypes of depression likely do exist. We’ve had some interesting discoveries using data-driven clustering (which is a method that obviously has problems of its own) and it would be in our best interest to discover accurate subgroups so we could tailor therapies for them, but the idea of depressed patients being classified as yin and yang doesn’t sound very realistic to me.

And the conclusions of studies like this, even when correcting for multiple comparisons (which I know you think is unnecessary when using multilevel modeling), make me incredibly skeptical of fMRI studies. Would love to hear your thoughts.

My reply: I took a very quick look at the article. It seems that there are 48 people in the study, and it’s not clear at all how we are supposed to draw conclusions about the general population. The groups identified as “yin” and “yang” are different in systematic ways—something about somatic symptoms and responses to a questionnaire—so you’d expect to see some differences in other measures too. But, again, I don’t know what this really tells us about people outside the study.

The point of the study can’t be just to demonstrate that the two groups are different. We already knew they were different in some systematic ways, even before doing a single MRI scan. The real question is what are the systematic differences. And, for that, statistical significance is not so useful.

I guess they could consider a preregistered replication. But I share your concern, as it does seem like a bit of a fishing expedition. And I don’t think the researchers would have much of a motivation to do a replication study, as the potential losses from a failed replication are greater than the potential gains from a successful replication.

Just to be clear: I know nothing about yin and yang and I only skimmed the article, I did not read it carefully. So I’m just giving my general impression, which is that I’d be cautious about generalizing beyond these 48 people in the particular setting of the study.

Why do sociologists (and bloggers) focus on the negative? 5 possible explanations. (A post in the style of Fabio Rojas)

Fabio Rojas asks why the academic field of sociology seems so focused on the negative. As he puts it, why doesn’t the semester begin with the statement, “Hi, everyone, this is soc 101, the scientific study of society. In this class, I’ll tell you about how American society is moving in some great directions as well as some lingering problems”?

Rojas writes:

If sociology is truly a broad social science, and not just the study of “social problems,” then we might encourage more research into the undeniably positive improvements in human well being.

This suggestion interests me, in part because on this blog we are often negative. We sometimes write about cool new methods or findings in statistical modeling, causal inference, and social science, but we also spend a lot of time on the negative. And it’s not just us; it’s my impression that blogs in general have a lot of negativity, in the same way that movie reviews are often negative. Even if a reviewer likes a movie, he or she will often take some space to point out possible areas of improvement. And many of the most-remembered reviews are slams.

Rather than getting into a discussion of whether blogs, or academic sociology, or movie reviews, should be more positive or negative, let’s get into the more interesting question of Why.

Why is negativity such a standard response? Let me try to answer in Rojas style:

1. Division of labor. Within social science, sociology’s “job” is to confront us with the bad news, to push us to study inconvenient truths. If you want to hear good news, you can go listen to the economists. Similarly, blogs took the “job” of criticizing the mainstream media (and, later, the scientific establishment); it was a niche that needed filling. If you want to be a sociologist or blogger and focus on the good things, that’s fine, but you’ll be atypical. Explanation 1 suggests that sociologists (and bloggers, and movie reviewers) have adapted to their niches in the intellectual ecosystem, and that each field has the choice of continuing to specialize or to broaden by trying to occupy some of the “positivity” space occupied by other institutions.

2. Efficient allocation of resources. Where can we do the most good? Reporting positive news is fine, but we can do more good by focusing on areas of improvement. I think this is somewhat true, but not always. Yes, it’s good to point out where people can do better, but we can also do good by understanding how good things happen. This is related to the division-of-labor idea above, or it could be considered an example of comparative advantage.

3. Status. Sociology doesn’t have the prestige of economics (more generally, social science doesn’t have the prestige of the natural sciences); blogs have only a fraction of the audience of the mass media (and we get paid even less for blogging than they get paid for their writing); and movie reviewers, of course, are nothing but parasites on the movie industry. So maybe we are negative for emotional reasons—to kick back at our social superiors—or for strategic reasons, to justify our existence. Either way, these are actions of insecure people in the middle, trying to tear down the social structure and replace it with a new one where they’re at the top. This is kind of harsh and it can’t fully be true—how, for example, would it explain that even the sociologists who are tenured professors at top universities still (presumably) focus on the bad news, or that even star movie reviewers can be negative—but maybe it’s part of the way that roles and expectations are established and maintained.

4. Urgency. Psychiatrists work with generally-healthy people as well as the severely mentally ill. But caring for the sickest is the most urgent: these are people who are living miserable lives, or who pose danger to themselves and others. Similarly (if on a lesser scale of importance), we as social scientists might feel that progress will continue on its own, while there’s no time to wait to fix serious social ills. Similarly, as a blogger, I might not bother saying much about a news article that was well reported, because the article itself did a good job of sending its message. But it might seem more urgent to correct an error. Again, this is not always good reasoning—it could be that understanding a positive trend and keeping it going is more urgent than alerting people to a problem—but I think this may be one reason for a seeming focus on negativity. As Auden put it,

To-morrow, perhaps the future. The research on fatigue
And the movements of packers; the gradual exploring of all the
Octaves of radiation;
To-morrow the enlarging of consciousness by diet and breathing.

To-morrow the rediscovery of romantic love,
the photographing of ravens; all the fun under
Liberty’s masterful shadow;
To-morrow the hour of the pageant-master and the musician,

The beautiful roar of the chorus under the dome;
To-morrow the exchanging of tips on the breeding of terriers,
The eager election of chairmen
By the sudden forest of hands. But to-day the struggle.

5. Man bites dog. Failures are just more interesting to write about, and to read about, than successes. We’d rather hear the story of “secrets and lies in a Silicon Valley startup” than hear the boring story of a medical device built by experienced engineers and sold at a reasonable price. Hence the popularity within social science (not just sociology!) of stories of the form, Everything looks like X but not Y; the popularity among bloggers of Emperor’s New Clothes narratives; and the popularity among movie reviewers of, This big movie isn’t all that. You will occasionally get it the other way—This seemingly bad thing is really good—but it’s generally in the nature of contrarian takes to be negative, because they’re reacting to some previous positive message coming from public relations and the news media.

Finally, some potential explanations that I don’t think really work:

Laziness. Maybe it’s less effort to pick out things to complain about than to point out good news. I don’t think so. When it comes to society, as Rojas notes in his post, there are lots of positive trends to point out. Similarly, science is full of interesting papers—open up just about any journal and look for the best, most interesting ideas—and there are lots of good movies too.

Rewards. You get more credit, pay, and glory for being negative than positive. Again, I don’t think so. Sure, there are the occasional examples such as H. L. Mencken, but I think the smoother path to career success is to say positive things. Pauline Kael, for example, had some memorable pans but I’d say her characteristic stance was enthusiasm. For every Thomas Frank there are three Malcolm Gladwells (or so I say based on my unscientific guess), and it’s the Gladwells who get more of the fame and fortune.

Personality. Sociologists, bloggers, and reviewers are, by and large, malcontents. They grumble about things cos that’s what they do, and whiny people are more likely to gravitate to these activities. OK, maybe so, but this doesn’t really explain why negativity is concentrated in these fields and media rather than others. The “personality” explanation just takes us back to our first explanation, “division of labor.”

And, yes, I see the irony that this post, which is all about why sociologists and bloggers are so negative, has been sparked by a negative remark made by a sociologist on a blog. And I’m sure you will have some negative things to say in the comments. After all, the only people more negative than bloggers, are blog commenters!

Surprise-hacking: “the narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists”

Teppo Felin sends along this article with Mia Felin, Joachim Krueger, and Jan Koenderink on “surprise-hacking,” and writes:

We essentially see surprise-hacking as the upstream, theoretical cousin of p-hacking. Though, surprise-hacking can’t be resolved with replication, more data or preregistration. We use perception and priming research to make these points (linking to Kahneman and priming, Simons and Chabris’s famous gorilla study and its interpretation, etc).

We think surprise-hacking implicates theoretical issues that haven’t meaningfully been touched on – at least in the limited literatures that we are aware of (mostly in cog sci, econ, psych). Though, there are probably related literatures out there (which you are very likely to know) – so I’m curious if you are aware of papers in other domains that deal with this or related issues?

I think the point that Felin et al. are making is that results obtained under conditions of surprise might not generalize to normal conditions. The surprise in the experiment is typically thought of as a mechanism for isolating some phenomenon—part of the design of the experiment—but arguably it is one of the conditions of the experiment as well. Thus, the conclusion of a study conducted under surprise should not be, “People show behavior X,” but rather, “People show behavior X under a condition of surprise.”

Regarding Felin’s question to me: I am not aware of any discussion of this issue in the political science literature, but maybe there’s something out there, or perhaps something related? All I can think of right now is experiments on public opinion and voting, where there is some discussion of relevance of isolated experiments to real-world behavior when people are subject to many influences.

I’ll conclude with a line from Felin et al.’s paper:

The narrative of blindness and illusion sells, and therefore continues to be the central thesis of popular books written by psychologists and cognitive scientists.


How should they carry out repeated cross-validation? They would like a third expert opinion…

Someone writes:

I’m a postdoc studying scientific reproducibility. I have a machine learning question that I desperately need your help with. . . .

I’m trying to predict whether a study can be successfully replicated (DV) from the texts in the original published article. Our hypothesis is that language contains useful signals in distinguishing reproducible findings from irreproducible ones. The nuances might be invisible to human eyes, but they can be detected by machine algorithms.

The protocol is illustrated in the following diagram to demonstrate the flow of cross-validation. We conducted a repeated three-fold cross-validation on the data.

STEP 1) Train a doc2vec model on the training data (2/3 of the data) to convert raw texts into vectors representing language features (this algorithm is non-deterministic; the models and the outputs can differ even with the same input and parameters)
STEP 2) Infer vectors using the doc2vec model for both training and test sets
STEP 3) Train a logistic regression using the training set
STEP 4) Apply the logistic regression to the test set, generate a predicted probability of success

Because doc2vec is not deterministic, and we have a small training sample, we came up with two choices of strategies:

(1) All studies were first divided into three subsamples A, B, and C. Step 1 through 4 was done once with sample A as the test set, and a combined sample of B and C as the training set, generating one predicted probability for each study in sample A. To generate probabilities for the entire sample, Step 1 through 4 was repeated two more times, setting sample B or C as the test set respectively. At this moment, we had one predicted probability for each study. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same three-fold cross-validation. A new probability was generated for each study this time. The whole procedure was iterated 100 times, so each study had 100 different probabilities. We averaged the probabilities and compared the average probabilities with the ground truth to generate a single AUC score.

(2) All studies were first divided into three subsamples A, B, and C. Step 1 through 4 was first repeated 100 times with sample A as the test set, and a combined sample of B and C as the training set, generating 100 predicted probabilities for each study in sample A. As I said, these 100 probabilities are different because doc2vec isn’t deterministic. We took the average of these probabilities and treated that as our final estimate for the studies. To generate average probabilities for the entire sample, each group of 100 runs was repeated two more times, setting sample B or C as the test set respectively. An AUC was calculated upon completion, between the ground truth and the average probabilities. Subsequently, the entire sample was shuffled to create a different random three-fold partition, followed by the same 3×100 runs of modeling, generating a new AUC. The whole procedure was iterated on 100 different shuffles, and an AUC score was calculated each time. We ended up having a distribution of 100 AUC scores.

I personally thought strategy two is better because it separates variation in accuracy due to sampling from the non-determinism of doc2vec. My colleague thought strategy one is better because it’s less computationally intensive, produces better results, and doesn’t have obvious flaws.

My first thought is to move away from the idea of declaring a study as being “successfully replicated.” Better to acknowledge the continuity of the results from any study.

Getting to the details of your question on cross-validation: Jeez, this really is complicated. I keep rereading your email over and over again and getting confused each time. So I’ll throw this one out to the commenters. I hope someone can give a useful suggestion . . .

OK, I do have one idea, and that’s to evaluate your two procedures (1) and (2) using fake-data simulation: Start with a known universe, simulate fake data from that universe, then apply procedures (1) and (2) and see if they give much different answers. Loop the entire procedure and see what happens, comparing your cross-validation results to the underlying truth which in this case is assumed known. Fake-data simulation is the brute-force approach to this problem, and perhaps it’s a useful baseline to help understand your problem.
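
To make that concrete, here is a rough sketch of such a fake-data simulation in Python. It is not the actual doc2vec pipeline: a deliberately noisy, nondeterministic feature extractor stands in for doc2vec, the nesting of repeats is simplified, and the point is only to show the skeleton of comparing strategy (1) (average the probabilities, then compute one AUC) against strategy (2) (compute an AUC per repeat and look at the distribution) in a universe where the truth is known:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 90, 5
x_true = rng.normal(size=(n, d))                             # the "true" language signal
y = rng.binomial(1, 1 / (1 + np.exp(-x_true @ np.ones(d))))  # known universe: outcome depends on the signal

def noisy_embed(x, rng):
    """Crude stand-in for doc2vec: same input, different output on every run."""
    return x + rng.normal(scale=1.0, size=x.shape)

def one_cv_pass(rng):
    """One shuffled three-fold CV pass; returns predicted probabilities for all studies."""
    probs = np.empty(n)
    kf = KFold(n_splits=3, shuffle=True, random_state=int(rng.integers(1_000_000)))
    for train, test in kf.split(x_true):
        z = noisy_embed(x_true, rng)                                     # steps 1-2: nondeterministic embedding
        fit = LogisticRegression(max_iter=1000).fit(z[train], y[train])  # step 3
        probs[test] = fit.predict_proba(z[test])[:, 1]                   # step 4
    return probs

# Strategy (1): average the probabilities over 100 repeats, then compute one AUC.
auc1 = roc_auc_score(y, np.mean([one_cv_pass(rng) for _ in range(100)], axis=0))

# Strategy (2): compute one AUC per repeat and keep the whole distribution.
auc2 = [roc_auc_score(y, one_cv_pass(rng)) for _ in range(100)]

print(auc1, np.mean(auc2), np.std(auc2))
```

Because the generating process is known here, you can also vary the signal strength (or set it to zero) and check which summary, the single averaged-probability AUC or the distribution of per-repeat AUCs, gives the less misleading picture.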

A couple of thoughts regarding the hot hand fallacy fallacy

For many years we all believed the hot hand was a fallacy. It turns out we were all wrong. Fine. Such reversals happen.

Anyway, now that we know the score, we can reflect on some of the cognitive biases that led us to stick with the “hot hand fallacy” story for so long.

Jason Collins writes:

Apart from the fact that this statistical bias slipped past everyone’s attention for close to thirty years, I [Collins] find this result extraordinarily interesting for another reason. We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.
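
The statistical bias Collins refers to is easy to see by simulation. Here is a minimal sketch: even when every shot is an independent flip with constant probability p, the proportion of hits immediately following a hit, averaged over short sequences, comes out below p:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_shots, n_sims = 0.5, 10, 200_000

props = []
for _ in range(n_sims):
    shots = rng.random(n_shots) < p      # iid makes with constant probability p
    after_hit = shots[1:][shots[:-1]]    # outcomes immediately following a hit
    if after_hit.size:                   # keep sequences with at least one hit to condition on
        props.append(after_hit.mean())

print(np.mean(props))  # noticeably below 0.5, even though p = 0.5 throughout
```

The shorter the sequences, the larger the gap; that selection effect is the bias Miller and Sanjurjo identified in the original hot-hand analyses.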

Also I was thinking a bit more about the hot hand, in particular a flaw in the underlying logic of Gilovich et al. (and also me, before Miller and Sanjurjo convinced me about the hot hand): The null model is that each player j has a probability p_j of making a given shot, and that p_j is constant for the player (considering only shots of some particular difficulty level). But where does p_j come from? Obviously players improve with practice, with game experience, with coaching, etc. So p_j isn’t really a constant. But if “p” varies among players, and “p” varies over the time scale of years or months for individual players, why shouldn’t “p” vary over shorter time scales too? In what sense is “constant probability” a sensible null model at all?

I can see that “constant probability for any given player during a one-year period” is a better model than “p varies wildly from 0.2 to 0.8 for any player during the game.” But that’s a different story. The more I think about the “there is no hot hand” model, the more I don’t like it as any sort of default.

In any case, it’s good to revisit our thinking about these theories in light of new arguments and new evidence.

Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise.

I have a sad story for you today.

Jason Collins tells it:

In The (Honest) Truth About Dishonesty, Dan Ariely describes an experiment to determine how much people cheat . . . The question then becomes how to reduce cheating. Ariely describes one idea:

We took a group of 450 participants and split them into two groups. We asked half of them to try to recall the Ten Commandments and then tempted them to cheat on our matrix task. We asked the other half to try to recall ten books they had read in high school before setting them loose on the matrices and the opportunity to cheat. Among the group who recalled the ten books, we saw the typical widespread but moderate cheating. On the other hand, in the group that was asked to recall the Ten Commandments, we observed no cheating whatsoever.

Sounds pretty impressive! But these things all sound impressive when described at some distance from the data.

Anyway, Collins continues:

This experiment has now been subject to a multi-lab replication by Verschuere and friends. The abstract of the paper:

. . . Mazar, Amir, and Ariely (2008; Experiment 1) gave participants an opportunity and incentive to cheat on a problem-solving task. Prior to that task, participants either recalled the 10 Commandments (a moral reminder) or recalled 10 books they had read in high school (a neutral task). Consistent with the self-concept maintenance theory . . . moral reminders reduced cheating. The Mazar et al. (2008) paper is among the most cited papers in deception research, but it has not been replicated directly. This Registered Replication Report describes the aggregated result of 25 direct replications (total n = 5786), all of which followed the same pre-registered protocol. . . .

And what happened? It’s in the graph above (from Verschuere et al., via Collins). The average estimated effect was tiny, it was not conventionally “statistically significant” (that is, the 95% interval included zero), and it “was numerically in the opposite direction of the original study.”

As is typically the case, I’m not gonna stand here and say I think the treatment had no effect. Rather, I’m guessing it has an effect which is sometimes positive and sometimes negative; it will depend on person and situation. There doesn’t seem to be any large and consistent effect, that’s for sure. Which maybe shouldn’t surprise us. After all, if the original finding was truly a surprise, then we should be able to return to our original state of mind, when we did not expect this very small intervention to have such a large and consistent effect.

I promised you a sad story. But, so far, this is just one more story of a hyped claim that didn’t stand up to the rigors of science. And I can’t hold it against the researchers that they hyped it: if the claim had held up, it would’ve been an interesting and perhaps important finding, well worth hyping.

No, the sad part comes next.

Collins reports:

Multi-lab experiments like this are fantastic. There’s little ambiguity about the result.

That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.

You can read the response and judge for yourself. I think Collins’s report is accurate, and that’s what made me sad. These people care enough about this topic to conduct a study, write it up in a research article and then in a book—but they don’t seem to care enough to seriously entertain the possibility they were mistaken. It saddens me. Really, what’s the point of doing all this work if you’re not going to be open to learning?

(See this comment for further elaboration of these points.)

And there’s no need to think anything done in the first study was unethical at the time. Remember Clarke’s Law.

Another way of putting it is: Ariely’s book is called “The Honest Truth . . .” I assume Ariely was honest when writing this book; that is, he was expressing sincerely-held views. But honesty (and even transparency) are not enough. Honesty and transparency supply the conditions under which we can do good science, but we still need to perform good measurements and study consistent effects. The above-discussed study failed in part because of the old, old problem that they were using a between-person design to study within-person effects; see here and here. (See also this discussion from Thomas Lumley on a related issue.)

P.S. Collins links to the original article by Mazar, Amir, and Ariely. I guess that if I’d read it in 2008 when it appeared, I’d’ve believed all its claims too. A quick scan shows no obvious problems with the data or analyses. But there can be lots of forking paths and unwittingly opportunistic behavior in data processing and analysis; recall the 50 Shades of Gray paper (in which the researchers performed their own replication and learned that their original finding was not real) and its funhouse parody 64 Shades of Gray paper, whose authors appeared to take their data-driven hypothesizing all too seriously. The point is: it can look good, but don’t trust yourself; do the damn replication.

Time series of Democratic/Republican vote share in House elections

Yair prepared this graph of average district vote (imputing uncontested seats at 75%/25%; see here for further discussion of this issue) for each House election year since 1976:

Decades of Democratic dominance persisted through 1992; since then the two parties have been about even.

As has been widely reported, a mixture of geographic factors and gerrymandering has given Republicans the edge in House seats in recent years (most notably in 2012, when they retained control even after losing the national vote), but if you look at aggregate votes it’s been a pretty even split.

The above graph also shows that the swing in 2018 was pretty big: not as large as the historic swings in 1994 and 2010, but about the same as the Democratic gains in 2006 and larger than any other swing in the past forty years.

See here and here for more on what happened in 2018.
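
For concreteness, here is a minimal sketch of the averaging described at the top of this post, with uncontested races imputed at 75%/25%. The column names are hypothetical, and this is not Yair's actual code:

```python
import pandas as pd

def avg_district_vote(df):
    """df: one row per district-year, with columns 'year', 'dem_votes', 'rep_votes'."""
    share = df["dem_votes"] / (df["dem_votes"] + df["rep_votes"])  # two-party Democratic share
    share = share.where(df["rep_votes"] > 0, 0.75)  # no Republican on the ballot -> impute 75%
    share = share.where(df["dem_votes"] > 0, 0.25)  # no Democrat on the ballot -> impute 25%
    return share.groupby(df["year"]).mean()         # unweighted average district vote, by year
```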

“Do you have any recommendations for useful priors when datasets are small?”

A statistician who works in the pharmaceutical industry writes:

I just read your paper (with Dan Simpson and Mike Betancourt) “The Prior Can Often Only Be Understood in the Context of the Likelihood” and I find it refreshing to read that “the practical utility of a prior distribution within a given analysis then depends critically on both how it interacts with the assumed probability model for the data in the context of the actual data that are observed.” I also welcome your comment about the importance of “data generating mechanism” because, for me, it is akin to selecting the “appropriate” distribution for a given response. I always make the point to the people I’m working with that we need to consider the clinical, scientific, physical and engineering principles governing the underlying phenomenon that generates the data; e.g., forces are positive quantities, particles are counts, yield is bounded between 0 and 1.

You also talk about the “big data, small signal revolution.” In industry, however, we face the opposite problem, our datasets are usually quite small. We may have a new product, for which we want to make some claims, and we may have only 4 observations. I do not consider myself a Bayesian, but I do believe that Bayesian methods can be very helpful in industrial situations. I also read your Prior Choice Recommendations [see also discussion here — AG] but did not find anything specific about small sample sizes. Do you have any recommendations for useful priors when datasets are small?

My reply:

When datasets are small, and when data are noisier, that’s when priors are more important. When in doubt, I think the way to explore anything in statistics, including priors, is through fake data simulation, which in this case will give you a sense of what is implied, in terms of potential patterns in data, from any particular set of prior assumptions. Typically we set priors to be too weak, and this can be seen in replicated data that include extreme and implausible results.
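
Here is a minimal sketch of what such a fake-data (prior predictive) simulation might look like for a tiny dataset. The model is made up (four positive measurements, modeled on the log scale), but it shows the mechanics: draw parameters from the prior, simulate data, and check whether the implied measurements are on a plausible scale:

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_predictive(mu_sd, n=4, n_sims=1000):
    """Simulate fake datasets implied by a normal(0, mu_sd) prior on the log-scale mean."""
    mu = rng.normal(0, mu_sd, n_sims)         # prior draws of the log-scale mean
    sigma = np.abs(rng.normal(0, 1, n_sims))  # half-normal(1) prior on the log-scale sd
    return np.exp(rng.normal(mu[:, None], sigma[:, None], (n_sims, n)))  # fake datasets

# A very weak prior on the log-mean implies absurdly large measurements;
# a tighter prior implies data on a plausible scale.
print(np.percentile(prior_predictive(mu_sd=10), [50, 99]))
print(np.percentile(prior_predictive(mu_sd=1), [50, 99]))
```

If the simulated datasets routinely include values that could never occur in your application, the prior is weaker than your actual knowledge, and tightening it is usually the cheapest way to add information when n = 4.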

Prior distributions for covariance matrices

Someone sent me a question regarding the inverse-Wishart prior distribution for a covariance matrix, as it is the default in some software he was using. The inverse-Wishart does not make sense as a default prior distribution; it has problems because the shape and scale are tangled. See this paper, “Visualizing Distributions of Covariance Matrices,” by Tomoki Tokuda, Ben Goodrich, Iven Van Mechelen, Francis Tuerlinckx and myself. Right now I’d use the LKJ family. In Stan there are lots of options. See also our wiki on prior distributions.
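
Here is a minimal sketch of the kind of prior simulation that makes the problem visible: draw covariance matrices from a low-degrees-of-freedom inverse-Wishart and look at how the implied scale and correlation move together (as I understand it, small variances get pulled toward correlations near zero and large variances toward correlations near plus or minus one):

```python
import numpy as np
from scipy.stats import invwishart

# Draws of 2x2 covariance matrices from an inverse-Wishart prior with low df.
S = invwishart(df=3, scale=np.eye(2)).rvs(size=20_000, random_state=0)

sd1 = np.sqrt(S[:, 0, 0])                             # implied scale of the first variable
corr = S[:, 0, 1] / np.sqrt(S[:, 0, 0] * S[:, 1, 1])  # implied correlation

# Dependence between (log) scale and the magnitude of the correlation;
# a clearly nonzero value means scale and shape are entangled in the prior.
print(np.corrcoef(np.log(sd1), np.abs(corr))[0, 1])
```

With an LKJ prior on the correlation matrix and separate priors on the scales, the two pieces can be controlled independently, which is the main practical advantage.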

Should we be concerned about MRP estimates being used in later analyses? Maybe. I recommend checking using fake-data simulation.

Someone sent in a question (see below). I asked if I could post the question and my reply on blog, and the person responded:

Absolutely, but please withhold my name because this is becoming a touchy issue within my department.

The boldface was in the original.

I get this a lot. There seems to be a lot of fear out there when it comes to questioning established procedures.

Anyway, here’s the question that the person sent in:

CDC has recently been using your multilevel estimation with post-stratification method to produce county, city, and census tract-level disease prevalence estimates (see https://www.cdc.gov/500cities/). The data source is the annual phone-based Behavioral Risk Factor Surveillance System (n=450k). CDC is not transparent about covariates included in the models used to construct the estimates, but as I understand it they are mostly driven by national individual-level associations between sociodemographic factors and disease prevalence. Presumably, the random effects would not influence a unit’s estimated prevalence much if the sample size from that unit is small (as is true for most cities/counties, and for many census tracts the sample size is zero).

I am wondering if you are as troubled as I am by how these estimates are being used. First, websites like County Health Rankings and City Health Dashboard are providing these estimates to the public without any disclaimer that these are not actually random samples of cities/counties/tracts and may not reflect reality. Second, and more problematically, researchers are starting to conduct ecologic studies that analyze the association, for example, between census tract socioeconomic composition and obesity prevalence (It seems quite likely that the study is actually just identifying the individual-level association between income and obesity used to produce the estimates).

I’ve now become involved in a couple of projects that are trying to analyze these estimates so it seems as though their use will increase over time. The only disclaimer that CDC provides is that the estimates shouldn’t be used to evaluate policy.

Are you more confident about the use of these estimates than I am? I am also wondering if CDC should be more explicit in disclosing their limitations to prevent misuse.

My reply:

Wow, N = 450K. That’s quite a survey. (I know my correspondent called it “n,” but when it’s this big, I think the capital letter is warranted.) And here’s the page where they mention Mister P! And they have a web interface.

I’m not quite sure why you say the website provides the estimates “without any disclaimer.” Here’s one of the displays:

It’s not the prettiest graph in the world—I’ll grant you that—but it’s clearly labeled “Model-based estimates” right at the top.

I agree with you, though, in your concern that if these model-based estimates are being used in later analyses, there’s a risk of reification, in which county or city-level predictors that are used in the model can look automatically like good predictors of the outcomes. I’d guess this would be more of a concern with rare conditions than with something like coronary heart disease where the sample size will be (unfortunately) so large.

The right thing to do next, I think, is some fake-data simulation to see how much this should be a concern. CDC has already done some checking (from their methodology page, “CDC’s internal and external validation studies confirm the strong consistency between MRP model-based SAEs and direct BRFSS survey estimates at both state and county levels.”) and I guess you could do more.

Overall, I’m positively inclined toward these MRP estimates because I’d guess it’s much better than the alternatives such as raw or weighted local averages or some sort of postprocessed analysis of weighted averages. I think those approaches would have lots more problems.

In any case, it’s cool to see my method being used by people who’ve never met me! Mister P is all grown up.

P.S. My correspondent provides further background:

The CDC generates prevalence estimates for various diseases at the county level (or smaller) by applying MRP to the national Behavioral Risk Factor Surveillance System. Unlike for other diseases, they’ve documented their methods for diabetes. Their model defines 12 population strata per county (2 races x 2 genders x 3 age groups) and incorporates random effects for stratum, county, and state. There are no other variables at any level in the model.

A number of papers use the MRP-derived data to estimate associations between, for example, PM2.5 and diabetes prevalence. Do you think this is a valid approach? Would it be valid if all of the MRP covariates are included in the model?

My response:

1. Regarding the MRP model, it is what it is. Including more demographic factors is better, but adjusting for these 12 cells per county is better than not adjusting, I’d think. One thing I do recommend is to use group-level predictors. In this case, the group is county, and lots of county-level predictors will be available that will be relevant for predicting health outcomes.

2. Regarding the postprocessing using the MRP estimates: Sure, it should be better to fold the two models together, but the two-stage approach (first use MRP to estimate prevalences, then fit another model) could work ok too, with some loss of efficiency. Again, I’d recommend using fake-data simulation to estimate the statistical properties of this approach for the problem at hand.
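
As a sketch of what that fake-data simulation might look like (a caricature, not the CDC's model): simulate county prevalences from a known model, mimic MRP-style estimates by shrinking noisy direct estimates toward a prediction that omits the exposure of interest, and then see what the second-stage regression recovers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_counties = 3000
pm25 = rng.normal(size=n_counties)    # county-level exposure of interest
income = rng.normal(size=n_counties)  # predictor used in the (pretend) small-area model

true_prev = 0.10 + 0.02 * pm25 + 0.03 * income + rng.normal(0, 0.02, n_counties)

# Pretend MRP-style estimate: a noisy direct survey estimate, shrunk toward a
# model-based prediction that uses income but not PM2.5; small counties are shrunk more.
n_sample = rng.integers(5, 500, n_counties)      # survey sample size per county
direct = true_prev + rng.normal(0, 0.3 / np.sqrt(n_sample))
w = n_sample / (n_sample + 100)                  # crude shrinkage weight
est = w * direct + (1 - w) * (0.10 + 0.03 * income)

# Second-stage regression of the estimates on PM2.5, compared to the same regression on the truth:
print(np.polyfit(pm25, est, 1)[0], np.polyfit(pm25, true_prev, 1)[0])
```

In this particular setup the estimate-based slope is attenuated relative to the true one; if the small-area model had instead included predictors correlated with the exposure, the same mechanics could inflate the association, which is the reification concern raised above.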

My footnote about global warming

At the beginning of my article, How to think scientifically about scientists’ proposals for fixing science, which we discussed yesterday, I wrote:

Science is in crisis. Any doubt about this status has surely been dispelled by the loud assurances to the contrary by various authority figures who are deeply invested in the current system . . . When leaders go to that much trouble to insist there is no problem, it’s only natural for outsiders to worry.

And at that point came a footnote, which I want to share with you here:

At this point a savvy critic might point to global-warming denialism and HIV/AIDS denialism as examples where the scientific consensus is to be trusted and where the dissidents are the crazies and the hacks. Without commenting on the specifics of these fields, I will just point out that the research leaders in those areas are not declaring a lack of crisis—far from it!—nor are they shilling for their “patterns of discovery.” Rather, the leaders in these fields have been raising the alarm for decades and have been actively pointing out inconsistencies in their theories and gaps in their understanding. Thus, I do not think that my recommendation to watch out when the experts tell you to calm down implies blanket support for dissidents in all areas of science. One’s attitude toward dissidents should depend a bit on the openness to inquiry of the establishments from which they are dissenting.

Latour Sokal NYT

Alan Sokal writes:

I don’t know whether you saw the NYT Magazine’s fawning profile of sociologist of science Bruno Latour about a month ago.

I wrote to the author, and later to the editor, to critique the gross lack of balance (and even of the most minimal fact-checking). No reply. So I posted my critique on my webpage.

From that linked page from Sokal:

The basic trouble with much of Latour’s writings—as with those of some other sociologists and philosophers of a “social constructivist” bent—is that (as Jean Bricmont and I [Sokal] pointed out already in 1997)

these texts are often ambiguous and can be read in at least two distinct ways: a “moderate” reading, which leads to claims that are either worth discussing or else true but trivial; and a “radical” reading, which leads to claims that are surprising but false. Unfortunately, the radical interpretation is often taken not only as the “correct” interpretation of the original text but also as a well-established fact (“X has shown that …”) . . .

numerous ambiguous texts that can be interpreted in two different ways: as an assertion that is true but relatively banal, or as one that is radical but manifestly false. And we cannot help thinking that, in many cases, these ambiguities are deliberate. Indeed, they offer a great advantage in intellectual battles: the radical interpretation can serve to attract relatively inexperienced listeners or readers; and if the absurdity of this version is exposed, the author can always defend himself by claiming to have been misunderstood, and retreat to the innocuous interpretation.

Sokal offers a specific example.

First, he quotes the NYT reporter who wrote:

When [Latour] presented his early findings at the first meeting of the newly established Society for Social Studies of Science, in 1976, many of his colleagues were taken aback by a series of black-and-white photographic slides depicting scientists on the job, as though they were chimpanzees. It was felt that scientists were the only ones who could speak with authority on behalf of science; there was something blasphemous about subjecting the discipline, supposedly the apex of modern society, to the kind of cold scrutiny that anthropologists traditionally reserved for “premodern” peoples.

Sokal responds:

In reality, it beggars belief to imagine that sociologists of science—whose entire raison d’être is precisely to subject the social practice of science to “cold scrutiny”—could possibly think that “scientists were the only ones who could speak with authority on behalf of science”. Did you bother to seek confirmation of this self-serving claim from anyone present at that 1976 meeting, other than Latour himself?

Sokal continues in his letter to the NYT reporter:

In the same way, you faithfully reproduce Latour’s ambiguities concerning the notion of “fact”:

It had long been taken for granted, for example, that scientific facts and entities, like cells and quarks and prions, existed “out there” in the world before they were discovered by scientists. Latour turned this notion on its head. In a series of controversial books in the 1970s and 1980s, he argued that scientific facts should instead be seen as a product of scientific inquiry. …

In your article you take for granted that Latour’s view is correct: indeed, a few paragraphs later you say that Latour showed “that scientific facts are the product of all-too-human procedures”. But, like Latour, you never explain in what sense the traditional view—that cells and quarks and prions existed “out there” in the world before they were discovered by scientists—is mistaken.

I’m with Sokal: Scientific facts are real. Their discovery, expression, and (all too often) misrepresentation are the product of human procedures, but the facts and entities exist.

As Sokal discusses, the whole thing is slippery, as can be seen even in the brief discussion excerpted above. If you give Latour’s statements a minimalist interpretation—the concepts of “cells,” “quarks,” etc. are human-constructed—there’s really no problem. Yes, the phenomena described by our concepts of cells, quarks, etc. are real and would exist even if humans had never appeared on the Earth, but one could imagine completely different ways of expressing and formulating models for these scientific facts, in forms that might look nothing like “cells” and “quarks.” Just as one can, for example, express classical mechanics with or without the concept of “force.”

And, of course, if you want to go further, there’s lots of apparent scientific facts that, it seems, are simply human-created mistakes: I’m thinking here of examples such as recent studies of ESP, himmicanes, air rage, beauty and sex ratio, etc.

So Latour’s general perspective is valuable. But Sokal argues, convincingly to me, that much of the reading of Latour, including in that news article, takes the strong view, what might be called the postmodern view, which throws the baby of replicable science out with the bathwater of contingent theories.

Sokal writes:

If Latour had really shown that scientific facts are the product of all-too-human procedures, then the critics’ charge would be unfair. But in reality Latour had not shown anything of the sort; he had simply asserted it, and many others (not cited by you) had criticized those assertions. Of course, it goes without saying that scientists’ beliefs (and assertions of alleged fact) about the external world are the product of all-too-human procedures — that is true and utterly banal. But Latour’s claims are nothing more than deliberate confusion between two senses of the word “fact” (namely, the usual one and his own idiosyncratic one). . . . muddying the distinction between facts and assertions of fact undermines our ability to think clearly about this crucial psychological/sociological/political problem.

Sokal continues with his correspondence with the New York Times (they eventually replied after he sent them several emails).

Just to be clear here, I don’t think there are any villains in this story.

Latour has a goofy view of science, and I agree with Sokal that his (Latour’s) expressions of his ideas are a bit slippery—but, hey, Latour is entitled to express his views, and you gotta give him credit for being influential. Latour’s successes must in some part be a consequence of previous gaps or at least underemphasized points in discussions of science.

The author of the NYT article, Ava Kofman, found a good story and ran with it. I agree with Sokal that she missed the point—or, to put it another way, that while she might well be doing a good job telling the story of Latour, she’s not doing a good job telling the story of Latour’s ideas. But, that’s not quite her job: even if, as the saying goes, Latour’s work “contains much that is original and much that is correct; unfortunately that which is correct is not original, and that which is original is not correct,” Kofman is not really writing about this; she’s writing more about Latour’s influence.

The ironic thing, though, is that Kofman’s article is following the standard template of feature stories about a scientist or academic, which is to treat him as a hero. If there’s one idea that Latour stands for, it’s that scientists are part of a social process, and it misses the point to routinely treat them as misunderstood geniuses.

Anyway, although I share Sokal’s annoyance that the author of an article on Latour missed key aspects of Latour’s ideas and then didn’t even reply to his thoughtful criticism, I can understand why the reporter wants to move on to her next project. In my experience, journalists are more forward-looking than academics: we worry about our past errors, they just move on. It’s a different style, perhaps deriving from the difference between traditional publication in bound volumes and publication in fishwrap.

Finally, perhaps there’s not much the NYT editors can do at this point. Newspapers, and for that matter scientific journals, rarely run corrections even of clear factual errors—at least, that’s been my experience. So I can’t blame them too much for following common practice.

Ultimately, this all comes down to questions of emphasis and interpretation. Latour has, for better or worse, expressed ideas that have been influential in the sociology of science; his story is interesting and worth a magazine article; writing a story with Latour as hero leads to some confusion about what is understood by others in that field. In that sense it’s not so different from a story in the sports or business pages that presents a contest from one side. That’s a journalistic convention, and that’s fine, and it’s also fine for someone such as Sokal who has a different perspective (one that I happen to agree with) to share that too.

As Sokal puts it:

The ironic thing is that Latour has spent his life decrying (and rightly so) the scientist-as-hero approach to presenting science to the general public; but here is an article that takes an extreme version of the same approach, albeit applied to a sociologist/philosopher rather than a scientist.

A newspaper or magazine article about a thinker should not merely be a fawning and uncritical celebration of his brilliance; it should also discuss his ideas. Indeed, this article does purport to explain and discuss Latour’s ideas, not just his personal story; but it does so in a completely uncritical way, not even letting on that there might be people who have cogent critiques of his ideas. That, it seems to me, is a gross failure of balance—and more importantly, a gross abdication of the newspaper’s mission to inform its readers about important subjects. (In this case, a subject that has serious real-world consequences.) Not to mention the gross lack of elementary fact-checking that I pointed out.

Of course, one could also question whether the “hero” mode of writing is appropriate even on the sports or business pages. This mode of writing presents a contest from one side only; and it is not very often the case in sports or business that there is in fact only one side.

So, yeah, the NYT article was not so bad as feature articles go—it told an engaging story from one particular perspective—but there was an opportunity to do better. Hence Sokal’s post, and this post linking to it.

P.S. Hey, the name Bruno Latour rings a bell . . . Unfortunately, he didn’t make it out of the first round of our seminar speaker competition.

A parable regarding changing standards on the presentation of statistical evidence

Now, the P-value Sneetches
Had tables with stars.
The Bayesian Sneetches
Had none upon thars.

Those stars weren’t so big. They were really so small.
You might think such a thing wouldn’t matter at all.

But, because they had stars, all the P-value Sneetches
Would brag, “We’re the best kind of Sneetch on the Beaches.”
With their snoots in the air, they would sniff and they’d snort
“We’ll have nothing to do with the Bayesian sort!”
And whenever they met some, when they were out walking,
They’d hike right on past them without even talking.

When the P-value children went out to play ball,
Could a Bayesian get in the game… ? Not at all.
You only could play if your tables had stars
And the Bayesian children had none upon thars.

When the P-value Sneetches had frankfurter roasts
Or picnics or parties or PNAS toasts,
They never invited the Bayesian Sneetches.
They left them out cold, in the dark of the beaches.
They kept them away. Never let them come near.
And that’s how they treated them year after year.

Then ONE day, it seems… while the Bayesian Sneetches
Were moping and doping alone on the beaches,
Just sitting there wishing their tables had stars…
A stranger zipped up in the strangest of cars!

“My friends,” he announced in a voice clear and keen,
“My name is Savage McJeffreys McBean.
And I’ve heard of your troubles. I’ve heard you’re unhappy.
But I can fix that. I’m the Fix-it-Up Chappie.
I’ve come here to help you. I have what you need.
And my prices are low. And I work at great speed.
And my work is one hundred per cent guaranteed!”

Then, quickly Savage McJeffreys McBean
Put together a Bayes Factor machine.
And he said, “You want stars like a Star-Tabled Sneetch… ?
My friends, you can have them for three dollars each!”

“Just pay me your money and hop right aboard!”
So they clambered inside. Then the big machine roared
And it klonked. And it bonked. And it jerked. And it berked
And it bopped them about. But the thing really worked!
When the Bayesian Sneetches popped out, they had stars!
They actually did. They had stars upon thars!

Then they yelled at the ones who had stars at the start,
“We’re exactly like you! You can’t tell us apart.
We’re all just the same, now, you snooty old smarties!
And now we can go to your NPR parties.”

“Good grief!” groaned the ones who had stars at the first.
“We’re still the best Sneetches and they are the worst.
But, now, how in the world will we know,” they all frowned,
“If which kind is what, or the other way round?”

Then came McBean with a very sly wink.
And he said, “Things are not quite as bad as you think.
So you don’t know who’s who. That is perfectly true.
But come with me, friends. Do you know what I’ll do?
I’ll make you, again, the best Sneetches on beaches
And all it will cost you is ten dollars eaches.”

“P-value stars are no longer in style,” said McBean.
“What you need is a trip through my Replication Machine.
This wondrous contraption will take off your stars
So you won’t look like Sneetches who have them on thars.”
And that handy machine
Working very precisely
Removed all the stars from their tables quite nicely.

Then, with snoots in the air, they paraded about
And they opened their beaks and they let out a shout,
“We know who is who! Now there isn’t a doubt.
The best kind of Sneetches are Sneetches without!”

Then, of course, those with stars all got frightfully mad.
To be wearing a star now was frightfully bad.
Then, of course, old Savage McJeffreys McBean
Invited them into his Star-Off machine.

Then, of course from THEN on, as you probably guess,
Things really got into a horrible mess.
All the rest of that day, on those wild screaming beaches,
The Fix-it-Up Chappie kept fixing up Sneetches.
Off again! On again!
In again! Out again!
Through the machines they raced round and about again,
Changing their stars every minute or two.
They kept paying money. They kept running through
Until neither the Plain nor the Star-Tables knew
Whether this one was that one… or that one was this one
Or which one was what one… or what one was who.

Then, when every last cent
Of their money was spent,
The Fix-it-Up Chappie packed up
And he went.

And he laughed as he drove
In his car up the beach,
“They never will learn.
No. You can’t teach a Sneetch!”

But McBean was quite wrong. I’m quite happy to say
That the Sneetches got really quite smart on that day,
The day they decided that Sneetches are Sneetches
And no kind of Sneetch is the best on the beaches
That day, all the Sneetches forgot about stars
And whether they had one, or not, upon thars.

[Original is on the web, for example here. I was inspired to construct the above adaptation after thinking of the succession of public advice I’ve given over the years regarding prior distributions: first we recommended uniform priors, then scaled-inverse-Wishart and Cauchy and half-Cauchy, now LKJ and normal and half-normal and horseshoe, and who knows what in the future. And I used to recommend p-values and now I don’t. It’s hard to keep up . . .]

Niall Ferguson and the perils of playing to your audience

History professor Niall Ferguson had another case of the sillies.

Back in 2012, in response to Stephen Marche’s suggestion that Ferguson was serving up political hackery because “he has to please corporations and high-net-worth individuals, the people who can pay 50 to 75K to hear him talk,” I wrote:

But I don’t think it’s just about the money. By now, Ferguson must have enough money to buy all the BMWs he could possibly want. To say that Ferguson needs another 50K is like saying that I need to publish in another scientific journal. No, I think what Ferguson is looking for (as am I, in my scholarly domain) is influence. He wants to make a difference. And one thing about being paid $50K is that you can assume that whoever is paying you really wants to hear what you have to say.

The paradox, though, as Marche notes, is that Ferguson gets and keeps the big-money audience not by telling them what he (Ferguson) wants to say—not by giving them his unique insights and understanding—but by telling them what they want to hear.

That’s what I called The Paradox of Influence.

But then, a year later, Ferguson went too far, even by his own standards, when during a talk to a bunch of richies he attributed Keynes’s economic views (I don’t actually know exactly what Keynesianism is, but I think a key part is for the government to run surpluses during economic booms and deficits during recessions) to Keynes being gay and marrying a ballerina and talking about poetry. The general idea, I think, is that people without kids don’t care so much about the future, and this motivated Keynes’s party-all-the-time attitude, which might have worked just fine for Eddie Murphy’s girl in the 1980s and in San Francisco bathhouses of the 1970s but, according to Ferguson, is not the ticket for preserving today’s American empire.

My theory on that one is not that Ferguson is a flaming homophobe or a shallow historical determinist (the expression is “piss-poor monocausal social science,” I believe) but rather that he misjudged his audience and threw them some academic frat-boy-style humor that he mistakenly thought they’d enjoy. He served them red meat, but the wrong red meat. Probably would’ve been better for him to have just preached the usual get-the-government-off-our-backs sermon and not tried to get cute by bringing up the whole ballerina thing.

Anyway, it happened again! Fergie made a fool of himself, just for trying to make some people happy.

Brian Contreras, Ada Statler, and Courtney Douglas (link from Jeet Heer via Mark Palko) report:

Leaked emails show Hoover academic conspiring with College Republicans to conduct ‘opposition research’ on student . . . “[The original Cardinal Conversations steering committee] should all be allies against O. Whatever your past differences, bury them. Unite against the SJWs. [Christos] Makridis [a fellow at Vox Clara, a Christian student publication] is especially good and will intimidate them,” Ferguson wrote. “Now we turn to the more subtle game of grinding them down on the committee. The price of liberty is eternal vigilance” . . . In the email chain, Ferguson wrote, “Some opposition research on Mr. O might also be worthwhile,” referring to Ocon.
Minshull wrote in response that he would “get on the opposition research for Mr. O.” Minshull is presently Ferguson’s research assistant . . .

It’s hard for me to imagine that Ferguson, globetrotting historian and media personality that he is, would really care so much about “grinding down” some students in a university committee. I’m guessing he was just trying to ingratiate himself with these youngsters, who I guess he views as the up-and-coming new generation of college politicians. Ferguson’s just the modern version of the stock figure, the middle-aged guy trying to talk groovy like the kids. “Some opposition research on Mr. O might also be worthwhile,” indeed. It’s the university-politics version of, ummm, I dunno, building a treehouse with some 12-year-olds, or playing hide-and-seek with a group of 4-year-olds.

The whole thing’s kinda sad in that Fergie seems so clueless. Even in the aftermath, he says, “I very much regret the publication of these emails. I also regret having written them.” Which is fine, but he still doesn’t seem to recognize the absurdity of the situation, a professor in his fifties playing student politics. As with his slurs of Keynes, the man is just a bit too eager to give his audience what he thinks they want to hear.

(pre-2000) academic historian
(2000-2005) propagandist for Anglo-American empire
(2010-2015) TV talking head and paid speaker for rich people
(2018) player in undergraduate campus politics.

At this point, he’s gotta be thinking: Could I have stopped somewhere along the way? Or was the whole trajectory inevitable? It’s a question of virtual history.

“Statistical insights into public opinion and politics” (my talk for the Columbia Data Science Society this Wed 9pm)

7pm in Fayerweather 310:

Why is it more rational to vote than to answer surveys (but it used to be the other way around)? How does this explain why we should stop overreacting to swings in the polls? How does modern polling work? What are the factors that predict election outcomes? What’s good and bad about political prediction markets? How do we measure political polarization, and what does it imply for our politics? We will discuss these and other issues in American politics and more generally how we can use data science to learn about the social world.

People can read the following articles ahead of time if they would like.

Short:
https://slate.com/news-and-politics/2018/11/midterms-blue-wave-statistics-data-analysis.html
http://www.slate.com/articles/news_and_politics/politics/2016/08/why_trump_clinton_won_t_be_a_landslide.html
https://slate.com/news-and-politics/2016/08/dont-be-fooled-by-clinton-trump-polling-bounces.html
http://www.slate.com/articles/news_and_politics/moneybox/2016/07/why_political_betting_markets_are_failing.html

Longer:
http://www.stat.columbia.edu/~gelman/research/published/what_learned_in_2016_5.pdf
http://www.stat.columbia.edu/~gelman/research/published/swingers.pdf

Bayes, statistics, and reproducibility: “Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data.”

This is an abstract I wrote for a talk I didn’t end up giving. (The conference conflicted with something else I had to do that week.) But I thought it might interest some of you, so here it is:

Bayes, statistics, and reproducibility

The two central ideas in the foundations of statistics—Bayesian inference and frequentist evaluation—both are defined in terms of replications. For a Bayesian, the replication comes in the prior distribution, which represents possible parameter values under the set of problems to which a given model might be applied; for a frequentist, the replication comes in the reference set or sampling distribution of possible data that could be seen if the data collection process were repeated. Many serious problems with statistics in practice arise from Bayesian inference that is not Bayesian enough, or frequentist evaluation that is not frequentist enough, in both cases using replication distributions that do not make scientific sense or do not reflect the actual procedures being performed on the data. We consider the implications for the replication crisis in science and discuss how scientists can do better, both in data collection and in learning from the data they have.

P.S. I wrote the above abstract in January for a conference that ended up being scheduled for October. It is now June, and this post is scheduled for December. There’s no real rush, I guess; this topic is perennially of interest.

P.P.S. In writing Bayesian “inference” and frequentist “evaluation,” I’m following Rubin’s dictum that Bayes is one way among many to do inference and make predictions from data, and frequentism refers to any method of evaluating statistical procedures using their modeled long-run frequency properties. Thus, Bayes and freq are not competing, despite what you often hear. Rather, Bayes can be a useful way of coming up with statistical procedures, which you can then evaluate under various assumptions.

Both Bayes and freq are based on models. The model in Bayes is obvious: It’s the data model and the prior or population model for the parameters. The model in freq is what you use to get those long-run frequency properties. Frequentist statistics is not based on empirical frequencies: that’s called external validation. All the frequentist stuff—bias, variance, coverage, mean squared error, etc.—requires some model or reference set.

And that last paragraph is what I’m talkin bout, how Bayes and freq are two ways of looking at the same problem. After all, Bayesian inference has ideal frequency properties—if you do these evaluations, averaging over the prior and data distributions you used in your model fitting. The frequency properties of Bayesian (or other) inference when the model is wrong—or, mathematically speaking, when you want to average over a joint distribution that’s not the same as the one in your inferential model—that’s another question entirely. That’s one thing that makes frequency evaluation interesting and challenging. If we knew all our models were correct, statistics would simply be a branch of probability theory, hence a branch of mathematics, and nothing more.
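To make that calibration claim concrete, here is a minimal simulation sketch of my own (not something from the abstract), using a conjugate normal-normal model; the function name and the particular settings are illustrative assumptions. Central 90% posterior intervals cover the true parameter at the nominal rate when we average over the same prior used in fitting, and undercover when the data actually come from a wider distribution:

import numpy as np

rng = np.random.default_rng(0)

def interval_coverage(sd_true, sd_model, sigma=1.0, n_sims=20000):
    # theta is drawn from N(0, sd_true); the analyst's model assumes N(0, sd_model).
    theta = rng.normal(0.0, sd_true, n_sims)
    y = rng.normal(theta, sigma)                     # one observation per theta
    post_var = 1.0 / (1.0 / sd_model**2 + 1.0 / sigma**2)
    post_mean = post_var * y / sigma**2              # conjugate normal posterior
    post_sd = np.sqrt(post_var)
    lo = post_mean - 1.645 * post_sd                 # central 90% posterior interval
    hi = post_mean + 1.645 * post_sd
    return np.mean((lo < theta) & (theta < hi))

# Averaging over the same prior and data model used in fitting: nominal coverage.
print(interval_coverage(sd_true=1.0, sd_model=1.0))  # approximately 0.90
# Data generated from a wider distribution than the modeled prior: coverage drops.
print(interval_coverage(sd_true=3.0, sd_model=1.0))  # well below 0.90

Either way, the frequency properties are computed with respect to some assumed reference distribution; change that distribution and the evaluation changes.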

OK, that was kinda long for a P.P.S. It felt good to write it all down, though.

My talk tomorrow (Tues) noon at the Princeton University Psychology Department

Integrating collection, analysis, and interpretation of data in social and behavioral research

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The replication crisis has made us increasingly aware of the flaws of conventional statistical reasoning based on hypothesis testing. The problem is not just a technical issue with p-values, nor can it be solved using preregistration or other purely procedural approaches. Rather, appropriate solutions have three aspects. First, in collecting your data there should be a concordance between theory and measurement: for example, in studying the effect of an intervention applied to individuals, you should measure within-person comparisons. Second, in analyzing your data, you should study all comparisons of potential interest, rather than selecting based on statistical significance or other inherently noisy measures. Third, you should interpret your results in the context of theory, background knowledge, and the data collection and analysis you have performed. We discuss these issues on a theoretical level and with examples in psychology, political science, and policy analysis.
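To illustrate the second point about selecting on statistical significance, here is a small simulation sketch of my own (the numbers are made-up illustrative assumptions, not results from the talk): when a small true effect is estimated with noise and only the statistically significant estimates get reported, the reported estimates exaggerate the true effect severalfold.

import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.2      # assumed small true effect (illustrative value)
se = 1.0               # standard error of each study-level estimate
n_sims = 100000

estimates = rng.normal(true_effect, se, n_sims)    # noisy estimates from many studies
significant = np.abs(estimates) > 1.96 * se        # keep only two-sided p < .05 results

print(f"mean of all estimates: {estimates.mean():.2f}")  # near 0.20
print(f"mean |estimate| among significant: {np.abs(estimates[significant]).mean():.2f}")  # much larger
# Selecting on significance, rather than looking at all comparisons, builds in a
# systematic overestimate (a "type M" or exaggeration error).

The same logic applies whatever the noisy filter is, which is why the abstract recommends studying all comparisons of potential interest rather than only the ones that happen to clear a threshold.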

Here are some relevant references:

Some natural solutions to the p-value communication problem—and why they won’t work.

Honesty and transparency are not enough.

The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective.

And this:

No guru, no method, no teacher, Just you and I and nature . . . in the garden. Of forking paths.

The talk will be Tuesday, December 4, 2018, 12:00pm, in A32 Peretsman Scully Hall.