Burn-in for MCMC: why we prefer the term warm-up

Here’s what we say on p.282 of BDA3:

In the simulation literature (including earlier editions of this book), the warm-up period is called burn-in, a term we now avoid because we feel it draws a misleading analogy to industrial processes in which products are stressed in order to reveal defects. We prefer the term ‘warm-up’ to describe the early phase of the simulations in which the sequences get closer to the mass of the distribution.

Stan does adaptation during the warm-up phase, tuning the sampler's step size and mass matrix before collecting the draws that are actually used for inference.
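
For concreteness, here is a minimal sketch of controlling the warm-up phase from Python using CmdStanPy. The model file and data below are placeholders for illustration, not anything from BDA3:

```python
# Minimal sketch using CmdStanPy (assumes cmdstanpy and CmdStan are installed;
# "bernoulli.stan" and the data below are placeholders for illustration).
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="bernoulli.stan")

# Stan tunes its sampler (step size, mass matrix) during warm-up; by default
# the warm-up draws are discarded and only the sampling draws are returned.
fit = model.sample(
    data={"N": 10, "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]},
    chains=4,
    iter_warmup=1000,    # the warm-up phase, what the older literature called burn-in
    iter_sampling=1000,  # post-warm-up draws used for inference
)
print(fit.summary())
```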

The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself

A fundamental tenet of social psychology and behavioral economics, at least as presented in the news media and as taught and practiced in many business schools, is that small “nudges,” often the sorts of things that we might not think would affect us at all, can have big effects on behavior. Thus the claims that elections are decided by college football games and shark attacks, or that the subliminal flash of a smiley face can cause huge changes in attitudes toward immigration, or that single women were 20% more likely to vote for Barack Obama, or three times more likely to wear red clothing, during certain times of the month, or that standing in a certain position for two minutes can increase your power, or that being subliminally primed with certain words can make you walk faster or slower, etc.

The model of the world underlying these claims is not just the “butterfly effect” that small changes can have big effects; rather, it’s that small changes can have big and predictable effects. It’s what I sometimes call the “button-pushing” model of social science, the idea that if you do X, you can expect to see Y. Indeed, we sometimes see the attitude that the treatment should work every time, so much so that any variation is explained away with its own story.

In response to this attitude, I sometimes present the “piranha argument,” which goes as follows: There can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data.

The analogy is to a fish tank full of piranhas: it won’t take long before they eat each other.
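One way to make this quantitative (my gloss here, under the strong simplifying assumption that the candidate causes are mutually uncorrelated): standardize everything, and Bessel's inequality caps the total correlation available to go around.

```latex
% Suppose the outcome y and candidate causes x_1, ..., x_n each have mean 0
% and variance 1, and the x_i are mutually uncorrelated. By Bessel's
% inequality,
\[
  \sum_{i=1}^{n} \operatorname{corr}(x_i, y)^2 \;\le\; 1 ,
\]
% so at most 1/\rho^2 of the x_i can satisfy |corr(x_i, y)| >= \rho.
% For example, at most 25 mutually uncorrelated variables can each
% correlate with the same behavior at 0.2 or more.
```

Correlation among the causes changes the arithmetic but not the moral: there is only so much variation in any behavior for large, consistent effects to claim.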

An example

I recently came across an old post that makes the piranha argument pretty clearly with an example. I was talking about a published paper from 2013, “‘Black and White’ thinking: Visual contrast polarizes moral judgment,” which falls into the embodied-cognition category. The claim in that paper was that “incidental visual cues without any affective connotation can similarly shape moral judgment by priming a certain mindset . . . exposure to an incidental black and white visual contrast leads people to think in a ‘black and white’ manner, as indicated by more extreme moral judgments.”

The study had the usual statistical problem of forking paths, so I don’t think it makes sense to take its empirical claims seriously. But that’s not where I want to go today. Rather, my point here is the weakness of the underlying theory, in light of the many, many other possible stories that have been advanced to explain attitudes and behavior.

Here’s what I wrote:

I don’t know whether to trust this claim, in light of the equally well-documented finding, “Feeling Blue and Seeing Blue: Sadness May Impair Color Perception.” Couldn’t the Zarkadi and Schnall result be explained by an interaction between sadness and moral attitudes? It could go like this: Sadder people have difficulty with color perception, so they are less sensitive to the different backgrounds in the images in question. Or maybe it goes the other way: sadder people have difficulty with color perception, so they are more sensitive to black-and-white patterns.

I’m also worried about possible interactions with day of the month for female participants, given the equally well-documented findings correlating cycle time with political attitudes and—uh oh!—color preferences. Again, these factors could easily interact with perceptions of colors and also moral judgment.

What a fun game! Anyone can play.

Hey—here’s another one. I have difficulty interpreting this published finding in light of the equally well-documented finding that college students have ESP. Given Zarkadi and Schnall’s expectations as stated in their paper, isn’t it possible that the participants in their study simply read their minds? That would seem to be the most parsimonious explanation of the observed effect.

Another possibility is the equally well-documented himmicanes and hurricanes effect—I could well imagine something similar with black-and-white or color patterns.

But I’ve saved the best explanation for last.

We can most easily understand the effect discovered by Zarkadi and Schnall in the context of the well-known smiley-face effect. If a cartoon smiley face flashed for a fraction of a second can create huge changes in attitudes, it stands to reason that a chessboard pattern can have large effects too. The game of chess, after all, was invented in Persia, and so it makes sense that being primed by a chessboard will make participants think of Iran, which in turn will polarize their thinking, with liberals and conservatives scurrying to their opposite corners. In contrast, a blank pattern or a colored grid will not trigger these chess associations.

Aha, you might say: chess may well have originated in Persia but now it’s associated with Russia. But that just bolsters my point! An association with Russia will again remind younger voters of scary Putin and bring up Cold War memories for the oldsters in the house: either way, polarization here we come.

In a world in which merely being primed with elderly-related words such as “Florida” and “bingo” causes college students to walk more slowly (remember, Daniel Kahneman told us “You have no choice but to accept that the major conclusions of these studies are true”), it is no surprise that being primed with a chessboard can polarize us.

I can already anticipate the response to the preregistered replication that fails: There is an interaction with the weather. Or with relationship status. Or with parents’ socioeconomic status. Or, there was a crucial aspect of the treatment that was buried in the 17th paragraph of the published paper but turns out to be absolutely necessary for this phenomenon to appear.

Or . . . hey, I have a good one: The recent nuclear accord with Iran and rapprochement with Russia over ISIS has reduced tension with those two chess-related countries, so this would explain a lack of replication in a future experiment.

I wrote the above in a silly way but my point is real: Once you accept that all these large effects are out there, it becomes essentially impossible to interpret any claim—even one from experimental data—because it can also be explained as an interaction of two previously identified large effects.

A randomized experiment is not enough

Under the button-pushing model of science, there’s nothing better than a randomized experiment: it’s the gold standard! Really, though, there are two big problems with the sort of experimental data described above:

1. Measurement error. When measurements are noisy and biased, any patterns you see will not in general replicate—that is, type M and type S errors will be large (see the simulation sketch after this list). Meanwhile, forking paths give researchers the illusion of success, over and over again, and enablers such as the editors of PNAS keep this work in the public eye.

2. Interactions. Even if you do unequivocally establish a treatment effect from your data, the estimate applies only to the population and scenario under study: psychology students at university X in May, 2017; or Mechanical Turk participants in May, 2017, asked about topic Y; etc. And in the “tank full of piranhas” context where just about anything can have a large effect—from various literatures, there’s menstrual cycle, birth order, attractiveness of parents, lighting in the room, subliminal smiley faces, recent college football games, parents’ socioeconomic status, outdoor temperature, names of hurricanes, the grid pattern on the edge of the survey form, ESP, the demographic characteristics of the experimenter, and priming on just about any possible stimulus. In this piranha-filled world, the estimate from any particular experiment is pretty much uninterpretable.
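
To illustrate point 1, here is a minimal simulation sketch (the numbers are mine, chosen for illustration): a small true effect measured with a large standard error, looking only at the estimates that clear the significance threshold.

```python
# Sketch of type M (magnitude) and type S (sign) errors under noisy
# measurement; the effect size and standard error below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1   # small true effect
se = 1.0            # large standard error from a noisy measurement
n_sims = 100_000

estimates = rng.normal(true_effect, se, n_sims)
significant = np.abs(estimates) > 1.96 * se   # the draws that get published

sig_est = estimates[significant]
type_m = np.mean(np.abs(sig_est)) / true_effect             # exaggeration ratio
type_s = np.mean(np.sign(sig_est) != np.sign(true_effect))  # wrong-sign rate

print(f"power        ~ {significant.mean():.3f}")
print(f"type M ratio ~ {type_m:.0f}")
print(f"type S rate  ~ {type_s:.2f}")
```

With these numbers, a statistically significant estimate exaggerates the true effect by roughly a factor of 20 and has the wrong sign over a third of the time, which is exactly why such patterns fail to replicate.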

To put it another way: if you do one of these experiments and find a statistically significant pattern, it’s not enough for you to defend your own theory. You also have to make the case that just about everything else in the social psychology / behavioral economics literature is wrong. Cos otherwise your findings don’t generalize. But we don’t typically see authors of this sort of paper disputing the rest of the field: they all seem happy thinking of all this work as respectable.

I put this post in the Multilevel Modeling category because ultimately I think we should think about all these effects, or potential effects, in context. All sorts of confusion arise when thinking about them one step at a time. For just one more example, consider ovulation-and-voting and smiley-faces-and-political-attitudes. Your time of the month could affect your attention span and your ability to notice or react to subliminal stimuli. Thus, any ovulation-and-voting effect could be explained merely as an interaction with subliminal images on TV ads, for example. And any smiley-faces-and-political-attitudes effect could, conversely, be explained as an interaction with changes in attitudes during the monthly cycle. I don’t believe any of these stories; my point is that if you really buy into the large-predictable-effects framework of social psychology, then it does not make sense to analyze these experiments in isolation.
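
As a toy version of that multilevel perspective (my illustration, with the between-effect spread treated as known for simplicity), compare analyzing fifty noisy effects one at a time against partially pooling them:

```python
# Sketch: many small true effects, each estimated with noise. One-at-a-time
# analysis finds "large" effects; a multilevel view shrinks them all.
import numpy as np

rng = np.random.default_rng(1)
K = 50
tau = 0.1     # true spread of the effects: mostly tiny (illustrative)
sigma = 1.0   # per-study standard error

theta = rng.normal(0.0, tau, K)    # true effects
est = rng.normal(theta, sigma)     # noisy one-at-a-time estimates

# Separate analyses: count apparently large ("significant") effects.
print("apparently large effects:", int(np.sum(np.abs(est) > 1.96 * sigma)))

# Partial pooling (normal-normal model, tau known): shrink each estimate
# toward zero by the factor tau^2 / (tau^2 + sigma^2).
pooled = est * tau**2 / (tau**2 + sigma**2)
print(f"largest partially pooled estimate: {np.abs(pooled).max():.3f}")
```

The separate analyses turn up a few impressive-looking effects; the joint analysis, which treats them as draws from a common population, correctly concludes that they are all tiny.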

P.S. Just to clarify: I don’t think all effects are zero. We inherit much of our political views from our parents; there’s also good evidence that political attitudes and voting behavior are affected by economic performance, candidate characteristics, and the convenience of voter registration. People can be persuaded by political campaigns to change their opinions, and attitudes are affected by events, as we’ve seen with the up-and-down attitudes on health care reform in recent decades. The aquarium’s not empty. It’s just not filled with piranhas.