Ptolemaic inference

Posted on October 24, 2016 7:26 PM by Andrew

OK, we’ve been seeing this a lot recently. A psychology study gets published, with a key idea that at first seems wacky but, upon closer reflection, could very well be true!

Examples:

– That “dentist named Dennis” paper suggesting that people pick where they live and what job to take based on their names.

– Power pose: at first it sounds ridiculous that you could boost your hormones and have success just by holding your body differently. But, sure, think about it some more and it could be possible.

– Ovulation and voting: do your political preferences change this much based on the time of the month? OK, this one seems ridiculous even upon reflection, but that’s just because I’ve seen a lot of polling data. To an outsider, sure, it seems possible, everybody knows voters are irrational.

– Embodied cognition: as Daniel Kahneman memorably put it, “When I describe priming studies to audiences, the reaction is often disbelief . . . The idea you should focus on, however, is that disbelief is not an option.”

– And lots more: himmicanes, air rage, beauty and sex ratio, football games and elections, subliminal smiley faces and attitudes toward immigration, etc. Each of these seems at first to be a bit of a stretch, but, upon reflection, could be real. Maybe people really do react differently to hurricanes with boy or girl names! And so on.

These examples all have the following features:

1. The claimed phenomenon is some sort of bank shot, an indirect effect without a clear mechanism.

2. Still, the effect seems like it could be true; the indirect mechanism seems vaguely plausible.

3. The exact opposite effect is also plausible. One could easily imagine people avoiding careers that sound like their names, or voting in the opposite way during that time of the month, or responding to elderly-themed words by running faster, or reacting with more alacrity to female-named hurricanes, and so on.

Item 3 is not always mentioned but it’s a natural consequence of items 1 and 2. The very vagueness of the mechanisms, which allows plausibility, also allows plausibility for virtually any interaction and for effects of virtually any sign. Which is why sociologist Jeremy Freese so memorably described these theories as “vampirical rather than empirical—unable to be killed by mere evidence.”

Think about it: If A is plausible, and not-A is plausible, and if the garden of forking paths and researcher degrees of freedom allow you to get statistical significance from just about any dataset, you can’t lose.
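
To make the "you can't lose" point concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than a reconstruction of any particular study: the outcome is pure noise, the treatment is assigned by coin flip, and the hypothetical researcher looks at the main effect plus the effect within each level of a handful of irrelevant binary moderators, counting a result of either sign as a finding. The cutoff of 1.96 relies on a normal approximation, and the function names are made up for this example.

```python
import math
import random
import statistics


def welch_z(a, b):
    """Difference in means divided by its standard error (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def any_significant(rng, n=200, n_moderators=5, threshold=1.96):
    """Simulate one pure-noise study and try several analysis paths on it."""
    treated = [rng.random() < 0.5 for _ in range(n)]
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true effect anywhere
    moderators = [[rng.random() < 0.5 for _ in range(n)] for _ in range(n_moderators)]

    # First path: everyone. Then: each level of each (irrelevant) moderator.
    subsets = [[True] * n]
    for m in moderators:
        subsets.append(m)
        subsets.append([not x for x in m])

    for keep in subsets:
        a = [y for y, t, k in zip(outcome, treated, keep) if k and t]
        b = [y for y, t, k in zip(outcome, treated, keep) if k and not t]
        if len(a) < 2 or len(b) < 2:
            continue  # too few observations to estimate a variance
        # Either sign counts as a "finding", since A and not-A are both plausible.
        if abs(welch_z(a, b)) > threshold:
            return True
    return False


def false_positive_rate(n_studies=1000, seed=1, **kwargs):
    """Share of pure-noise studies with at least one 'significant' comparison."""
    rng = random.Random(seed)
    hits = sum(any_significant(rng, **kwargs) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    print("one preregistered test:", false_positive_rate(n_moderators=0))
    print("main effect + 5 moderators:", false_positive_rate(n_moderators=5))
```

With no moderators the rate should sit near the nominal 5 percent; with the extra paths it should come out several times higher, even though nothing in the data is real. Actual forking paths are data-dependent choices rather than an explicit search like this one, so treat the sketch as a rough stand-in for the idea rather than a model of how any given paper was analyzed.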

Enter Ptolemy

But we’ve discussed all that before, many times. What I want to talk about today is how many of these stories proceed. It goes like this:

– Original paper gets published and publicized. A stunning counterintuitive finding.

– A literature develops. Conceptual replications galore. Each new study finds a new interaction. The flexible nature of scientific discovery, along with the requirement by journals of (a) originality and (b) p less than .05, in practice requires that each new study is a bit different from everything that came before. From the standpoint of scientific replication, this is a minus, but from the standpoint of producing a scientific literature, it’s a plus. A paper that is nothing but noise mining can get thousands of citations:

[Screenshot omitted]

– The literature is criticized on methodological grounds, and some attempted replications fail.

Now here’s where Ptolemy comes in. There is, sometimes, an attempt to square the circle, to resolve the apparent contradiction between the original seemingly successful study, the literature of seemingly successful conceptual replications, and the new, discouraging, failed replications.

I say “apparent contradiction” because this pattern of results is typically consistent with a story in which the true effect is zero, or is so highly variable and situation-dependent as to be undetectable, and in which the original study and the literature of apparent successes are merely testimony to the effectiveness of p-hacking or the garden of forking paths to uncover statistically significant comparisons in the presence of researcher degrees of freedom.

But there is this other story that often gets told, which is that effects are contextually dependent in a particular way, a way which preserves the validity of all the published results while explaining away the failed replications as just being done wrong, as not true replications.

This was what the power pose authors said about the unsuccessful replication performed by Ranehill et al., and this is what Gilbert et al. said about the entire replication project. (And see here for Nosek et al.’s compelling (to me) criticism of Gilbert et al.’s argument.)

I call this reasoning Ptolemaic because it’s an attempt to explain an entire pattern of data with an elaborate system of invisible mechanisms. On days where you’re more fertile you’re more likely to wear red. Unless it’s a cold day, then it doesn’t happen. Or maybe it’s not the most fertile days, maybe it’s the days that precede maximum fertility. Or, when you’re ovulating you’re more likely to vote for Barack Obama. Unless you’re married, then ovulation makes you more likely to support Mitt Romney. Or, in the words of explainer-in-chief John Bargh, “Both articles found the effect but with moderation by a second factor: Hull et al. 2002 showed the effect mainly for individuals high in self consciousness, and Cesario et al. 2006 showed the effect mainly for individuals who like (versus dislike) the elderly.”

It’s all possible but this sort of interpretation of the data is a sort of slalom that weaves back and forth in order to be consistent with every published claim. Which would be fine if the published claims were deterministic truths, but in fact they’re noisy and selected data summaries. It’s classic overfitting.
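
A rough sketch of the overfitting, under deliberately stark assumptions: the true effect is exactly zero, every estimate is measured in standard-error units, and only estimates beyond 1.96 in absolute value get published. The "literal" reading stands in for any story that preserves each published estimate as a truth about its own context; the "null" reading predicts zero everywhere. The function name and setup are hypothetical, chosen only for illustration.

```python
import random
import statistics


def simulate(n_studies=20000, threshold=1.96, seed=2):
    """Compare two readings of a selected literature by how well each
    predicts an exact replication, when the true effect is zero."""
    rng = random.Random(seed)
    null_errors = []     # squared error of predicting zero
    literal_errors = []  # squared error of predicting the published estimate
    for _ in range(n_studies):
        original = rng.gauss(0.0, 1.0)  # estimate in standard-error units
        if abs(original) <= threshold:
            continue  # not "significant", so it never enters the literature
        replication = rng.gauss(0.0, 1.0)  # independent rerun of the same study
        null_errors.append((replication - 0.0) ** 2)
        literal_errors.append((replication - original) ** 2)
    return statistics.fmean(null_errors), statistics.fmean(literal_errors), len(null_errors)


if __name__ == "__main__":
    null_mse, literal_mse, n_published = simulate()
    print("published studies:", n_published)
    print("mean squared error, null reading:", round(null_mse, 2))
    print("mean squared error, literal reading:", round(literal_mse, 2))
```

The literal reading fits the published record perfectly and predicts new data much worse than the boring reading does. That gap is the cost of treating noisy, selected summaries as if they were fixed facts to be explained.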

Look. I’m not some sort of Occam fundamentalist. Maybe these effects are real. But in any case you should take account of all these sources of random and systematic error and recognize that, once you open that door, you have to allow for the possibility that these effects are real, and go in the opposite direction as claimed. You have to allow for the very real possibility that power pose hurts people, that Cornell students have negative ESP, that hurricanes with boys’ names create more damage, and so forth. Own your model.

Remember: the ultimate goal is to describe reality, not to explain away a bunch of published papers in a way that will cause the least offense to their authors and their supporters in the academy and the news media.

16 thoughts on “Ptolemaic inference”

jrc on October 24, 2016 at 7:49 pm said:

To continue our (burgeoning) tradition of defending Ptolemy against comparisons to contemporary small-N social experimenters:

“I call this reasoning Ptolemaic because it’s an attempt to explain an entire pattern of data with an elaborate system of invisible mechanisms.” You mean like Gravity, or Quantum Physics, or String Theory?

I guess my point is just that Ptolemy was wrong, but empirically and technologically useful. I think much of the worst social science research these days is both wrong and useless (perhaps even detrimental) to society. No comment on the state of contemporary (meta-)Physics, but it has allowed us some space travel, given us an ability to estimate the age of the universe, offered a glimpse into the recesses of our Galaxy, and spurred us to radically re-conceive our own place in the Universe. So it has that going for it, which is nice.

- Jake Argent on October 24, 2016 at 8:23 pm said:
  
  @jrc: I think it better to interpret “invisible mechanisms” as those mechanisms that don’t necessarily lead to an observable difference when True, as opposed to False. In that case both Gravity and Quantum Physics have “visible mechanisms”.
  
  Non-Quantum Gravity has lots of precise predictions. QM has these fundamentally unobservable entities called wavefunctions, sure. (I prefer the name state functions, but whatever.) And yet, these entities lead to stark differences if they are true, versus not. Like *all* the cool Quantum effects.
  
  String Theory… I will not defend. I find it indefensible. It’s lots of math-work without a clear physical theory around it, but is prospering (or maybe was) mostly due to PR work done by big name theoretical physicists.
  
  It may even be the case that String Theory is closer to the accustomed useless humanities research, not in the form of data analysis (String Theory has none) but in the form of not being able to put forth clear predictions with point-wise accuracy. This point, BTW, is underlined by a 1967 paper by Meehl (or Meeehl..?) in a critique about the statistical approaches used in psychology in his day.
  
  I’m quite sure that similar shoddy statistical techniques could net “significant” results for String Theory as well. Maybe they should learn a thing or two from the squishier sciences. (*Oh God, please not, not really*)
  
  - Seth M. Spain on October 25, 2016 at 9:43 am said:
    
    Perhaps we should say “nonidentifiable” mechanisms, instead. That is, if the hypothesis *can* potentially explain effects in any given direction, a priori, well…that sounds kind of like the definition of not being identifiable. It doesn’t matter how much data we have, the data don’t constrain our knowledge of the parameters (hypotheses).
    
    - Jake Argent on October 26, 2016 at 6:51 pm said:
      
      Yeah, exactly that. If the hypothesis is that *anything* could happen, then anything *could* happen and with enough tries, will happen.

Jonathan on October 24, 2016 at 9:31 pm said:

Would it be slightly more accurate to say the effects are real when they occur but their pattern of occurrence matches that of noise?

Daniel Hawkins on October 24, 2016 at 11:59 pm said:

You’ve mentioned Lakatos in the past, so I know you’re aware of him, but he had the clearest way of describing this category of phenomena—they operate within degenerative research programs. The ad hoc fixes needed to defend the theories don’t yield novel facts or additional explanatory power. This is contrasted with progressive programs, where the ad hoc fix needed to explain an apparent anomaly yields new insights, confirmed by unanticipated empirical facts, and spawning new auxiliary theories with greater explanatory power.

Degenerative programs can become progressive, and vice versa, so there’s nothing wrong with pursuing a degenerative program per se (e.g. string theory, which could arguably be described as degenerative at present). It only becomes a problem when researchers and institutions erroneously conflate resolution of anomalies with increased explanatory power. Explaining anomalies is a necessary, but insufficient condition for a progressive research program.

Thomas on October 25, 2016 at 2:06 am said:

“[Y]ou have to allow for the possibility that these effects are real, and go in the opposite direction as claimed.” Yes! And if there is any real mechanism, you have to allow for the possibility that your recommendations for dealing with it come at a disproportionate cost.

That’s the point that Gilbert Welch essentially makes with the treatment of hypertension. If you medicate mild hypertension, there is some benefit, but more harm. That’s precisely because the relationship between blood pressure and health is real.

“Own your model.” I like that slogan.

Phil on October 25, 2016 at 2:19 am said:

Speaking specifically of “priming,” I think there’s an additional effect: some kinds of priming really DO have an effect, even a big one.

I remember a stupid joke/trick from when I was a kid: you’d ask a string of questions like “what’s the opposite of ‘least’?” (most); “what kind of supernatural creature is Casper?” (ghost); “What do you call a party where everyone pokes fun at the guest of honor?” (roast); and finally “what do you put in a toaster?” And the kid says “toast.” And then you say “No, you put bread in a toaster. Toast is what comes _out_ of the toaster.” Ha! Such laffs like you wouldn’t believe! But: I’m pretty sure you had to ask the lead-up questions first. If you start with “What do you put in a toaster”, most people won’t say “toast.” But prime them with a series of questions whose answers all rhyme with “toast” and you get “toast” as the answer. (Given my age, this joke was an application of cutting-edge psychology: The Wikipedia article on priming says it was documented in the 1970s that “people were faster in deciding that a string of letters is a word when the word followed an associatively or semantically related word. For example, NURSE is recognized more quickly following DOCTOR than following BREAD.”)

You can see a sort of slippery slope in which people start by testing easily believable examples of priming such as those above, and then gradually slide down a slope towards lower signal/noise ratio until eventually they’re just mining the noise, if that is indeed what is happening.

Also, you probably know this but maybe not: In an open letter four years ago, Kahneman said to people who study priming, “right or wrong, your field is now the poster child for doubts about the integrity of psychological research” and has urged an experimental program to “examine the replicability of priming results, following a protocol that avoids the questions that have been raised and guarantees credibility among colleagues outside the field.” So maybe you are being a bit hard on him.

- Thomas on October 25, 2016 at 2:46 am said:
  
  I would think that the ease with which we recognize NURSE after DOCTOR correlates with the likelihood of misrecognizing NORSE as NURSE after DOCTOR. (Conversely, NURSE will be misrecognized as NORSE more often after VIKING than after DOCTOR.) That’s if you tell the research subject to do it as quickly as possible. Everything changes when we tell the subject to be as accurate as possible.
  
  When we don’t have time, we rely on our biases. When we do have time, we don’t. This is what I find distressing about “implicit bias” research. It has founded a political movement to eradicate bias in our thinking, just as “priming” research seems to be mainly used to manipulate people. But in both cases the answer is simple: think! Don’t just “blink” as Gladwell put it; don’t “think without thinking”. Implicit bias and priming are effects that largely disappear under conditions where much better reasons for action are made available. After that, any efforts to further remove bias are much more trouble than they’re worth.
  
  There is too much research in psychology these days that tries to convince us that we aren’t as rational as we think. The truth, of course, is that we are exactly as rational as we think: we are rational to the extent that we undertake to think.
  
  - Seth M. Spain on October 25, 2016 at 9:59 am said:
    
    +1
    
    Mostly. I think there are some traps people can fall into that really *do* fall into the “motivated cognition” I mention below. People are capable of engaging in all sorts of rationalizations for their prejudices/biases/uninformed opinions. Especially nowadays, it is relatively easy to seek out “facts” that support your pre-existing beliefs, and it’s even easier to simply “reason” your way back to your original position.
    
    While I largely agree with you about implicit bias research, or, moreover, I think that *explicit* bias, while declining at large, is still a bigger problem. That is, a person with explicit beliefs consistent with unbiasedness will try to *act* in a way that is unbiased, in spite of implicit biases. At least when they’re acting *intentionally*. That doesn’t mean that implicit biases are absolutely not a problem, but that they might be more…marginal? Okay…I’m done. For real.
    
    - Thomas on October 26, 2016 at 8:46 am said:
      
      I think we agree about this. There are some (it seems to me) who want to adjust our implicit biases so that they “bend towards justice,” as MLK put it. For some people this simply means getting rid of the relevant bias, for others it might mean changing its sign. But I think this sort of program, directed at “the hearts and minds” of whole populations, is unwise. I simply don’t think it can be done delicately enough not to do a bunch of harm in the process.
      
      Your suggestion, as I understand it, is the right one. Since our explicit biases (or lack thereof) are much stronger than the implicit ones, to avoid implicit bias in your thinking just make your reasoning explicit. The best corrective to unconscious bias is simply consciousness.

- Seth M. Spain on October 25, 2016 at 9:53 am said:
  
  Lexical priming seems to be real, and there are really sound theoretical explanations for it – if concepts are stored in network form (i.e., neural networks, though much of the theory seems to be built on more abstract/higher-order semantic/associative networks), concepts that *share* components become slightly activated whenever related concepts are activated/used. In the business, they (at least used to) call this “spreading activation”. It’s pretty demonstrable, and is one of the more robust findings in psychology. Then again, it comes out of proper cognitive psychology, which is rarely criticized for being unscientific (as a discipline – I’m sure we can dredge up individual studies; and it has its problems at the theoretical level, like the associative networks mentioned above or the use of artificial neural networks for some modeling that bear only the sketchiest of relationships to biological neural networks).
  
  On the other hand, when social psychology imported these concepts in “hot” or “motivated” cognition, giving us social cognition as a sub-field, well, there are mixed results. On the plus side, this is where a lot of the work on heuristics/biases/bounded rationality comes from, but on the minus side, you get all this stuff on subliminal and social priming. Now, there are some more social priming effects that make a kind of sense (i.e., there’s some work using affective primes, but emotions do have a substantial cognitive component about “labeling your feelings”, so…anyway, I’ll stop there), and then there’s “old words make you walk out the lab slowly,” which…just don’t seem to hold up at all…
  
Peter on October 28, 2016 at 5:28 am said:

Beyond the statistical problem, and the unnecessary entities, I’d say there is a theoretical problem. Let’s admit, for argument’s sake, that the findings are real (priming works, but only if done by skilled experimenters who dress in blue, between 3 and 5 P.M., if the room is big enough, subjects have names beginning with G and are Democrats) and sufficiently backed from a statistical point of view.
Even if such an effect were real, it would also be so narrow as to be completely uninteresting. The promise of priming is that of a powerful, general mechanism that subconsciously alters our behaviour; what we have in the end, after ignoring obvious statistical problems, is a small cog so obscure as to never actually be turned in real life.

- Martha (Smith) on October 28, 2016 at 4:41 pm said:
  
  +1
  
Chris G on November 1, 2016 at 8:33 pm said:

> responding to elderly-themed words by running faster…

Sure. Back when I used to run 10k’s I’d have a tape loop going on my Walkman where I’d alternately scream “Incontinence!!!” and “Alzheimer’s!!!” to myself. Got me down to 7 minutes/mile from 8. Don’t most runners do that?

More seriously…

>Look. I’m not some sort of Occam fundamentalist.

No reason to be. Write an equation, make a prediction, compare your prediction against data, use the deviations between observation and prediction to refine your model of the world. I generally prefer “stiff” models but that’s rooted in an innate fear of overfitting more than anything else. If two models fit data equally well then there’s no objective basis for choosing one over the other, is there?

- Chris G on November 1, 2016 at 10:40 pm said:
  
  For example, consider weather forecasting and climate modeling. There are multiple operational models in use in both fields. Predictions are taken more seriously when there’s consensus within the ensemble. Occam’s Razor isn’t a factor.
  
  Related reading – Michael Behar, Why Isn’t the US Better at Predicting Extreme Weather? – http://www.nytimes.com/2016/10/23/magazine/why-isnt-the-us-better-at-predicting-extreme-weather.html
  
Statistical Modeling, Causal Inference, and Social Science
