His concern is that the authors don’t control for the position of games within a season.

Chris Glynn wrote last year:

I read your blog post about middle brow literature and PPNAS the other day. Today, a friend forwarded me this article in The Atlantic that (in my opinion) is another example of what you’ve recently been talking about. The research in question is focused on Major League Baseball and the occurrence that a batter is hit by a pitch in retaliation for another player previously hit by a pitch in the same game. The research suggests that temperature is an important factor in predicting this retaliatory behavior. The original article by Larrick et al. in the journal Psychological Science is here.

My concern is that the authors don’t control for the position of games within a season. There are several reasons why the probability of retaliation may change as the season progresses, but a potentially important one is the changing relative importance of games as the season goes along. Games in the early part of the season (April, May) are important as teams try to build a winning record. Games late in the season are more important as teams compete for limited playoff spots. In these important games, retaliation is less likely because teams are more focused on winning than imposing baseball justice. The important games occur in relatively cool months. There exists a soft spot in the schedule during June, July, and August (hot months) where the games are less consequential. Perhaps what is driving the result is the schedule position (and relative importance) of the game. Regardless of the mechanism by which the schedule position impacts the probability of retaliation, the timing of a game within the season is correlated with temperature.

One quick analysis to get at the effect of temperature in games of similar importance would be to examine those that were played in the month of August. Some of those games will be played in dome stadiums which are climate controlled. Most games will be played in outdoor stadiums. I am curious to see if the temperature effect still exists after controlling for the relative importance of the game.
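
In code, the quick version of that check might look something like the sketch below. Everything in it is hypothetical: a plate-appearance table pa with columns month, dome, temp_f, and a binary hbp outcome, none of which come from the paper.

```python
import statsmodels.formula.api as smf

# pa: hypothetical pandas DataFrame, one row per plate appearance:
#   month  - calendar month of the game
#   dome   - 1 if played in a climate-controlled stadium, else 0
#   temp_f - game-time temperature (degrees Fahrenheit)
#   hbp    - 1 if the batter was hit by a pitch, else 0
august = pa[pa["month"] == 8]

# Crude comparison: HBP rates in climate-controlled vs. outdoor
# August games, with game importance roughly held constant.
print(august.groupby("dome")["hbp"].mean())

# Within outdoor August games only, does temperature still predict HBP?
outdoor = august[august["dome"] == 0]
print(smf.logit("hbp ~ temp_f", data=outdoor).fit().summary())
```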

My reply: Psychological Science, published in 2011? That says it all. I’ll blog this, I guess during next year’s baseball season…

And here we are!

21 thoughts on “His concern is that the authors don’t control for the position of games within a season.”

  1. Come on, being published in Psych Science doesn’t say it all. Some good papers are published there too.

    Regarding this specific paper, I suppose one could control for the importance of the game to the team. Teams often play games in hot weather in September as well as the months of high summer. The authors could examine whether playoff-contending teams are less likely to retaliate in September heatwaves than teams that are not contending.

  2. Andrew, is this really fair? I agree Glynn’s story is plausible, but Larrick et al.’s story seems plausible as well, and this post offers no evidence at all that Glynn’s story is right – at this point it’s just an interesting conjecture. How is that a sufficient basis for disbelieving Larrick et al.’s story, especially given that the stories aren’t even mutually exclusive?

    • A researcher:

      I’m not saying the paper in question is horrible; it’s just standard Psychological Science circa 2011: open-ended theory, lots of potential interactions, lots of arbitrariness in the model, key result is a p-value, lots of interpretation of the results. The number of researcher degrees of freedom is immense. Just for example, from the abstract: “Controlling for a number of other variables, we conducted analyses showing that the probability of a pitcher hitting a batter increases sharply at high temperatures when more of the pitcher’s teammates have been hit by the opposing team earlier in the game.” With a big pile of data and a flexible theory, you’ll have no problem finding statistically significant patterns in your data. The interpretation of the results in this paper depends a lot on the assumed additive and linear form for all the other predictors in the model, and it’s not at all clear why this additivity should make sense, given the authors’ own claim that a certain interaction is so important.
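
      To make the objection concrete, the criticized specification has roughly the shape sketched below. This is only an illustration with invented column names, not the authors’ actual model: the key interaction is free to do the work, while every other control is forced to enter additively and linearly on the log-odds scale.

      ```python
      import statsmodels.formula.api as smf

      # pa: hypothetical plate-appearance data; all column names here are
      # illustrative, not taken from Larrick et al.'s dataset.
      # The headline claim rides on the temp_f:teammates_hit interaction,
      # while the remaining controls enter only as additive linear terms.
      fit = smf.logit(
          "hbp ~ temp_f * teammates_hit + attendance + score_diff + inning",
          data=pa,
      ).fit()
      print(fit.summary())
      ```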

      Again, I don’t see this paper as particularly bad; it’s just what people were publishing back in 2011, back before there was a general understanding of the issues of researcher degrees of freedom and forking paths. For yet another example, here’s footnote 7 of the paper:

      Timmerman (2007) has shown that pitchers born in the U.S. South are more likely than others to retaliate when their teammates are hit by a pitch. We tested whether temperature remained a key variable after controlling for three measures of southernness: the location of the game, birthplace of the pitcher, and home location of each team. Only one southernness variable had a significant effect (see Table 2): Playing games in the southern United States increased the probability of a pitcher hitting a batter. This result suggests that a subculture difference (Nisbett & Cohen, 1996)—perhaps fan expectations—contributes to pitchers’ aggressiveness.

      One could tell plausible stories like this forever.

  3. I’m not an expert, but this does look like a pretty good paper, and I agree with Michael and a researcher that Andrew is being unfair here. However, I also agree with Glynn’s criticism. Given the number of other predictors they include in the model, it seems really weird that Larrick et al. don’t include either “days since beginning of season” or some measure of how consequential the game is to the team. A better predictor than temperature might be “degrees Fahrenheit above the historical mean for the date (or month) and location of the game”.
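
    One way to construct that anomaly predictor, assuming hypothetical game-level data and a table of historical monthly mean temperatures by ballpark (all values below invented for illustration):

    ```python
    import pandas as pd

    # Hypothetical game-level data (invented values, illustration only).
    games = pd.DataFrame({
        "park":   ["ATL", "ATL", "BOS"],
        "month":  [6, 9, 6],
        "temp_f": [94.0, 78.0, 80.0],
    })
    # Historical mean temperature by ballpark and month (also invented).
    normals = pd.DataFrame({
        "park":        ["ATL", "ATL", "BOS"],
        "month":       [6, 9, 6],
        "mean_temp_f": [88.0, 81.0, 74.0],
    })

    games = games.merge(normals, on=["park", "month"], how="left")

    # Degrees above the historical norm for that place and time of year;
    # this separates "unusually hot for the date and location" from the
    # seasonal and geographic variation that tracks schedule position.
    games["temp_anomaly"] = games["temp_f"] - games["mean_temp_f"]
    print(games[["park", "month", "temp_anomaly"]])
    ```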

    • It looks like model misspecification again:

      http://statmodeling.stat.columbia.edu/2017/05/15/needed-good-research-hint-not-just-much-weight-given-small-samples-tendency-publish-positive-results-not-negative-results-perhaps-unconscious-bias/#comment-488871

      I really don’t understand how things got this bad when omitted variable bias, etc. are well known to the stats community. Does anyone really believe the model presented in this paper approximates the correct one? Isn’t almost every paper making this same mistake?

      • That’s a good point — I guess I was looking for predictors that didn’t make sense, rather than those that made sense but weren’t incorporated into the model very well. And the tiny effect size demands skepticism.

      • Andrew,

        Yes, I agree with you, almost every paper without a (quasi-)experiment or an explicitly assumed formal model is making this same mistake. In my experience, many social scientists (especially, ironically, researchers who are primarily experimentalists) believe that as you add more and more control variables to an OLS regression of your DV on your (theoretical) “IV”, omitted variable bias “becomes less likely”. Of course, this is wrong. But that’s what they believe.

        I think it has to do with the fact that it’s more difficult for researchers to think of additional alternative mechanisms / omitted variables if there are already many alternative mechanisms “controlled for” in OLS, so this creates the (false) perception that additional omitted variables are implausible or do not exist.
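
        A quick simulation makes the point concrete (all quantities invented): adding controls that are unrelated to the confounder leaves the omitted-variable bias exactly where it was.

        ```python
        import numpy as np

        rng = np.random.default_rng(0)
        n = 100_000

        u = rng.normal(size=n)            # confounder, never observed
        x = u + rng.normal(size=n)        # the "IV", contaminated by u
        c1, c2 = rng.normal(size=(2, n))  # extra controls, unrelated to u
        y = u + rng.normal(size=n)        # true causal effect of x is zero

        def coef_on_x(*controls):
            X = np.column_stack([np.ones(n), x, *controls])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            return beta[1]

        print(coef_on_x())         # ~0.5: pure omitted-variable bias
        print(coef_on_x(c1, c2))   # still ~0.5: more controls don't help
        ```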

        • I’m not Andrew, but thanks for the compliment.

          Also, if you read through the earlier thread I linked, you will see that the omitted variable problem still exists for experimental studies if you want to extrapolate outside your “population” (which is almost always; see Deming’s “enumerative” vs. “analytical” distinction linked to there). For example, my grandma is on at least a dozen different specific treatments at specific dosages; how well does any clinical trial approximate that? I’d guess no one has any idea.

        • Whoops, I had originally meant to reply to Andrew’s message, then I got sidetracked, came back, and read/replied to you, Anoneuoid. Anyway, I do agree with the spirit of your comment, except in my view the problem you describe is not “omitted variable bias” in the causal inference sense. Extrapolation is (at least formally) a different problem, and the optimal way to approach it probably requires both theoretical and empirical efforts.

          It’s true that if one believes we live in a world in which enormous “interaction” effects swamp all “main” effects, then the distinction I make is not so useful, but (1) that is because reliable/robust causal inferences are nearly impossible to systematically document if causality in the world hinges on complex interactions and cannot be parsimoniously described (a point the actual Andrew makes frequently), and (2) I don’t believe we live in such a world. If we did, it’s hard to explain why I can today successfully replicate Kahneman and Tversky’s effects from the 1970s in about 10 minutes.

        • “…it’s hard to explain why I can today successfully replicate Kahneman and Tversky’s effects from the 1970s in about 10 minutes.”

          Sorry, I am not familiar with what specifically you are referring to. If true, then these are some psychological/sociological laws and should be termed as such. Can you be more specific?

        • Sure – anchoring effect (Tversky and Kahneman 1974), reflection effect, certainty effect, loss aversion (Tversky and Kahneman 1981). You can replicate all four of these effects in literally 10 minutes (e.g. on MTurk).

        • Thanks, I found a paper with this question:

          Decision (i). Choose between:
          A. a sure gain of $240 [84 percent]
          B. 25% chance to gain $1000, and 75% chance to gain nothing [16 percent]

          Decision (ii). Choose between:
          C. a sure loss of $750 [13 percent]
          D. 75% chance to lose $1000, and 25% chance to lose nothing [87 percent]

          Have you done this one on MTurk? If so, how do you implement the second scenario? Do you give them money first or somehow get them to give access to their account? How does the amount of money involved affect the results, is there some kind of curve people have figured out?

        • Yes, that replicates on MTurk. The way I’ve done it (to avoid losses) is to change option (A) in Decision (i) to $250, divide all dollar amounts by 1000, and give people who will make Decision (ii) an extra $1. This makes the final outcomes identical.

          I don’t know exactly how the amount of money affects the results, sorry. But certainly I do replicate the qualitative reflection effect (p < .001), and the quantitative percentage point shift is reasonably close to the original Tversky-Kahneman result (not identical).
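
          For what it’s worth, the arithmetic behind “this makes the final outcomes identical” checks out; spelling it out (amounts restated from the description above):

          ```python
          # Decision (i), divided by 1000, with option A bumped to $250:
          #   A: sure gain of $0.25
          #   B: 25% chance of $1.00, 75% chance of nothing
          # Decision (ii), divided by 1000, plus a $1.00 endowment up front:
          #   C: $1.00 - sure loss of $0.75 -> sure $0.25
          #   D: $1.00 - (75% lose $1.00, 25% lose nothing)
          #      -> 25% chance of $1.00, 75% chance of nothing
          sure_i  = 0.25
          sure_ii = 1.00 - 0.75
          lottery_i  = {1.00: 0.25, 0.00: 0.75}  # payoff -> probability
          lottery_ii = {1.00 - 0.00: 0.25, 1.00 - 1.00: 0.75}

          # Same final outcomes, so any systematic choice reversal between
          # the two decisions is pure framing.
          assert sure_i == sure_ii and lottery_i == lottery_ii
          ```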

  4. With all of the things they controlled for, I’m sort of surprised that they didn’t control for batter/pitcher handedness. Pitchers are almost twice as likely to hit a same-handed batter (right-handed pitchers hitting right-handed batters or lefties against lefties).

    • Handedness is likely random error in this study, not systematic bias. It’s hard for me to see how handedness would be an alternative explanation for the effect of temperature on being hit by a pitch.

      • Yeah, I agree with that.

        Some other ideas to prove that I’m giving this too much thought… At least in recent years, there are slightly more HBP on Sundays (the final game in a weekend series), and Sunday games are almost all day games and thus hotter. The day/night temperature difference might also be a reason why they would have been better off including season as a random effect rather than insisting on a linear relationship between season and HBP rate. Stretches of seasons with higher HBP rates could also have more day games than we’d expect from a model implying a linear increase in the percentage of night games. Pitchers are also most likely to hit same-handed batters with inside fastballs, and they might be somewhat less inclined to throw inside fastballs on hot days, since home run rates increase with temperature.
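
        As a sketch of that last modeling suggestion (hypothetical data and invented column names again), one could compare the temperature coefficient when season enters linearly versus as unconstrained dummies, a fixed-effects stand-in for the random-effects idea:

        ```python
        import statsmodels.formula.api as smf

        # pa: hypothetical plate-appearance data (invented columns).
        # Season forced onto a linear trend, as in the paper's setup:
        linear = smf.logit("hbp ~ temp_f + season", data=pa).fit()

        # Season as dummies, so hot, high-HBP stretches of seasons are
        # absorbed instead of being forced onto a linear trend:
        dummies = smf.logit("hbp ~ temp_f + C(season)", data=pa).fit()

        print(linear.params["temp_f"], dummies.params["temp_f"])
        ```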

      • Might handedness interact with the effect of temperature, rather than being either noise (insofar as anything here is more than noise) or an alternative explanation? The graph on page 426 shows the probability of being hit by a pitch as between 0.007 and 0.008 at 90F, with the slope depending on the number of the pitcher’s teammates previously hit. In particular, a difference in probability of 0.001 (at 3 teammates hit) is amplified to a difference of nearly 0.005 at >90F. That makes me think that temperature might also amplify the average difference between same-handed and opposite-handed pitcher-batter matchups. But maybe it shrinks the gap! Or does nothing! Since they didn’t present that comparison, there’s no way to tell the sign or magnitude of the interaction.

        Google says that about 25% of MLB players are left-handed, and the fraction of pitchers is similar, so assuming reasonable mixing, any given at-bat has around a 60-65% chance of same-handedness (0.25² + 0.75² ≈ 0.63). J. Cross claims that pitchers are almost twice as likely to hit a batter with the same handedness. Given those numbers, there’s room for the interaction to look like pretty much anything.

        • Teams tend to stack their lineups with batters who hit from the opposite side from the pitcher, and, of course, switch hitters always hit from the opposite side. The upshot is that (from 2002 to the present) starting pitchers have faced same-handed batters 42% of the time (hitting 0.86% of them with pitches) and relief pitchers have faced same-handed batters 53% of the time (hitting 0.98%). I think hot games with more offense likely see a higher % of PA thrown by relievers and thus a higher % of same-handed matchups, but I’m not sure if the effect is big enough to matter.

  5. To be fair, Larrick et al. controlled for game attendance (mentioning that they treat this as a measure of game importance), so the scheduling/importance issue isn’t entirely ignored in their analysis.
