Ahhhh, PPNAS!

To busy readers: Skip to the tl;dr summary at the end of this post.

A psychology researcher sent me an email with subject line, “There’s a hell of a paper coming out in PPNAS today.” He sent me a copy of the paper, “Physical and situational inequality on airplanes predicts air rage,” by Katherine DeCelles and Michael Norton, edited by Susan Fiske, and it did not disappoint. By which I mean it exhibited the mix of forking paths and open-ended storytelling characteristic of these sorts of PPNAS or Psychological Science papers on himmicanes, power pose, ovulation and clothing, and all the rest.

There’s so much to love (by which I mean, hate) here, I hardly know where to start.

– Coefficient estimates and standard errors such as “1.0031** (0.0014)” (yes, that’s statistically significantly different from the baseline value of 1.0000).

– Another coefficient of “11.8594” (dig that precision) with a standard error of “11.8367” which is still declared statistically significant at the 5% level. Whoops!

– The ridiculous hyper-precision of “Flights with first class present are ∼46.1% of the population of flights” (good thing they assured us that it wasn’t exactly 46.1%).

– The interpretation of zillions of regression coefficients, each one controlling for all the others. For example, “As predicted, front boarding of planes predicted 2.18-times greater odds of an economy cabin incident than middle boarding (P = 0.005; model 2), an effect equivalent to an additional 5-h and 58-min flight delay (0.7772 front boarding/0.1305 delay hours).” What does it all mean? Who cares!

– No raw data. Sorry, proprietary restrictions, so nobody can reproduce this analysis! (Don’t get me wrong, I have no problem with researchers learning from proprietary information; I do it all the time. What the National Academy of Sciences is doing publishing this sort of thing, I have no idea. Or, yes, I do have an idea, but I don’t like it.)

– Story time: “We argue that exposure to both physical and situational inequality can result in antisocial behavior. . . . even temporary exposure to physical inequality—being literally placed in one’s “class” (economy class) for the duration of a flight—relates to antisocial behavior . . .”

– A charming reference in the abstract to testing of predictions, even though no predictions were supplied before the data were analyzed.

– Dovetailing!

The data

The authors don’t share any of their data, but they do say that there were between 1500 and 4500 incidents in their database, out of between 1 and 5 million flights. So that’s about 1 incident per thousand flights.

They report a rate of incidents of 1.58 per thousand flights in economy seats on flights with first class, .14 per thousand flights in economy seats with no first class, and .31 per thousand flights in first class.

It seems like these numbers are per flight, not per passenger, but that can’t be right: lots more people are in economy class than in first class, and flights with first class seats tend to be in bigger planes than flights with no first class seats. This isn’t as bad as the himmicanes analysis but it displays a similar incoherence.
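To see why the denominator matters, here’s a back-of-the-envelope version (a sketch in Python; the per-flight rates are the paper’s, but the cabin sizes are my own guesses, not anything reported in the paper):

```python
# Incident rates per 1000 flights, as reported in the paper
rate_econ_with_fc = 1.58   # economy cabin, flight has a first class
rate_econ_no_fc   = 0.14   # economy cabin, no first class
rate_first        = 0.31   # first-class cabin

# Hypothetical cabin sizes (my guesses, not from the paper)
seats_econ_big   = 180     # economy seats on a plane with first class
seats_econ_small = 120     # economy seats on an all-economy plane
seats_first      = 16      # first-class seats

# Convert per-flight rates to incidents per million passengers
for label, rate, seats in [
    ("economy, with first class", rate_econ_with_fc, seats_econ_big),
    ("economy, no first class",   rate_econ_no_fc,   seats_econ_small),
    ("first class",               rate_first,        seats_first),
]:
    print(f"{label}: {rate / seats * 1000:.1f} incidents per million passengers")
```

Under these made-up seat counts, first class comes out with the highest per-passenger rate, the opposite of the per-flight ordering, which is why the choice of denominator can’t just be waved away.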

There’s no reason we should take this sort of tea-leaf-reading exercise seriously. Or, to put it another way—and I’m talking to you, journalists—just pretend this was published in some obscure outlet such as the Journal of Airline Safety. Subtract the hype, subtract the claims of general relevance, just treat it as data (which we don’t get to see).

I should perhaps clarify that I can only assume these researchers were trying their best. They were playing by the rules. Not their fault that the rules were wrong. Statistics is hard, like basketball or knitting. As I wrote a few months ago, I think we have to accept statistical incompetence not as an aberration but as the norm. Doing poor statistical analysis doesn’t make Katherine DeCelles and Michael Norton bad people, any more than I’m a bad person just cos I can’t sink a layup.

tl;dr summary

NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.

P.S. I was unfair to NPR. See here.

103 thoughts on “Ahhhh, PPNAS!”

  1. As a point of interest, the editor — Susan Fiske of Princeton University — was the dissertation advisor of Amy Cuddy of “power pose” fame.

  2. I am trying to think of all the other variables that they could plausibly have included in their analyses if they were really trying to build a good predictive model. I would definitely include whether or not alcohol was served, the time of day of the flight, and the presence or absence of in-flight entertainment systems.

  3. “NPR will love this paper. It directly targets their demographic of people who are rich enough to fly a lot but not rich enough to fly first class, and who think that inequality is the cause of the world’s ills.”
    Bingo. As much as people ridicule the economists for assuming rationality: if you look at the incentives that people face, it is easy to understand their behavior.

  4. This ‘story’ popped up in my news feed this morning and my first thought was that it sounds like a spurious-correlation story. Most flights that I’ve taken that have a first-class section (not biz class) are long-haul flights greater than 8 hrs. One would think that there would be a high correlation between air rage and time spent on a plane.
    Anyway, I was going to rip the paper apart and see what numbers they crunched, but you’ve obviously already done the groundwork.
    Cheers!

  5. “They report a rate of incidents of 1.58 per thousand flights in economy seats on flights with first class,… and .31 per thousand flights in first class.”

    But I would think usually the ratio of economy class seats to first class seats is greater than 5:1, so it may be that on a per passenger basis first class is more rowdy. This wouldn’t surprise me, because the sense of entitlement you get from paying for first class may be pretty high, and aren’t the drinks still free up there?

  6. They needed the 4 decimal places because they didn’t think to code the flight distance effect per 1000 miles. Love the supplementary table where day of week and month are listed alphabetically for convenient interpretation. Sorry those are trivial comments – but the paper is a bit of a disaster. The supplementary materials show that there is no correlation greater than r = .05 to any outcome, and these are only the obvious risk exposures of distance and number of seats. So should we make a nice table of counternulls for these data? With three asterisks we can say for certain that the effect sizes are less than X…. maybe not quite as newsworthy.

  7. “approved March 30, 2016 (received for review November 3, 2015)”. I admire PNAS’s fast review time. One of my students’ papers took 1.5 years from submission to publication, excluding revision time. This was just the time that the paper sat with the journal. In the first submission, the editor either didn’t send it off for review or requested reviewers and never followed up, for 9 straight months.

    Also, PNAS’s webpage says: “PNAS publishes only the highest quality scientific research. Every published paper is peer reviewed and has been approved for publication by an NAS member.” So this is what a NAS member would approve?

    It also says: “The PNAS impact factor is 9.674 and the Eigenfactor is 1.41892 for 2014. ” What the heck is an Eigenfactor? I did find an explanation, and it is quite juicy: “The Eigenfactor Score calculation is based on the number of times articles from the journal published in the past five years have been cited in the JCR year, but it also considers which journals have contributed these citations so that highly cited journals will influence the network more than lesser cited journals. References from one article in a journal to another article from the same journal are removed, so that Eigenfactor Scores are not influenced by journal self-citation.”

  8. Two questions.

    Do I understand you correctly? When I’m looking at their regression table, my main concern is that they’ve done a great deal of forking paths (and, indeed, multiple testing) and yet have barely significant results on their “main” coefficient. I totally believe the large effects of flight length etc. because the coefficients are really quite large (and it seems reasonable). But they are focusing on a coefficient that is strongly subject to multiple potential tests, and has weak significance anyway. So without prespecified tests, the evidence is null. Yes?

    On the other hand, I don’t understand the problem with interpreting multiple (‘zillions of’) regression coefficients. With enough data, this seems totally fine to me. And moreover in each one we should be ‘controlling for all the others’ — to do otherwise would yield obviously biased estimates. Controlling for the others can reduce the biasing effects of correlation between the predictors. I don’t think you’re suggesting interpreting multiple coefficients each in their own bivariate regressions, so (1) what is the problem with interpreting these (large) coefficients and (2) what should we do instead? They have lots of data, why not interpret all the coefficients, keeping in mind that the analysis is not prespecified?

    • Rob:

      The causal interpretation of a regression coefficient is a change in one variable with all other variables held constant. It is difficult to interpret, for example, a change in the “number of first class seats” variable conditional on the “length of flight” variable held constant, given that longer flights could be in planes with more first class seats. And it’s not just those two variables. All the variables in that regression are tangled in that way.

      • I agree that there is probably a tendency for longer flights to have more seats. However, it was my understanding that an existing correlation between predictors is exactly the circumstance in which it is important that we include both variables, so as to avoid an omitted variable bias on the included coefficient (and on the truly nonzero omitted coefficient).

        In small datasets, it is common for coefficients on correlated variables to steal from each other and interpretation is difficult because there is too much noise to distinguish the effect of each variable from one another. However, this problem is in large part reduced in large datasets as the regression can take advantage of all the cases of big-plane–short-flight and small-plane–long-flight which are bound to be present in a large dataset.

        In situations having correlated predictors and lots of data, therefore, I understand it is a good practice to include the correlated predictors. Do you recommend otherwise?

        • Rob:

          No, in this case the predictors are just too tangled. If you have a clean design then you can use such a regression to estimate average differences, and if you really believe your additive linear model then, yes, you can pull these things apart. But in this case no, you’re just throwing your data into a machine and telling stories about what pops out. The model makes no sense, and the estimates have no clean interpretation apart from the model. If you really want to study the effect of sitting near first-class passengers or whatever, I think you need to design an experiment or an observational study just for that. The idea that you can throw 9 variables into a regression and estimate 9 causal effects . . . well, yes, I know that’s how this stuff is often taught, but it’s wrong. It might get your work published in PPNAS and featured on ABC News but it’s not telling you much about psychology or about air safety.
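          To make the “tangled” point concrete, here is a toy simulation (a Python sketch with invented numbers, nothing to do with the paper’s data). Even in the friendly case where one predictor causally drives another and the model is exactly right, the joint regression recovers only direct effects, so reading each coefficient as “the” effect of its variable goes wrong:

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          n = 100_000

          # Invented causal chain: longer flights -> bigger planes (more seats),
          # and the outcome depends on both.
          flight_hours = rng.uniform(1, 12, n)
          n_seats = 30 * flight_hours + rng.normal(0, 20, n)
          y = 0.5 * flight_hours + 0.02 * n_seats + rng.normal(0, 1, n)

          # The joint regression recovers the *direct* effects...
          X = np.column_stack([np.ones(n), flight_hours, n_seats])
          print(np.linalg.lstsq(X, y, rcond=None)[0])   # ~ [0, 0.5, 0.02]

          # ...but the total effect of one extra flight hour is
          # 0.5 + 30 * 0.02 = 1.1, which no single coefficient shows.
          ```

          And that is with a true additive linear model and only two tangled predictors; with nine predictors, nonlinearity, and forking paths, the one-coefficient-one-effect reading has no hope.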

        • What would a “clean” observational design look like in this case?

          If you were doing an experiment, you might randomly assign some coach passengers to sit near first class and others to sit far away from first class. Or you might randomly assign some coach passengers to mid-boarding planes with first class cabins and others to front-boarding planes with first-class cabins. You could even have another treatment where you hired actors to sit in first class and be eating caviar while the coach passengers boarded. (There are feasibility issues with these designs, obviously).

          But I’m curious about what a good observational design might look like.

        • John:

          You’d want to look for some sort of natural experiment (when an airline changed policies, or comparison between airlines with different policies, or comparison between similar routes on the same airline with different policies). Jennifer and I discuss this sort of thing in chapter 10 of our book.

  9. > They report a rate of incidents of 1.58 per thousand flights in economy seats on flights with first class, .14 per thousand flights in economy seats with no first class, and .31 per thousand flights in first class.

    > It seems like these numbers are per flight, not per passenger, but that can’t be right: lots more people are in economy class than in first class, and flights with first class seats tend to be in bigger planes than flights with no first class seats.

    I don’t understand your argument here. Why can’t first class passengers just be more likely to cause an incident? Could you spell it out please?

    • Rob:

      Per-flight numbers are hard to interpret. If a plane has 20 people in first class and 100 in coach, how do you think about the total number of incidents in each part of the plane?

  10. If I am reading the supplementary materials correctly, then boarding through the front is actually negatively correlated with first-class or economy incidents (see Table S2).

    • Front-boarding is also very strongly correlated with all sorts of other variables in the model, and I am wondering if this is not just a silly suppressor effect (with reversed sign). Maybe front boarding actually reduces air rage because it makes us aspire to be like all the awesome people in first class who we have to walk past! Of course, I am not sure why being “emotional”, “sexual” or “intoxicated” qualifies as “air rage”.

      • I agree, Mark, that there seems to be a suppressor effect here. The relation is negative when a single predictor is used but positive when more than one predictor is used. As is so often the case, the authors seem not to have noticed the suppressor effect. It is difficult to know how to interpret the results of analyses that contain suppressor effects in the absence of a theory that predicts them. Sometimes they are just meaningless statistical artifacts. Other times, they may be of substantive interest.

        • I just took a closer look. The results are definitely due to suppressor effects, and the suppressor effects are due to (1) high correlations between some of the predictors and (2) correlations with different signs.
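          For readers who have not run into suppression before, here is a minimal illustration of the sign-flip pattern (a Python sketch with toy numbers, nothing to do with the paper’s actual variables):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          n = 100_000

          # Two highly correlated predictors with opposite-signed true effects
          x1 = rng.normal(size=n)
          x2 = x1 + 0.3 * rng.normal(size=n)            # corr(x1, x2) ~ 0.96
          y = x1 - 1.2 * x2 + rng.normal(size=n)        # x2 suppresses x1's effect

          # Bivariate relation: weakly negative
          print(np.corrcoef(x1, y)[0, 1])               # ~ -0.18

          # Multiple regression: x1's coefficient is large and positive
          X = np.column_stack([np.ones(n), x1, x2])
          print(np.linalg.lstsq(X, y, rcond=None)[0])   # ~ [0, 1.0, -1.2]
          ```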

        • Thanks for confirming my suspicion. How can the authors (and the reviewers and the editor) not be aware of something so elementary?

        • Mark:

          Regarding the authors, see here. Statistics is hard. Multiple regression is hard. Figuring out the appropriate denominator is hard. These errors aren’t so elementary.

          Regarding the reviewers, see here. The problem with peer review is that the reviewers are peers of the authors and can easily be subject to the same biases and the same blind spots.

          Regarding the editor: it doesn’t help that she has the exalted title of Member of the National Academy of Sciences. With a title like that, it’s natural that she thinks she knows what she’s doing. What could she possibly learn from a bunch of blog commenters, a bunch of people who are so ignorant that they don’t even believe in himmicanes, power pose, and ESP?

        • I agree with Andrew that statistics is hard and that it is so easy to make mistakes. But Norton has a PhD in psychology from a top-notch university. Suppression effects are, or should be, well known in psychology. Some, although not all, of the basic graduate introductory statistics books in psychology discuss suppression, and many articles on the topic have appeared in the good psychology journals, as well as in the statistics journals (which most psychologists other than quantitative psychologists do not read). The issue of suppression is also discussed in many articles on mediation, a popular topic in psychology nowadays.

          Nonetheless, I often see published articles with unrecognized suppression effects. I don’t understand this.

          I wonder whether it is worth a comment to PNAS? I myself have written about suppression before so that I could just adapt some of that material for a comment. I once submitted a comment on a different article to PNAS and PNAS didn’t want to hear it. It had already published someone else’s comment regarding a different issue on that same article and PNAS indicated to me that it therefore had satisfied its obligation. Bah.

        • @Carol

          How about an email to DeCelles / Norton / Fiske expressing your concerns?

          I wonder what they would say.

        • DeCelles should know about this issue by now because there is already a post on PubPeer about the suppression effect. First authors get automatically notified.

    • I can’t help but wonder if the April date is relevant … but then again, there is some crazy stuff out there that isn’t intended as a joke.

    • In the hard sciences, can you really publish an article from a brand new collaboration within a year (or less, given that most first flights to/from a new airport probably aren’t Jan. 1)? That is sorta weird in Figure 1. OK, pretty weird.

      The overarching research design strikes me as reasonable (it is an event study, and I tend to like those), but I don’t know that much about doing analysis on dyadic pairs in this way and the potential problems that could bring (statistically and/or in terms of sample construction).

      But mostly I can’t get over the instantaneous effect. Maybe chemists publish really fast? It just seems strange to me that the effect would happen like that if it really was about airfare and travel convenience. I’d expect the effect to fade in, not just appear.

      • My guess, having looked only at the graph and not read the actual paper, is that all that’s happening is that people are putting each other on their papers. You are trying to design some new catalyst or something, and you’re about 70% of the way done, and have started writing the article with only a few followup experiments to do… you fly off to a conference, and suddenly you’re discussing catalysts with Joe Cummerbund who suggests a minor modification that makes everything more stable, and you follow up those few experiments including his suggestion, and you jam his name on the paper and off it shoots…

        This is hugely helpful, but it’s not the same thing as an in-depth collaboration.

  11. Why the dig at NPR? And why the implication that NPR listeners cannot distinguish good scientific articles from bad ones that agree with listeners’ values? On that note, why the implicit indictment of said values (i.e. the desire to reduce inequality, etc.)? I find these statements saddening and confusing.

    • Sepp:

      As I said in the P.S., I was unfair to NPR. I could’ve simply scrubbed the NPR mention, but that wouldn’t seem fair; it seems more appropriate for me to admit I was wrong on this one!

      As to your other question: Sorry, but NPR has fallen for this sort of junk science before. For example, NPR and several local NPR stations ran completely uncritical segments on power pose.

      And, no, I meant no condemnation, implicit or otherwise, on the desire to reduce inequality. I expect the message of this air-rage study to be simpatico with the experiences and attitudes of many NPR listeners and broadcasters. That doesn’t mean I think these attitudes are bad.

  12. Do I interpret correctly that the “study” is missing any sort of attempt at a control group as well? “To test our predictions, we obtained a private database of all incidents of air rage from a large international airline over several years (circa 2010) of between 1 and 5 million flights… ”

    It seems that the analysis would overlook the possibility that certain types of planes and configurations are much more frequently employed by the airline in question, and that the incidence of air rage could simply be randomly distributed among flights. If so, that seems like the kind of thing that a reviewer ought to pick up on pretty easily (since I spent all of 5 minutes reading it).

    Anecdotally, I can’t recall a flight that I’ve been on that DIDN’T have a first class cabin and didn’t have us board from front to back through that cabin, so it’d seem that this type of arrangement is really common and thus we’re looking at nothing.

    • Anecdotally, I can’t recall a flight that I’ve been on that DIDN’T have a first class cabin

      There are plenty of flights that don’t have first-class (or even business-class) cabins. These tend to be in smaller planes on short-haul flights, but some airlines (e.g., Southwest) have nothing but economy-class seating.

      (Anecdotally, I can’t recall the last time a flight that I’ve been on DID have an actual first-class cabin — as opposed to a business-class cabin. But I wouldn’t assume they didn’t exist just on that basis.)

      The puzzling thing for me is that there’s no mention of how they handled business class (except for the cases of three-cabin airplanes, where they averaged business and first class, though they mentioned there were no incidents of “air rage” in business class in three-cabin airplanes).

      • Fair enough. :) I had a different prior (is it even really fair to call it a prior since my belief was based on the ridiculously poor sample of my flying experience???) that supported my line of questioning – I was expecting lots of first class cabins on flights so it’s probably nothing.

        “I’ll take ‘Garden of Forking Paths’ for $200, Alec”, in regards to the business class coding process. I suppose they’d have coded it whichever way supported their hypothesis better. ;)

        • I was expecting lots of first class cabins on flights so it’s probably nothing.

          For what it’s worth, the answer is in one of the quoted statistics that Andrew was mocking: approximately “46.1%” of flights had first-class cabins, so slightly over half did not. That’s out of several million flights all over the world; there will probably be differences in some markets.

          I suppose they’d have coded it whichever way supported their hypothesis better. ;)

          See, this is why I’m rather skeptical about the whole “garden of forking paths” idea: it seems to function as a kind of generator of just-so stories for dismissing studies, calibrated by little more than the level of one’s cynicism.

          Actually, looking more closely at the paper, it appears they used data from just a single “large international airline”, which for the most part has just first class and economy, and my concerns about how business class was coded are irrelevant (and dealt with in the SI).

  13. I’m replying here to Rahul’s comment on May 4, 2016, at 11:04 AM and Mark’s comment on May 4, 2016, at 12:37 PM, because there is no reply link underneath their comments (perhaps because we have gone so far down in the comment tree?).

    Rahul: I’ll send DeCelles and Norton a quick note about my concerns, as you suggested. I’ve sent many such notes over the years and often there is no response. But I’ll give it a try.

    Mark: Thanks for the heads up about PubPeer. I’ll take a look. I wonder how much attention authors pay to PubPeer?

  14. We should be thankful that this research has inspired a wonderful flight of fantasy from Anne Perkins in The Guardian:
    http://www.theguardian.com/world/2016/may/04/air-rage-inequality-happiness-plane-passengers

    “They were less concerned with the wider costs of inequality. But that doesn’t stop their findings brilliantly revealing how perceptions of inequality undermine social wellbeing. They have come up with a kind of micro truth about unequal societies that is as revealing about the 21st century as Jane Austen’s universal truth about rich men and marriage was about women’s status in the early 19th century.”

    Or maybe they just found a bunch of p-values less than 0.05.

    • Such wild speculation based on a significant p-value is as old as NHST itself:
      “From hence it follows, that Polygamy is contrary to the Law of Nature and Justice, and to the Propagation of the Human Race”

      John Arbuthnot (1710). “An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes”. Philosophical Transactions of the Royal Society of London 27 (325–336): 186–190. doi:10.1098/rstl.1710.0011
      http://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf

    • Anne Perkins writes that the authors are at Princeton University. They are not. DeCelles is at the University of Toronto and Norton is at Harvard University. Fiske (the editor) is at Princeton. Can’t these journalists even read?

  15. Michael Norton is the second author of this paper. Susan Fiske was the action editor of this paper. Guess who sat on Norton’s dissertation committee when he did his PhD at Princeton? How did Fiske not recuse herself from this paper?

  16. The author seems not to have understood the statistical results presented in the paper. I do not see what is wrong with “Coefficient estimate and standard errors such as ‘1.0031** (0.0014)’” or with “Another coefficient of ‘11.8594’ (dig that precision) with a standard error of ‘11.8367’ which is still declared statistically significant at the 5% level.”

    These ‘coefficients’ (which are odds ratios) seem to be reported and interpreted properly. First, odds ratios with values close to 1 can easily represent remarkable effects (it depends on the scale of the original variable). For example, the odds ratio for flight distance in miles is very close to 1 (1.0004). Air travel is often long distance, however, so an odds ratio of 1.0004 per mile is quite remarkable – it means that the odds of air rage double with every ~1700 miles of flight distance.

    Second, there is nothing incorrect about an odds ratio (OR) of 11.86 with a standard error (SE) of 11.84 being significant. Note that the standard error, originally estimated for the b coefficient (where b = log(OR)), is transformed to the OR scale by multiplying it by the OR (i.e., SE_OR = OR × SE(b)). So the 95% confidence interval is in this case 1.66 to 84. In fact, the SE for an OR can easily be larger than the OR; it could be as high as 14 in this case without rendering the results insignificant. (The authors could, however, report it as a CI in the first place to make it easier to read.)
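    A quick numerical check of that arithmetic, using only the figures quoted above (a sketch in Python):

    ```python
    import math

    or_hat = 11.8594                  # odds ratio reported in the paper
    se_or  = 11.8367                  # standard error on the OR scale

    # Back out the log-odds coefficient and its SE (SE_OR = OR * SE(b))
    b    = math.log(or_hat)           # ~ 2.47
    se_b = se_or / or_hat             # ~ 1.00

    # 95% CI on the log scale, then exponentiate
    lo = math.exp(b - 1.96 * se_b)
    hi = math.exp(b + 1.96 * se_b)
    print(f"95% CI for the OR: {lo:.2f} to {hi:.0f}")   # 1.66 to 84

    # Flight-distance OR of 1.0004 per mile: miles for the odds to double
    print(math.log(2) / math.log(1.0004))               # ~ 1733 miles
    ```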

    Remaining issues are rather petty. Should we NOT publish results based on proprietary data? I personally dislike and avoid proprietary data, but it’s better to have this paper out than not. Cheers to the airline for being willing to expose itself to potential PR backlash in the name of science. Hyperprecision? It’s necessary when dealing with odds ratios. Storytelling, dovetailing… really? Why so much hate?

    • Michal:

      The problem is that the paper is a mess, drawing causal conclusions from observational data without any serious plan, leaving a bunch of uninterpretable regression coefficients. If the main analysis were fine, I wouldn’t be so bothered by the cliches, the innumeracy, and the storytelling.

      As to the data being unavailable: Given the problems with the analysis, it’s hard to learn anything at all without being able to see the data. I agree with you that this work is “in the name of science.” I just don’t think it’s good science. That can be expected—not all scientific work will be of high quality. But with the data being unavailable, nobody else can do good science with these data either.

      • Yesterday I requested from the authors either a copy of the dataset or, if that is not possible, some supplemental analyses. DeCelles responded that they have a legal agreement not to provide the data or additional details to people not on the research team. PNAS’s stated policy is that one must share data to publish in PNAS.

        So here we have an article, the results of which seem to reflect suppression effects (which, without strong theory, are generally considered to be statistical artifacts), for which no one else can have access to the data, edited by a person who was on the PhD thesis committee of one of the authors and has co-authored at least one article with that author.

        Terrible.

        • I agree with your assessment. Did you point out the suppression effect? Obviously DeCelles already knows about this via PubPeer.

        • Yes, Mark, I told them that their results “seem to reflect an unusual and interesting statistical phenomenon called ‘suppression'” and that it is likely that these results are statistical artifacts.

        • Kirby: I sent two notes. The first mentioned the possibility of suppression and requested the data or, if that was not possible, the results of some supplemental analyses. DeCelles responded that she could not provide either due to their legal agreement.

          I then sent a second note stating that their results could be statistical artifacts and also mentioning that PNAS seems to require data sharing as a condition of publication. Neither author responded to this second note.

          I am considering submitting a brief comment to PNAS.

  17. Middle boarding is only available on certain aircraft models and is therefore integral to the configuration of the seating and the number of aisles, the placement of the kitchen galleys (all those slamming cabinets), the number and placement of the washrooms, the blocking of the aisles by the service carts, and the impact on storage. Seatguru.com, which was used as a resource for the study, codes individual seat quality as red/yellow/green. For such dramatic findings of belligerent passengers in first class, the seatguru data on seat quality and the number and type of comments (positive, mixed, and negative) for the aircraft models with middle boarding vs. those with only front boarding might have been useful data to crunch. The authors mention washrooms only once, in the supplement, regarding the lack of data from the airline on warnings to economy passengers not to use the first-class washrooms, which reveals a single-minded focus on their hypotheses throughout the analysis. How is it possible to ignore the factors mentioned consistently on seatguru – noisy or cold areas, sitting near the washroom, the functioning of wifi and electric outlets, and the visibility of monitors?

  18. Agree with Michal – is there anything substantive in the points raised beyond a suspicion that the paper has problems?

    The precision of the estimates is not really relevant, nor is the SE point. The garden of forking paths and unavailability of data apply equally to many/most studies using secondary data sources. Yes, planes with first class contain more seats than planes with economy only, but not even twice as many seats on average – so the difference doesn’t even come close to explaining the gap in the incidence rate of air rage (1.58 vs. 0.14) in economy cabins on flights with versus without first class.

    Understanding this difference is surely interesting and worthy of investigation.

    • Michael:

      There’s a long tradition of people running big regression analyses on observational data and interpreting a whole pile of coefficients, heedless of the difficulty of simultaneously considering comparisons of A while holding B constant, and considering comparisons of B while holding A constant, etc. So in that sense this paper is no worse than lots of other papers published in Air Safety Quarterly or whatever. I agree that it’s fine to get these findings out there, so in that sense the big problem is with the hype. But, beyond that, I think researchers should bring something to the data. If you do share the raw data, then others can analyze it. If you don’t share the raw data, then the standards are higher.

      If you want to look at the rate of incidents in economy class, comparing flights with and without first class cabins, fine: That’s an observational study to do, and there are principles of performing such studies. Not everyone is aware of these principles and associated methods, but a lot of people are. I’d hope that a purportedly top-notch journal such as PPNAS would catch this in one of the referee reports, even if the authors and the editor did not happen to know about this area of statistics. Regression ain’t magic.

      One of the reasons that post-publication review is so important is that, even if the authors of the paper, the editor, and 3 referees did not catch the problems with this paper, the problems are apparent to a lot of outsiders.

      In response to the final comment: Yes, sure, the patterns in these data are worthy of investigation. I just don’t think this was a particularly good investigation. They threw a bunch of numbers against the wall and saw what stuck. That can be a good start but at some point you want to get serious. I don’t think PPNAS should be publishing the before-you-get-serious stuff. I feel the same way about the himmicanes paper: Sure, cool data, interesting pattern, publish the speculation, go for it. But be clear that it’s speculation and that a serious analysis could lead you to much different conclusions.

    • P.S. In the YouTube comments you write that I’d commented on Fiske’s integrity. I looked this up: integrity is defined as “the quality of being honest and having strong moral principles; moral uprightness.” I have no particular reason to think that Fiske lacks these qualities. I do think she’s published some bad papers, but that’s not the same as dishonesty.

      I have never met Fiske and have no sense of her personal qualities. But, speaking more generally, if someone is confused about something, having strong moral principles will not necessarily put them on the path to good science. To put it another way, I can well believe that Fiske is honest, with strong moral principles and moral uprightness—but that she happens to be confused about variation in measurement, and uncertainty in inference. Conditional on her being confused about these statistical issues, Fiske could manifest her strong moral principles etc by standing up for what she believes is right, which just happens to be poor scientific practice such as in the notorious papers on himmicanes, power pose, etc.

      Perhaps it would help to give a navigational analogy. Suppose you’re a tourist in an unfamiliar city and you’re trying to get to some particular place. But you’re turned around, so the faster you walk, the further you get from your destination. You’re not trying to get lost, but you are.

      That’s what I think may be happening with Fiske. Her goal is to promote scientific discovery. But she’s turned around, so she keeps promoting noise studies that lead nowhere. But the problem is not a lack of integrity, it’s a statistical confusion.

      Now, sure, at some point she should read criticisms of the papers she’s published, and realize that something’s gone wrong. She could, for example, talk with her Princeton colleague Kosuke Imai, a poli sci professor who’s an expert on statistics and causal inference. But, admitting you’ve been confused: that’s tough. Fiske may be closing her eyes a bit here, but I wouldn’t quite call that a lack of scientific integrity. For better or for worse, engaging criticism in this way is just so rare. I’d love it if all scientists would engage with criticism but it seems that they often don’t. So, even though I’d like it if Fiske would reconsider her endorsement of power pose, himmicanes, and all the rest, I can’t quite call it a lack of integrity when she stands pat and doesn’t fold her hand.

      • Forgive the slow reply here. Was traveling in Greece, Israel, and Egypt, when you posted this.

        Certainly, only you can decide whether you would describe your comments as addressing Fiske’s scientific integrity. However, others may decide that your comments are indeed on Fiske’s scientific integrity, whether you view that way or not.

        Definition. Of course you got the first definition right. However, most dictionaries will also provide additional definitions of integrity. These are from dictionary.com:

        2. the state of being whole, entire, or undiminished: to preserve the integrity of the empire.

        3. a sound, unimpaired, or perfect condition: the integrity of a ship’s hull.

        Or, as I would argue, “The integrity of a field’s scientific claims.”

        Certainly your metaphor of a lost tourist describes one way someone can get lost. However, I would use another metaphor, not necessarily instead of yours, but in addition: medical malpractice. A doctor who harms a patient is liable *if* that doctor could have and should have known better than to use the practices that created the harm.

        I concur with Daniele Fanelli, http://europepmc.org/abstract/med/23407504, that scientific misconduct is more than just making up data.

        So, without presuming *you* believe what you have discussed constitutes misconduct, here is a brief synopsis of some of the issues raised here:

        Cronyism (Fiske was on Norton’s diss committee; Fiske has collaborated with Norton; although some of your commenters, not you, made these points, it is still fair to describe this as “discussed on your blog”)
        Lack of transparency (data sharing); conspicuously NOT insisted upon by Fiske

        Then, indeed, you wrote:
        I think it’s irresponsible of the editors of PPNAS to keep publishing these things. Maybe Susan Fiske doesn’t know any better, but couldn’t the editor of PPNAS know better than to send these papers to Susan Fiske?

        Some of us would argue that, when an editor at one of the most prestigious scientific journals “doesn’t know any better,” that is indeed a major failure, IF (big if) one believes the editor should have known better. When Barry Bonds says, “Hey, I thought that injection was just vitamins,” one could argue that it might have been an honest mistake. That, too, is a good metaphor here.

        Andrew, I am not in the business of trying to change your view of whether this constitutes “commenting on scientific integrity.”

        However, I also think that those of the opinion that a discussion of the role of cronyism, lack of transparency, and failing to know better constitutes a discussion of scientific integrity (as per the 2nd and 3rd definitions I started this comment with) are not making a claim that is demonstrably ridiculous, false, or unjustified.

        For the record, I love your blog posts. My view is that many of them, not just this one, do discuss scientific integrity as per those definitions and both the scientific malpractice and steroids metaphors. Uli Schimmack routinely refers to these sorts of practices as scientific “doping.” You may disagree. Of course, I respect that disagreement.

        Lee Jussim

  19. Thanks for clarifying; I can certainly see that the quality of the analysis and the lack of data availability are not aligned with the hype. Though at least the main relationship appears quite pronounced. I have certainly seen worse. Take the Psych Science paper below, ‘Money Buys Happiness When Spending Fits Our Personality’, which got global media coverage recently for a tiny borderline-significant effect that was t = 2.07 without controls and t = 1.87 with controls (Table 3), based on non-available bank-transaction data and some major contortions of what the study set out to test even to get to that result!

    http://foxfellowship.yale.edu/sites/default/files/files/Money%20Buys%20Happiness%20When%20Spending%20Fits%20Our%20Personality%20(1).pdf

    • Henk:

      Wow. I will write more on this. I’m wondering (hoping) that this air rage study represents the beginning of the end of science hype. The himmicanes study was more ridiculous, sure, but it was pretty much always taken as no more than a silly feature story. The air rage study, though, seems to have been taken completely seriously—it was even featured in the Economist! I’m thinking (hoping) that once all these news organizations realize they’ve been duped, they’ll shift to a new level of skepticism.

      That is, my hope is that the air rage study will be the “cold fusion” of social science: the big, hyped story that shocks even everyday reporters into realizing that they’ve been playing somebody else’s game all these years.

  20. To be honest, none of the criticisms I’ve seen have landed a knockout blow. But I think I’ve found the flaw in the study’s methodology.

    It uses linear regression to correct for various factors, which is valid only if the factors have a linear relationship. However, the relationship between flight duration and air rage is probably not linear; more likely it’s an upward-curving line, roughly parabolic.

    When you try to correct for a parabolic relationship with a linear line of best fit, you will over-estimate incidents on short flights and under-estimate them on long flights, resulting in spurious coefficients being assigned to other variables that correlate with flight length – such as the existence of first-class seating.

    It’s a shame the underlying data is secret, because it would be fairly straightforward to check whether the air rage vs flight duration relationship is linear or not.
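    Here is roughly what that failure mode looks like in a toy simulation (a Python sketch with invented numbers, since the real data are secret). Risk grows quadratically with flight hours, first-class cabins are concentrated on long-haul routes, and the linear-in-hours fit hands the unmodeled curvature to the first-class dummy:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000

    # Invented world: incident risk is quadratic in flight hours, and
    # first-class cabins are much more common on long-haul flights.
    hours = rng.uniform(1, 12, n)
    first_class = rng.random(n) < np.where(hours > 9, 0.9, 0.1)
    incident = rng.random(n) < 0.002 * hours**2   # no true first-class effect

    # Linear-in-hours fit: the curvature leaks into the correlated dummy
    X = np.column_stack([np.ones(n), hours, first_class])
    print(np.linalg.lstsq(X, incident.astype(float), rcond=None)[0])
    # -> a clearly positive "first class" coefficient despite no true effect

    # Adding the quadratic term soaks up the curvature; the artifact fades
    X2 = np.column_stack([X, hours**2])
    print(np.linalg.lstsq(X2, incident.astype(float), rcond=None)[0])
    ```

    The point of the sketch is only that a dummy variable inherits whatever the duration term cannot express; with the real data one could test the functional form directly.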

    • Matt:

      I don’t think it should be necessary for a criticism to have a “knockout blow.” It’s enough to say that the data do not support the claims being made, in this case for various reasons including the difficulty of interpreting many coefficients at once in a linear regression fit to observational data (which is what you’re saying).

      To put it another way, I’m pushing against the “incumbency advantage” by which a paper published in PPNAS is assumed to be valid unless “knocked out.”

      • It should not be an absolute advantage, yes. But there ought to be some incumbency advantage, right? If not, why engage in the charade of referees & editors?

        We could select which papers to publish by random selection among candidates.

        • Rahul:

          That sort of signal tells you something in the absence of other information. But once we have some information (published in PPNAS, edited by Susan Fiske, and I actually read the paper), then the paper’s published status doesn’t tell us much. I’m not saying that all Fiske-edited PPNAS papers are bad, just that I wouldn’t give it any incumbency advantage.

      • I think criticism should point out fundamental flaws in methodology, rather than nit-picking over the precision of coefficients. If they dropped a few significant figures from their numbers and dialed back their claims from causation to mere correlation, you wouldn’t have a leg to stand on.

        Anyway, the flaw with the paper is more fundamental than just interpreting coefficients. Quite simply, linear regression is not a valid way to correct for variables that are inherently non-linear.

        I can’t be sure without seeing the raw data, but I would bet money that the relationship between flight duration and air rage is upward-curving. If that’s the case, they need to re-do their analysis.

        • Matt:

          Calling something “nitpicking” is of course a value judgment. Writing numbers to too many significant figures is a form of innumeracy. It does not imply that a paper is wrong, but it helps to give some insight when we do find serious errors.

          If the authors were to write their numbers with appropriate precision and if they were to dial back their claims from causation to mere correlation, the paper would indeed be better! But then I don’t think the paper would’ve been published. It’s not enough to say, We did least squares and these are our coefficients. These coefficients are only notable after being interpreted.

          Your comment, “linear regression is not a valid way to correct for variables that are inherently non-linear,” is also a statement about “just interpreting coefficients.” The least-squares regression coefficients are what they are; what I was saying in my above post, and what you were saying in your comment, was that there’s no clear way to interpret these coefficients, and that the causal interpretation given by the authors is inappropriate.

          Finally, you may be right about the relationship between flight duration and air rage, but even if there were a linear relation there, I still wouldn’t trust the regression as it relies on linearity and additivity in so many other places. The whole thing is a mess.

          To say it another way, the task of “just interpreting coefficients” is central to statistical practice! Coefficients without interpretation are meaningless.

        • Value judgements are very important & useful!

          If I send a paper to two students to review, and one finds mostly typos whereas the other discovers a subtle but fatal & deep error in logic, then I do make the value judgement that the second reviewer was hugely more useful.

          That said, I do find an indirect utility to spotting typos, non-significant digits, bad quality graphs & other such annoyances because I’ve noticed empirically that such sloppiness usually also correlates with just fundamentally bad work in general.

        • P.S. Regarding the significant figures: Remember that my audience here is not primarily the authors of this particular paper. One of my goals is to explore ideas of good statistical practice more generally, and numeracy is part of this.

        • Perhaps, intention vs outcome?

          i.e. When you critique a paper merely for not following good statistical practice, readers read it as a damning verdict on the quality of the entire paper.

        • Rahul:

          The paper is essentially useless because it followed poor statistical practice, in particular interpreting regression coefficients inappropriately. The authors’ error was not, I assume, intentional—but aside from the issue of intentionality, what they did was as wrong as what Marc Hauser did when he miscoded the videotapes of his monkeys.

          Good statistical practice is not a minor thing. When it comes to interpreting data, it’s everything.

          As for the significant digits, sure, this is a minor point and I never claimed otherwise. It’s ok to make minor points too.

        • Having finally read the paper, it looks like they don’t actually claim causation. They propose some hypotheses, run the analyses, and get statistically significant results. That’s good enough for a journal, unless there is a clear mistake in their methodology that would bias their results.

          And not being able to debunk the correlation is problematic to me. It would mean their hypothesis is plausible.

          Anyway, I’ve written a critique of their paper, pointing out the flaws and how to fix them. https://my20percent.wordpress.com/2016/05/14/does-first-class-seating-cause-air-rage/

        • Matt:

          You write, “That’s good enough for a journal, unless there is a clear mistake in their methodology that would bias their results.” There are clear mistakes in their methodology: (a) raw comparisons of incidence rates that don’t adjust for the number of passengers, (b) regression results that depend on additivity and linearity and are uninterpretable given the confounding between the different predictors, and (c) not sharing their data, making it that much harder to interpret (a) and (b). I don’t think that paper is “good enough for a journal” except in the empty sense that a journal actually published it. I suppose it’s good enough for a journal in the same sense that the himmicanes, ESP, and bible code papers were good enough for a journal! I certainly don’t think the study was good enough to be splashed all over NPR, the Economist, etc.

          The funny thing is, I followed your link, and it seems that you dislike the paper for the same reason I do (what I listed as reason b above). You don’t seem to mind the closed data—but all I can say is, I’ve seen people screw up with data enough times, that I’ve learned not to place much trust in conclusions based on data that I and others aren’t allowed to see. It also does not inspire confidence that the abstract of the paper alluded to the testing of predictions, yet no predictions were supplied before the data were analyzed.

        • I’m curious: as a reviewer, would you ever reject a paper just because they refuse to share commercially- or privacy-sensitive data?

          The mistakes in methodology are certainly a problem … except you didn’t catch them in your original critique. The points you *did* raise wouldn’t (in my opinion) justify rejection. Even the precision of the coefficients could be justified given their million-plus sample size.

          I must admit it’s a bit disappointing for me, as an outsider, that none of the experts – the researchers, the reviewers, or the debunkers – managed to spot the fundamental errors. But having seen some of the debunking attempts, I’m now concerned that good papers might be rejected for the wrong reasons.

        • Matt:

          As a reviewer, I do not accept or reject papers, I just make recommendations. If Susan Fiske had sent me this particular paper to review, I would’ve recommended rejection on the grounds that I don’t think the data analysis supports the claims made in the paper.

          And, yes, I did catch the mistakes in the methodology in my original critique. Here’s what I wrote in the above post, which was my original critique:

          The interpretation of zillions of regression coefficients, each one controlling for all the others. For example, “As predicted, front boarding of planes predicted 2.18-times greater odds of an economy cabin incident than middle boarding (P = 0.005; model 2), an effect equivalent to an additional 5-h and 58-min flight delay (0.7772 front boarding/0.1305 delay hours).” What does it all mean? Who cares!

          Story time: “We argue that exposure to both physical and situational inequality can result in antisocial behavior. . . . even temporary exposure to physical inequality—being literally placed in one’s “class” (economy class) for the duration of a flight—relates to antisocial behavior . . .”

          A charming reference in the abstract to testing of predictions, even though no predictions were supplied before the data were analyzed.

          They report a rate of incidents of 1.58 per thousand flights in economy seats on flights with first class, .14 per thousand flights in economy seats with no first class, and .31 per thousand flights in first class.

          It seems like these numbers are per flight, not per passenger, but that can’t be right: lots more people are in economy class than in first class, and flights with first class seats tend to be in bigger planes than flights with no first class seats. This isn’t as bad as the himmicanes analysis but it displays a similar incoherence.

          I didn’t explain all these points in detail—this was a blog post, not a textbook or even a referee report—but it was all there. To spell it out: My first quoted paragraph above addressed the problem of the regression coefficients, which is the same problem you noted in your comment. My second quoted paragraph is relevant in that the paper makes claims about human behavior which are not supported by their data. My third quoted paragraph pointed out the non-preregistered nature of the analysis. As always, preregistration is not required, but when this sort of completely open-ended study is not preregistered, this calls into question all claims of prediction accuracy and p-values. My fourth and fifth paragraphs address the point that the direct comparisons presented in the paper are uninterpretable.

          Finally, the data are unavailable so it is impossible for an outsider to evaluate any of these claims. If the data were public, I’d recommend publication under a much lower standard, because once the data are out there, others could do their own analyses.

          In any case, I appreciate your effort in comments to motivate me to explain my criticisms in more detail. As I said earlier, I’m getting so sick of this pseudo-scientific statistics-based crap, that it’s becoming harder for me to want to go to the trouble of spelling out my criticisms. At this point it’s starting to be like that joke where the comedians just call out numbers instead of saying the whole joke. “Forking paths!” “Closed data!” “Regression on observational data!” “Difference between significant and non-significant!” And so on.

          So, yes, you were wrong in your claim that none of the experts managed to spot the fundamental errors. I did in my post, and so did that airline blogger in his post. But your error is understandable in that we’re all getting a bit tired of having to deal with this sort of paper, and I think it’s irresponsible of the editors of PPNAS to keep publishing these things. Maybe Susan Fiske doesn’t know any better, but couldn’t the editor of PPNAS know better than to send these papers to Susan Fiske?

          And, finally, yes, good papers get rejected for the wrong reasons all the time! No need for you to be concerned about this as if it’s a new thing. I’ve had hundreds of papers rejected from journals, and some of them have been very good, extremely influential papers. Let me say this again. Good papers get rejected from journals all the time. All. The. Time. When our papers get rejected, we resubmit them to other journals. Had the air rage paper been rejected from PPNAS, it would’ve been accepted at Psychological Science or one of any number of other journals. It still would’ve appeared somewhere. But there was no obligation for the editors of NPR, etc., to fall for it.

        • Matt:

          Also, in one of the comments above, Carol wrote:

          The results are definitely due to suppressor effects, and the suppressor effects are due to (1) high correlations between some of the predictors and (2) correlations with different signs.

          This is another way of putting the same criticism that I made and that you made: these regression results are essentially uninterpretable. And, again, the garden of forking paths adds another twist to this because there are so many different regressions someone could fit to this data, so if you’re doing data analysis here you can get pretty much any conclusion you want. And, again, the data being unavailable make it that much more difficult for anyone to evaluate this work. All these problems together make the paper essentially worthless.
