More details on the Matthew Whitaker case from Brian Gratton and from Rick Shenkman. Shenkman even goes to the trouble of interviewing some of the people involved. It’s not pretty.
Quantitative literacy is tough! Or, I had no idea that, in 1958, 96% of Americans disapproved of interracial marriage!
And I was stunned, first by the data on interracial marriage and then, retrospectively, by my earlier ignorance of these data.
Was approval of interracial marriage only 4% in 1958? I had no idea. I looked it up at the Gallup site and it seems to be so. But it’s just hard for me to get my head around this. I mean, sure, I know that attitudes on racial issues have changed a lot, but still . . . 4 percent?? If you’d’ve asked me, I might have guessed 30 percent, or 20 percent, certainly I never would’ve given any number close to 4 percent.
I also learned from the Gallup report that “black-white marriages . . . still represent less than 1% of all married couples.” This sounded low to me, but then I thought about it: if 13% of Americans are black, and 7% of blacks marry white people, then this comes to 1% of all married couples. 7% is low, but it’s not as low as 1%.
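That back-of-the-envelope multiplication is easy to check directly (the inputs here are the rough figures from the discussion above, not exact census values):

```python
# Rough figures from the Gallup discussion above, not exact census values
black_share = 0.13        # share of Americans who are black
intermarry_rate = 0.07    # share of married black Americans with a white spouse
couples_share = black_share * intermarry_rate

print(f"{couples_share:.1%}")  # about 0.9% of all married couples
```

So a 7% intermarriage rate within a 13% minority group really does shrink to under 1% of all couples: the two percentages answer different questions.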
In some ways that second number (the total percentage of marriages that are between blacks and whites) is somewhat of a trick question. Still, I’m surprised how far off my intuition was on both these numbers (the rate of approval of interracial marriage in 1958, and the current percentage of marriages that are between whites and blacks). Indeed, I still can’t fit the 1958 approval number into my understanding of public opinion.
I’m reminded of our discussion with Charles Murray a couple years ago regarding the claim that “it is morally wrong for a woman to bring a baby into the world knowing that it will not have a father.” Murray and several commenters seemed convinced that it is “obsessively nonjudgmental” to not think it is morally wrong for a woman etc. It was a fun and somewhat disturbing discussion because I really couldn’t understand these commenters and they really couldn’t understand me.
I similarly have difficulty understanding how 96% of Americans in 1958 could’ve disapproved of interracial marriages. I mean, sure, the data are there, and I guess I could fashion a logical argument, something along the lines of 50% were just flat-out prejudiced and 46% were not personally prejudiced but felt that, in practice, an interracial marriage probably wouldn’t work out in a prejudiced world. Still, I never would’ve guessed the numbers would be so high. My continued astonishment here is a sign to me that I need to further rejigger my mental model of public opinion to handle this data point.
In addition to this point having general statistical relevance—the idea that a single data point, like a single story (as discussed in my paper with Basbøll) has the potential to falsify a model and transform one’s worldview—it also relates to what I think is a fundamental issue in political science: as I wrote in that earlier discussion:
We often have the tendency to think that our political opponents agree with us, deep down, even if they don’t want to admit it. Hence you see Thomas Frank trying to explain the phenomenon of ordinary conservative voters, or various conservative politicians insisting that ordinary blacks and hispanics are fundamentally conservative and are voting for Democrats by mistake, or Charles Murray imagining that my friends and I agree with him that it’s wrong for a woman to have a baby without a male partner. . . . [and this contributes to] the difficulty of understanding across demographic, geographic, and partisan divides.
Noted psychology researchers and methods skeptics Leif Nelson and Uri Simonsohn write:
A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.
For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest levels of talented players win fewer games.
The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”
So far, so good. But then they come up with this stunner:
If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and whether plotting the resulting equation yields the predicted u-shape.
This is horrible! Not a negative comment on Leif and Uri, who don’t like that approach and suggest a different analysis (which I don’t love, but which I agree that for many purposes would be better than simply fitting a quadratic), but a negative comment on their social circle.
If “everyone you talk to over several weeks” gives a bad idea, maybe you should consider talking about statistics problems with more thoughtful and knowledgeable statisticians.
I’m not joking here.
But, before going on, let me emphasize that, although I have some disagreements with Leif and Uri on their methods, I generally think their post is clear and informative and like their general approach of forging strong links between the data, the statistical model, and the research question. Ultimately what’s most important in these sorts of problems is not picking “the right model” or “the right analysis” but, rather, understanding what the model is doing.
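To see why the quadratic-regression recipe can mislead, here is a minimal simulated sketch (my own illustration, not Leif and Uri's analysis): fit a quadratic to data whose true relationship is monotone by construction, and the fitted curve still bends.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = np.log(x) + rng.normal(0, 0.1, size=x.size)  # monotone: no true u-shape

# The recipe under discussion: fit y = b0 + b1*x + b2*x^2
b2, b1, b0 = np.polyfit(x, y, 2)

print(b2 < 0)          # True: a "significant-looking" concave quadratic term
print(-b1 / (2 * b2))  # the peak the fitted parabola implies
```

The quadratic term comes out negative simply because log is concave; the fitted parabola eventually turns downward even though the underlying curve never does. Checking the sign or significance of β2 tests curvature, not non-monotonicity, which is exactly the trap an alternative analysis has to avoid.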
Who should we be talking to?
Now let me return to my contention that Leif and Uri are talking with the wrong people.
Perhaps it would help, when considering a statistical problem, to think about five classes of people who might be interested in the results and whom you might ask about methods:
1. Completely non-quantitative people who might be interested in the substantive claim (in this case, that sports teams can perform worse when they have too much talent) but have no interest in how it could be estimated from data.
2. People with only a basic statistical education: these might be “civilians” or they could be researchers—perhaps excellent researchers—who focus on the science and who rely on others to advise them on methods. These people might well be able to fit the quadratic regression being considered, and they could evaluate the advice coming from Leif and Uri, but they would not consider themselves statistical experts.
3. Statisticians or methodologists (I guess in psychology they’re called “psychometricians”) who trust their own judgment and might teach statistics or research methods and might have published some research articles on the topic. These people might make mistakes in controversial areas (recommending a 5th-degree polynomial control in a regression discontinuity analysis or, as in the example above, naively thinking that a quadratic regression fit demonstrates non-monotonicity).
4. General experts in this area of statistics: people such as Leif Nelson and Uri Simonsohn, or E. J. Wagenmakers, or various other people (including me!), who (a) possess general statistical knowledge and (b) have thought about, and may have even worked on, this sort of problem before, and can give out-of-the-box suggestions if appropriate.
5. Experts in this particular subfield, which might in this case include people who have analyzed a lot of sports data or statisticians who specialize in nonlinear models.
My guess is that the people Leif and Uri “talked to over the last several weeks” were in categories 2 and 3. This is fine—it’s useful to know what rank-and-file practitioners and methodologists would do—but it’s also a good idea to talk with some real experts! In some way, Leif and Uri don’t need this, as they themselves are experts, but I find that conversations with top people can give me insights.
I (almost and inadvertently) followed Dan Kahan’s principles in my class today, and that was a good thing (would’ve even been more of a good thing had I realized what I was doing and done it better, but I think I will do better in the future, which has already happened by the time you read this; remember, the blog is on a nearly 2-month lag)
As you might recall, the Elizabeth K. Dollard Professor says that to explain a concept to an unbeliever, explain it conditionally. For example, if you want to talk evolution with a religious fundamentalist, don’t try to convince him or her that evolution is true; instead preface each explanation with, “According to the theory of evolution . . .” Your student can then learn evolution in a comfortable manner, as a set of logical rules that explain certain facts about the world. There’s no need for the student to believe or accept the idea that evolution is a universal principle; he or she can still learn the concepts. Similarly with climate science or, for that matter, rational choice or various other political-science models.
Anyway, in my Bayesian data analysis class, I was teaching chapter 6 on model checking, and one student asked me about criticisms of posterior predictive check as having low power, and another asked why we don’t just do cross-validation. I started to argue with them and give reasons, and then I paused and gave the conditional explanation, something like this:
It’s important in some way or another to check your model. If you don’t, you can run into trouble. Posterior predictive check is one way to do it. Another way is to compare inferences and predictions from your model to prior information that you have (i.e., “do the predictions from your model make sense?”); this is another method discussed in chapter 6. Yet another approach is sensitivity analysis, and there’s also model expansion, and also cross-validation (discussed in chapter 7). And all sorts of other tools that I have not personally found useful but others have.
Posterior predictive checking is one such tool. It’s not the only tool out there for model checking, but a lot of people use it, and a lot of people (including me) think it makes a lot of sense. So it behooves you to learn it, and also to learn why people like me use this method, despite various criticisms.
So what I’m going to do is give you the internal view of posterior predictive checking. If you’re going to try this method out, you’re gonna want to know how it works and why it makes sense to people coming from my perspective.
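For concreteness, here is a toy sketch of a posterior predictive check (my own minimal example, not anything from the class): a normal model with known sigma and a flat prior, where we ask whether replicated datasets from the fitted model reproduce an observed test statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(5, 2, size=50)   # "observed" data, simulated here for the demo
sigma, n = 2.0, y.size

# With known sigma and a flat prior, the posterior for mu is
# Normal(ybar, sigma^2 / n); draw from it directly
mu_draws = rng.normal(y.mean(), sigma / np.sqrt(n), size=1000)

# For each posterior draw, simulate a replicated dataset and compute a
# test statistic; here, the sample maximum
T_obs = y.max()
T_rep = np.array([rng.normal(mu, sigma, size=n).max() for mu in mu_draws])

ppp = np.mean(T_rep >= T_obs)   # posterior predictive p-value
```

A ppp near 0 or 1 flags an aspect of the data the model fails to reproduce. Here the model is correct by construction, so we would expect a non-extreme value; with real data, an extreme value is an invitation to expand the model.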
I didn’t quite say it like that, and in fact I only thought about the connection to Kahan’s perspective later, but I think this is the right way to teach it, indeed the right way to teach any method. You don’t have to “believe” in posterior predictive checks to find them useful. Indeed, I don’t “believe” in lasso but I think it has a lot to offer.
The point here is partly to reach those students who might otherwise be resistant to this material, but also more generally to present ideas more conditionally, to give some separation between the internal logic of a method and its justification.
But, hey, it was published in the prestigious Proceedings of the National Academy of Sciences (PPNAS)! What could possibly go wrong?
Here’s what Erik Larsen writes:
In a paper published in the Proceedings of the National Academy of Sciences, People search for meaning when they approach a new decade in chronological age, Adam L. Alter and Hal E. Hershfield conclude that “adults undertake a search for existential meaning when they approach a new decade in age (e.g., at ages 29, 39, 49, etc.) or imagine entering a new epoch, which leads them to behave in ways that suggest an ongoing or failed search for meaning (e.g., by exercising more vigorously, seeking extramarital affairs, or choosing to end their lives)”. Across six studies the authors find significant effects of being a so-called 9-ender on a variety of measures related to meaning searching activities.
Larsen links to news articles in the New Republic,
Washington Post, Huffington Post, Salon, Pacific Standard, ABC News, and the British Psychological Society, all of them entirely uncritical reports, and continues:
I [Larsen] show that each of the six studies in the paper contains at least one crucial deficiency hindering meaningful inferences. In several of the studies the results stem from the fact that the end digits are not comparable: 9, for example, is more likely to be found in younger age decades, as the age range in all the studies is 25 to 65. In other words, if people are more likely to engage in an activity at a younger age than later in life, higher end digits are more likely to capture such differences than lower end digits. When controlling properly for age, the differences reported in some of the studies fail to reach statistical significance. In other studies, results were questionable due to empirical shortcomings.
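Larsen's core objection, that end digits are confounded with age in a bounded age window, can be seen with no data at all. This little calculation is my own illustration, using a 25-64 window:

```python
# Ages 25-64: each end digit occurs at exactly four ages, but the
# *average* age differs by digit, so "9-ender" is partly an age proxy
ages = range(25, 65)
mean_age = {d: sum(a for a in ages if a % 10 == d) / 4 for d in range(10)}

print(mean_age[9])   # 44.0  (9-enders: 29, 39, 49, 59)
print(mean_age[4])   # 49.0  (4-enders: 34, 44, 54, 64)
print(mean_age[5])   # 40.0  (5-enders: 25, 35, 45, 55)
```

Any behavior that rises or falls with age will therefore load differently on different end digits, which is why proper age controls flip several of the paper's results.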
Larsen conveniently made the data accessible via R code in this Github repository.
You can follow the link for all of Larsen’s comments but here are a few:
In Study 1, the authors use data from the World Values Survey. They conclude that “9-enders reported questioning the meaning or purpose of life more than respondents whose ages ended in any other digit”. However, if one takes age decades into consideration, the 9-ender effect fails to reach statistical significance (p=0.71) despite the sample size of 42,063.
In the replication material, some respondents from the World Values Survey are excluded. In the full data set (obtained at http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp), there are 56,145 respondents in the age range 25-64. The difference between 9-enders and the other respondents on the Meaning variable is 0.001 and is non-significant . . .
In Study 2, age decades are less relevant due to the experimental design. Here the authors find significant effects of the experimental stimuli, and the effects seem robust. However, no randomization tests are reported. One sees that the experimental group differ systematically from both the baseline and control group on pre-randomization measured variables. People in the experimental group are, for example, on average over 6 years younger than the baseline group (p<0.001) and have a higher distance to a new decade than the control group (p=0.02), questioning the possibility to draw design-based inferences. . . .
In Study 3, the authors argue that 9-enders are more likely to seek extramarital affairs. What did the authors do here? They “categorized 8,077,820 male users, aged between 25 and 64 years, according to the final digit of their ages. There were 952,176 9-enders registered on the site”.
Before looking at the data, one can ask two questions. First, whether those aged 30 are more likely to report the age of 29, and so on. The authors argue that this is not the case, but I am not convinced. Second, and something the authors don’t discuss, whether age is a good proxy for the age users had when they signed up at the specific site (unless everybody signed up in the months before November 2013). . . .
In Study 4, the authors look at suicide victims across the United States between 2000 and 2011 and show that 9-enders are more likely to commit suicide. Interestingly, the zero-order correlation between suicide rate and 9-enders is non-significant. In the model estimated by the authors (which shows a significant result), when controlling properly for age, the effect of 9-enders fails to reach statistical significance (p=0.13).
In Study 5, the authors show that runners completed marathons faster at ages 29 and 39. However, simple tests show that there is no statistical evidence that people run faster when they are 29 compared to when they are 30 (p=0.2549) or 31 (p=0.8908), and the same is the case for when they are 39 compared to 40 (p=0.9285) and 41 (p=0.5254). Hence, there is no robust evidence that it is the 9-ender effect which drives the results reported by the authors. . . .
To relay these comments is not to say there is nothing going on: people are aware of their ages and it is reasonable to suppose that they might behave differently based on this knowledge. But I think that, as scientists, we’d be better off reporting what we actually see rather than trying to cram everything into a single catchy storyline.
Of all the news articles, I think the best was this one, by Melissa Dahl. The article was not perfect—Dahl, like everybody else who reported on this article in the news media, was entirely unskeptical—but she made the good call of including a bunch of graphs, which show visually how small any effects are here. The only things that jump out at all are the increased number of people on the cheating website who say they’re 44, 48, or 49—but, as Larsen says, this could easily be people lying about their age (it would seem like almost a requirement, that if you’re posting on a cheaters site, that you round down your age by a few years)—and the high number of first-time marathoners at 29, 32, 35, 42, and 49. Here’s Dahl’s summary:
These findings suggest that in the year before entering a new decade of life, we’re “particularly preoccupied with aging and meaningfulness, which is linked to a rise in behaviors that suggest a search for or crisis of meaning,” as Hershfield and Alter write.
Which just seems ridiculous to me. From cheaters lying about their ages and people more likely to do their first marathon at ages 29, 32, 35, 42, and 49, we get “particularly preoccupied with aging and meaningfulness”??? Sorry, no.
But full credit for displaying the graphs. Again, I’m not trying to single out Dahl here—indeed, she did better than all the others by including the graphs, it’s just too bad she didn’t take the next step of noticing how little was going on.
One problem, I assume, is the prestige of the PPNAS, which could distract a reporter from his or her more skeptical instincts.
Here, by the way, is that graph of suicide rates that appeared in Dahl’s news article:
According to the graph, more people actually commit suicide at 40 than at 39, and more do it at 50 than at 49, while more people kill themselves at 58 than 59. So, no, don’t call the suicide hotline just yet.
P.S. To get an image for this post, I started by googling “hype cycle” but all I got was this sort of thing:
All these are images of unidirectional curves; none are cycles! Then I went on wikipedia and found that “hype cycle” is “a branded graphical tool.” How tacky. If you’re going to call something a cycle, you gotta make it return to its starting point!
The hype cycle
The “cycle” of which I speak goes something like this:
1. Researcher is deciding what to work on.
2. He or she hears about some flashy paper that was published in PNAS under the auspices of Susan Fiske, maybe something like himmicanes and hurricanes. The researcher, seeing this, sees that you can get fame and science chits by running a catchy psych experiment or statistical analysis and giving it a broad interpretation with big social implications.
3. The analysis is done, it’s submitted to PNAS.
4. PNAS sends it to Susan Fiske, who approves it.
5. The paper appears and gets huge exposure and uncritical media attention, both in general and (I’m sorry to say) within the field of psychology.
Return to 1: Another researcher is deciding what to work on. . . .
That’s the hype cycle: The press attention and prestige publication motivates more work of this kind.
P.S. Let me emphasize: I’m not opposed to this sort of work. I think the age analysis is clever (and I mean that in a good way), and I think it’s great that this sort of thing is being done. But, please, can we chill on the interpretations? And, journalists (both in general and within our scientific societies), please report this with a bit of skepticism? I’m not saying you need to quote an “opponent” of the study, just don’t immediately jump to the idea that the claims are generally valid, just because they happened to appear in a top journal.
Remember, PNAS published the notorious “himmicanes and hurricanes” study.
Remember, the Lancet published the notorious Iraq survey.
Remember, Psychological Science published . . . ummm, I think you know where I’m going here.
Reporting with skepticism does not require “debunking.” It’s just a matter of being open-minded about the possibility that claimed results do not really generalize to the larger populations or questions of interest. In the above case, that would imply questioning a claim such as, “We know that the end of a perceived era prompts us to make big life decisions,” which is hardly implied by some data showing that a bunch of 42-year-olds and 49-year-olds are running marathons.
Mon: The hype cycle starts again
Tues: I (almost and inadvertently) followed Dan Kahan’s principles in my class today, and that was a good thing (would’ve even been more of a good thing had I realized what I was doing and done it better, but I think I will do better in the future, which has already happened by the time you read this; remember, the blog is on a nearly 2-month lag)
Wed: Leif and Uri need to hang out with a better class of statisticians
Thurs: Quantitative literacy is tough! Or, I had no idea that, in 1958, 96% of Americans disapproved of interracial marriage!
Fri: Arizona plagiarism update
Sat: Unstrooping names
Sun: A question about varying-intercept, varying-slope multilevel models for cross-national analysis
Besides family values, that is?
Both these politicians seem to have a problem with the National Weather Service:
Santorum also accused the weather service’s National Hurricane Center of flubbing its forecasts for Hurricane Katrina’s initial landfall in Florida, despite the days of all-too-prescient warnings the agency had given that the storm would subsequently strike the Gulf Coast.
Governor Cuomo’s attempt to scapegoat the National Weather Service for an inaccurate forecast in advance is not only completely in error—the NWS did an outstanding job—but is a disservice to the public and to the hard-working staff of this federal agency. No forecast of such an historical disaster is going to be absolutely perfect, but no one who lives here can say this event was not well forecast in advance, or that the warning headlines of its impact to come were not well explained in advance…his statement is disinformation, purposeful or ill-informed.
Hey, politicians are politicians, they have to make lots of compromises. But, as a statistician, I’m repulsed by this sort of anti-data attitude coming from either political party.
. . . and Kaiser Fung is unhappy. In a post entitled, “Princeton’s loss of nerve,” Kaiser writes:
This development is highly regrettable, and a failure of leadership. (The new policy leaves it to individual departments to do whatever they want.)
The recent Alumni publication has two articles about this topic, one penned by President Eisgruber himself. I’m not impressed by the level of reasoning and logic displayed here.
Eisgruber’s piece is accompanied with a photo, captioned thus:
The goal of Princeton’s grading policy is to provide students with meaningful feedback on their performance in courses and independent work.
Such a goal [writes Kaiser] is far too vague to be practical. But let’s take this vague policy at face value. How “meaningful” is this feedback when 40% of grades handed out are As, and 80% of grades are either As or Bs? (At Stanford, Harvard, etc., the distributions are even more skewed.)
Here are some data:
My agreement with Kaiser
As a statistician, I agree with Kaiser that if you want grades to be informative, it makes sense to spread out the distribution. It’s an obvious point and it is indeed irritating when the president of Princeton denies or evades it.
I’d also say that “providing students with meaningful feedback” is one of the least important functions of course grades. What kind of “meaningful feedback” comes from a letter (or, for that matter, a number) assigned to an entire course? Comments on your homeworks, papers, and final exams: that can be meaningful feedback. Grades on individual assignments can be meaningful feedback, sure. But a grade for an entire course, not so much.
My impression is that the main functions of grades are to motivate students (equivalently, to deter them from not doing what it takes to get a high grade) and to provide information for future employers or graduate schools. For these functions, as well as for the direct feedback function, more information is better and it does not make sense to use a 5-point scale where 80% of the data are on two of the values.
One can look at this in various ways but the basic psychometric principle is clear. For more depth, go read statistician Val Johnson’s book on grade inflation.
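One way to make "more information is better" concrete is to compute the Shannon entropy of the grade distribution. This is my own toy illustration; the skewed distribution below is only a rough stand-in for the 40%-A, 80%-A-or-B figures Kaiser cites:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

spread = [0.2, 0.2, 0.2, 0.2, 0.2]       # A/B/C/D/F used evenly
skewed = [0.40, 0.40, 0.15, 0.04, 0.01]  # rough stand-in for current practice

print(round(entropy_bits(spread), 2))   # 2.32 bits, the most a 5-letter scale can carry
print(round(entropy_bits(skewed), 2))   # 1.72 bits, same scale, less information
```

A grade scale where 80% of the mass sits on two letters transmits noticeably less information per grade than one that uses the whole scale, which is the psychometric point in quantitative form.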
OK, now for my disagreement or, maybe I should say, my discomfort with Kaiser’s argument.
Grad school grades.
In grad school we give almost all A’s. I’m teaching a course on statistical communication and graphics, and I love the students in my class, and I might well give all of them A’s.
In other grad classes I have lots of grading of homeworks and exams and I’ll give mostly A’s. I’ll give some B’s and sometimes students will complain about that, how it’s not fair that they have to compete with stat Ph.D. students, etc.
The point is, if I really believe Kaiser’s principles, I’d start giving a range of grades, maybe 20% A, 20% B, 20% C, 20% D, 20% F. But of course that wouldn’t work. Not at all. I can’t do it because other profs aren’t doing it. But even if all of Columbia were to do it . . . well, I have no idea, it’s obviously not gonna happen.
Anyway, my point is that Princeton’s motivation may well be the same as mine: yes, by giving all A’s we’re losing the ability to give feedback in this way and we’re losing the opportunity to provide a useful piece of information to potential employers.
But, ultimately, it’s not worth the trouble. Students get feedback within their classes, they have internal motivation to learn the material, and, at the other end, employers can rely on other information to evaluate job candidates.
From that perspective, if anyone has the motivation to insist on strict grading for college students, it’s employers and grad schools. They’re the ones losing out by not having this signal, and indirectly if students don’t learn the material well because they’re less well motivated in class.
In the meantime, it’s hard for me to get mad at Princeton for allowing grades to rise, considering that I pretty much give all A’s myself in graduate classes.
P.S. Kaiser concludes:
The final word appears to be a rejection of quantitative measurement. Here’s Eisgruber:
The committee wisely said: If it’s feedback that we care about, and differentiating between good and better and worse work, that’s what we should focus on, not on numbers.
The wisdom has eluded me.
Let me add my irritation at the implicit equating of “wisdom” with non-quantitative thinking.
Tweeting has its virtues, I’m sure. But over and over I’m seeing these blog vs. twitter battles where the blogger wins. It goes like this: blogger gives tons and tons of evidence, tweeter responds with a content-free dismissal.
The most recent example (as of this posting; remember we’re on an approx 2-month delay here; yes, this site is the blogging equivalent of the “slow food” movement), which I heard about on Gawker (sorry), is Slate editor Jacob Weisberg, who took a break from tweeting items such as “How Can You Tell if a Designer Bag Is Fake?” to defend serial plagiarist Fareed Zakaria:
(Just to interject: to say that Zakaria is a serial plagiarist is not to say he has nothing to offer as a journalist. Plagiarists ranging from Martin Luther King to Doris Kearns Goodwin to Laurence Wayne Tribe have been functional members of society when not copying the work of others without attribution. OK, I’m kidding about that “Wayne” thing; I just thought Tribe needed a middle name too, to fit with his distinguished associates.)
OK, back to the story. The tweet is Weisberg’s empty defense of Zakaria’s plagiarism. OK, not completely empty, but what can you do in 140 characters? Weisberg’s point is that it’s ok to quote without attribution on TV because on TV there’s no room for footnotes. Fair enough. I can tell you right now that (on the rare occasions that) I go on radio or TV, I would never use someone else’s words without attribution, but Zakaria’s a lot busier than I am, and a lot more practiced in using other people’s words, so maybe the guy just can’t help it.
OK, back to the story. (Hey, I keep saying that! Maybe we’re having too many digressions here. Anyway . . . ) Now comes the blogger. “@blippoblappo & @crushingbort” are blogging, they have as much space as they want, indeed their only limitation is their own capacity to get bored by their own writing. OK, blogging is rarely crisp, but they have the space to make their points. They don’t need to pick and choose, they can say everything they need to say.
In particular, in their post, “Yes, the indefensible Fareed Zakaria also plagiarized in his fancy liquor columns for Slate,” they have this:
Winner: blog. It’s not even close. When it’s information vs. soundbites, information wins.
Of course this is not to say that bloggers are always correct or that every tweeter is wrong. Not at all. But I do think there’s a benefit to being able to lay out the whole story in one place.
Rogier Kievit sends in this under the heading, “Worst graph of the year . . . horribly unclear . . . Even the report doesn’t have a legend!”:
It’s horrible but I still think the black-and-white Stroop test remains the worst visual display of all time:
What’s particularly amusing about the Stroop image is that it does have color—but only in its label!
Aahhhh, much better.