To Throw Away Data: Plagiarism as a Statistical Crime

I’ve been blogging a lot lately about plagiarism (sorry, Bob!), and one thing that’s been bugging me is: why does it bother me so much? Part of the story is simple: much of my reputation comes from the words I write, so I bristle at any attempt to devalue words. I feel the same way about plagiarism that a rich person would feel about counterfeiting: Don’t debase my currency!

But it’s more than that. After discussing this a bit with Thomas Basbøll, I realized that I’m bothered by the way that plagiarism interferes with the transmission of information:

Much has been written on the ethics of plagiarism. One aspect that has received less notice is plagiarism’s role in corrupting our ability to learn from data: We propose that plagiarism is a statistical crime. It involves the hiding of important information regarding the source and context of the copied work in its original form. Such information can dramatically alter the statistical inferences made about the work.

In statistics, throwing away data is a no-no. From a classical perspective, inferences are determined by the sampling process: point estimates, confidence intervals and hypothesis tests all require knowledge of (or assumptions about) the probability distribution of the observed data. In a Bayesian analysis, it is necessary to include in the model all variables that are relevant to the data-collection process. In either case, we are generally led to faulty inferences if we are given data from urn A and told they came from urn B.

A statistical perspective on plagiarism might seem relevant only to cases in which raw data are unceremoniously and secretively transferred from one urn to another. But statistical consequences also result from plagiarism of a very different kind of material: stories. To underestimate the importance of contextual information, even when it does not concern numbers, is dangerous.
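
To make the urn point concrete, here is a minimal simulation sketch. It is not from the article: the urn compositions (30% and 50% red), the sample size, and the function names are all assumptions chosen purely for illustration. It checks how often a standard 95% interval covers urn B's true proportion when the draws really do come from urn B, versus when they quietly come from urn A.

```python
import random
from math import sqrt

# Hypothetical urn compositions (assumed for illustration, not from the article).
P_URN_A = 0.30  # true share of red balls in urn A
P_URN_B = 0.50  # true share of red balls in urn B


def wald_interval(successes, n, z=1.96):
    """Approximate 95% interval for a proportion (normal approximation)."""
    p_hat = successes / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p_source, p_claimed, n=200, trials=2000, seed=0):
    """Fraction of simulated intervals that contain p_claimed.

    Draws come from an urn with red share p_source, but the analyst
    believes they describe an urn with red share p_claimed.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reds = sum(rng.random() < p_source for _ in range(n))
        low, high = wald_interval(reds, n)
        hits += low <= p_claimed <= high
    return hits / trials


# Data honestly labeled: drawn from urn B, interpreted as urn B.
honest = coverage(P_URN_B, P_URN_B)
# Data mislabeled: drawn from urn A, interpreted as urn B.
mislabeled = coverage(P_URN_A, P_URN_B)

print(f"coverage with correct provenance:   {honest:.3f}")
print(f"coverage with mislabeled provenance: {mislabeled:.3f}")
```

Under these assumed settings, the honestly labeled draws should give coverage close to the nominal 95%, while the mislabeled draws should almost never cover urn B's true value. The arithmetic in the interval is identical in both cases; only the provenance differs.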

Here’s our full article (which has just appeared in the American Scientist). It features two of the recurring characters from this blog. Here’s our conclusion:

Scholars in fields ranging from psychology to history to computer science have recognized that stories are part of how people understand the world. As statisticians, we can consider reasoning from stories as a form of approximate inference. From this perspective, statistical principles should provide some approximate guidance about the potential biases and precision of such inferences. One key principle is not to throw away information and, if discarding data is for some reason necessary, to describe as clearly as possible the mechanism by which the relevant information was excluded. Plagiarism violates both these rules and, as such, is a violation of statistical ethics, beyond any other considerations of moral behavior.

P.S. I’m more interested in scientific plagiarism than the legal or literary variety, but this 2004 news article by Daniel Hemel and Lauren Schuker (which I found by googling *laurence tribe plagiarism*) is full of good quotes. Here’s my favorite part:

Tribe’s mea culpa comes just three weeks after another prominent Harvard faculty member—Climenko Professor of Law Charles J. Ogletree—publicly apologized for copying six paragraphs almost word-for-word from a Yale scholar in a recent book, All Deliberate Speed.

Last fall, Frankfurter Professor of Law Alan M. Dershowitz also battled plagiarism charges. And in 2002, Harvard Overseer Doris Kearns Goodwin admitted that she had accidently copied passages from another scholar in her bestseller The Fitzgeralds and the Kennedys.

University President Lawrence H. Summers told The Crimson in an interview last week—before the allegations against Tribe surfaced—that he did not see “a big trend” of plagiarism problems at the Law School as a result of the charges against Ogletree and Dershowitz, but indicated that a third case would change his mind.

“If you had a third one, then I would have said, okay, you get to say this is a special thing, a focused problem at the Law School,” Summers said of the recent academic dishonesty cases.

He declined comment last night.

32 thoughts on “To Throw Away Data: Plagiarism as a Statistical Crime”

  1. I find it interesting that in the “Acknowledgment” to the American Scientist article you state “Parts of this essay are adapted from Gelman’s blog, Statistical Modeling, Causal Inference, and Social Science, at http://statmodeling.stat.columbia.edu.” Would you have included that statement if you were the article’s sole author? When I wrote my recent book, I discussed with my editor whether to use quotes and individual citations for portions of the book that had previously appeared on my blog. We both felt that would impede the reader’s flow through the material and be unnecessarily pedantic. (Look at me! I’m quoting myself!) So, in the end, cognizant of self-plagiarism but feeling that a blog didn’t carry the same weight as published material, we settled on a blanket statement near the beginning of the book: “Some of this material has previously appeared on the author’s blog [reference].”

    • Rick:

      We added this acknowledgement at the suggestion of our editor, who didn’t want us to be vulnerable to some smartass making an aha-type gotcha at us for self-plagiarism. Usually when I take stuff from my blog, I don’t do the citation.

  2. I just finished a class on Deuteronomy. We went over the descriptions in Exodus and saw how they’d been edited a few different times to incorporate a Deuteronomistic world view. And that world view was itself composed of layers built around an older core. The retelling of Deuteronomy – the literal meaning, if you didn’t know – is then the model for the Christian retelling in John and then the Quranic retelling and so on. Each of these presents itself as “the story” though each does violence to the previously existing versions.

    I whole-heartedly agree that hiding the source distorts understanding, but then I’ve met only a rare handful of Christians who know anything much about the Jewish roots of Christianity, let alone the textual histories of their own stories. This isn’t even a conversation in nearly all of Islam because it is still essentially impossible for Muslims to engage in public discussion about the Quran’s development.

    We can look at the creation stories in Genesis. The second one is older but the first, more famous one doesn’t acknowledge that.

  3. Andrew says in the above:
    “One key principle is not to throw away information and, if discarding data is for some reason necessary, to describe as clearly as possible the mechanism by which the relevant information was excluded. Plagiarism violates both these rules and, as such, is a violation of statistical ethics, beyond any other considerations of moral behavior.”
    However, sometimes things get so complicated that the main statistical idea is lost unless some data are discarded. As an example, consider the famous and numerically concise Leo Durocher quote: “Nice guys finish last.”
    I was alive when he said it and remember it well. Except that he didn’t say it. In those days there were eight teams in the National League and he said “Nice guys finish seventh.” Or maybe he said something else. Go to http://www.phrases.org.uk/bulletin_board/30/messages/1839.html, which summarizes the book by Ralph Keyes, “Nice Guys Finish Seventh”: False Phrases, Spurious Sayings, and Familiar Misquotations.
    And while we are at it, instead of Vince Lombardi describing a statistically and mathematically unique event, it was the actor John Wayne playing the role of a football coach who first uttered the immortal phrase “Winning isn’t everything, it’s the only thing!” in a 1953 film called “Trouble Along the Way.” But according to Keyes, it wasn’t Wayne but his young daughter who spoke these eternal truths.
    As I wrote, some information has to be thrown away or we will be bogged down in minutiae.

  4. I’ll play Devil’s Advocate: The current trend towards citation-bloat does make reading scientific literature rather irritating at times. Just breaks the flow of reading.

    In my field at least, I notice older papers being rather parsimonious and discerning about citations. Now, this could just reflect on the fact that there’s just a lot more previous work now. OTOH, I wonder if it has just become more fashionable to cite liberally? Or perhaps wanton citations to please people that matter? Or to avoid the risk of plagiarism? Or to encourage quid pro quo citation index / impact factor boosting? I don’t know.

    This isn’t a rant to encourage plagiarism, only an appeal for moderation.

    • I’m with you. Certain fields are so ate up with citation bloat it’s impossible to rest your eyes on a single clear English sentence. A typical example might be:

      The literature [Fitsgerald 1966, Rufus 1998] indicates psychological participants [Gerlado 2001, Herodotus 223] perceive the sun as hotter [Lizard Man 2013, Queen Mother 1988] than the moon [Cowen 1992, Tyler 1993, Dallas Cowboy Cheerleaders et al 1979].

      Entire papers look like that. It almost seems like some fields actually require every phrase, no matter how trivial, to have a citation for publication. It looks to be a bigger problem in fields that are more self conscious about their status as a real science. Either that, or authors are getting paid by the citation.

      Whenever I look at papers written in English between, say, 1850 and 1950, I’m always shocked at how much more clearly written they are.

    • What I find frustrating is the style where citations in text are written as [1,2,3] and in references give the journal and pages but not the titles of the articles. When the titles of the articles are not in the citation, I find it hard to figure out where to follow up.

      In my books I hold off the citations until the end of the chapters or book, using separate bibliographic notes.

      Direct quotations, though, I indicate using quotation marks.

      • Yes, this! Especially in online versions of the articles, where the space constraints for that are especially artificial.

  5. There seems to be a general sense that “you can’t make an island” out of a story, or a specific contrast out of a study, etc., without likely frustrating or even misleading its proper interpretation (in context). Here, even if you _steal_ a story, it can’t be properly interpreted without knowing where it was _stolen_ from. Misinterpretation is much worse than lack of respect for property or getting something without paying for it.

    I remember it being very challenging to convince many statisticians that being involved in just one study did not make the related current and past studies irrelevant to their analysis (certainly not if findings and conclusions were arising in and being communicated from that one study).

    As for the earlier literature not providing many citations, that is one reason why we can’t discern, for instance, where RA Fisher drew his early ideas from.

    • It may be that following Fisher’s influences is difficult. Jaynes speculates that Fisher got his early facility in multivariate integrals from working with physicists who were doing exactly those kinds of integrals in statistical mechanics in the years immediately before his F distribution stuff. Also, Fisher’s reputation is that of being the kind of jerk who wouldn’t give credit where credit is due if it went to someone he disagreed with.

      On the other hand, building on what Rahul mentioned above about citation bloat, RA Fisher’s papers are still a pleasure to read after 60-70 years. How many statistics-heavy papers published today do you think people will be saying the same about 70 years from now?

      • Sure, the citation bloat is annoying, but I think one of the aspects about early papers being more readable is that they’re plucking the low hanging fruit. Another aspect is that people used to read and write more, their ability to communicate was generally better, and scientists tended to write in languages they were fluent in. A huge amount of modern English-language scientific writing is written by researchers for whom English is a second language that they learned rather late in life.

        Now there’s a lot more science known, and so science students spend more time studying science and a LOT less time studying literature and philosophy and other things that might hone written communication skills.

        • Daniel,

          I’m reminded of the immortal words of Clifford Truesdell in regards to an early paper by Stokes:

          “Both for method and for style, this paper would be rejected by the secret referees of any society’s journal today. The language of Stokes is plain English; today’s crab-dance of German noun piles tottering among Dutch passives and impersonals with quotation marks, dashes, and parentheses to string together street slang, Latin fustian, and ad-men’s gabble was still to be invented as the ‘scientific style’. To modern official modesty, the use of the personal pronoun ‘I’ may suggest the pontificating of an elderly authority; in fact, it was the honesty of an unknown beginner, twenty-six years old, who was thinking, and who admitted it.”

          -Essays in the History of Mechanics

          I know you’re a fan of Mechanics, so you probably won’t need much convincing that Stokes’s research wasn’t lower hanging fruit than what we see published today in Psychology or Economics. Nor has anyone claimed Stokes plagiarized without attribution.

        • But consider Stokes’s research, and Brinkman’s research (he wrote a great classic paper on the modification of the Stokes drag as a particle integrates into a swarm of particles, check it out! http://dns2.asia.edu.tw/~ysho/YSHO-English/1000%20CE/PDF/App%20Sci%20Res%20Sec%20A1,%2027.pdf ).

          Those were way lower hanging fruit than computational fluid mechanics stuff designed to predict say the noise produced by a certain Turbofan jet engine right? In the sense that a bunch of people had to work to develop software that would solve the equations reliably, and validate that software, and then someone had to design the turbofan, and another person had to model it in some 3D package, and run the simulation, and show that the results are robust in some way, and compare to some other computing package, and bla bla bla, there’s a lot of stuff you’re basing the results on that’s just a lot of grunt work you have to cite because your part in it all is just the tiny part where you actually decided to use all these other people’s bits together to answer some highly specific question about what’s the best way to reduce noise pollution near airports or whatever.

          So, anyway I guess you’re right that an imposed “style” has a lot to do with it but also the inevitable progress in knowledge is going to mean that in any given field, there’s a marginal return on any one person’s input and hence a lot of citation.

        • Truesdell is an interesting example of style.

          At one level, he’s a gifted writer, and you can see that he’s put a great deal of thought into careful choice of words and sentence structure, witty and erudite allusions, piling up elaborate detail on other elaborate details, and so forth. (I’m starting to imitate.) But just as it’s often quotable in the small, it’s wearisome stuff to read in any quantity. It’s too much “look at this style, and how gifted a writer I am, and how clever I am”.

          Strunk and White did have a point….

        • I think it’s a bit egocentric to call the research low hanging fruit. What retrospectively appears like low hanging fruit tomorrow is unimaginable to most people today.

        • I think the act of realizing that some low hanging fruit is out there and plucking it is a dramatic and heroic action on the part of researchers. It’s much harder to do than to write some incremental paper on a minor alternative to the xyz theory of pdq.

          Like that statistician with the baby names that end in “n”: anyone might have found this trend, but not just *anyone* did; only one person did. I don’t think it denigrates a researcher to have their research called “low hanging fruit”. If anyone feels that this is some kind of slight, I apologize for the miscommunication. I simply meant that the results don’t require getting out a bucket arm truck and a crew of thousands to pluck.

          Identifying low hanging fruit is where the biggest individual increments are to be found I think. We should all be so lucky.

  6. I don’t find Andrew’s thesis (“plagiarism as a statistical crime”) very compelling. For one, most people plagiarize work they consider worth plagiarizing, not absolute crap.

    If anything, such plagiarism only makes work sound weaker than it is, but then the author can easily get that same effect by means other than plagiarism (e.g., intentionally writing a bad paper).

    If there were a statistical crime, it’d ironically be the exact opposite of conventional plagiarism: giving credit to someone for stuff he never wrote! Wonder if academic fraudsters have exploited that flaw…

    • Part of our argument is that plagiarism can turn a perfectly good source into crap. Even if it was worth something in its original form, it can do a lot of harm when it circulates without a reference back to its original context.

      The poem we write about is a good case in point. There’s nothing wrong with it as a poem about a “story from the war”. It is the moment it is presented as a prose account of an “incident that happened” that the trouble begins. The critical qualities of the original are elided in the plagiarized version that now circulates in organization studies.

      • I don’t buy that. Plagiarism to me is mainly a case of not giving credit where it’s due. It upsets the incentive / signalling structure of academics and is probably morally / ethically wrong. Note that in legal terms Plagiarism as a concept does not even exist (I think) and that may be a reflection of how little it matters to the larger social good (although we as academics like to pretend that it does).

        I don’t see how plagiarism can poison the well of evidence. In that sense, fraudulent data, doctored experiments, cherry picking, selective publication, or even genuine analysis errors seem like bigger crimes to me. The statistical crime aspect of plagiarism seems a stretch.

        • Well, I can only say that you’re pretty much the reader we’re trying to reach, so I hope you’ll think about it anyway. Plagiarism can be a serious problem even where it’s not illegal.

          Yes, there are bigger “crimes” than plagiarism, of course. But I should mention that the cases of plagiarism I’ve found are often found together with some of the other things you mention. Someone who copies a source without attribution is also likely to misread it. Someone who steals an anecdote, probably also cherry picked it to serve a particular purpose.

  7. What about plagiarism of introductory/motivational material? This seems morally as wrong to me as other plagiarism and yet I can’t quite see the utilitarian need (esp. not statistical) in having all authors rewrite good boilerplate stuff – that they agree with – in “their own words.”

    (One might say that in this case you should cite the original material and trust the reader to access it somehow and read it before continuing, but that’s a bit utopian in today’s world where most cited academic material is essentially inaccessible, especially so with the efficiency that would allow one to interrupt one’s reading to chase down the cited source [*].)

    [*] That’s literally false of course, but who is going to spend half an hour signing up for some site, then spending $30, and then sharing a crapload of personal financial details every single time someone says “for a good background, see [X]”? In this case [X] is _practically_ inaccessible, if all it is being cited for is as background material, for anyone outside of, say, a university. If the original author wants to communicate ideas, that’s a bit silly. So they plagiarize, or else they (from a certain perspective, gratuitously) rewrite in their own words even if they have (and even believe themselves to have) nothing whatsoever to add. No answer here seems good to me.

    • Isn’t this what quotation marks are for?

      One author says “One might say that in this case you should cite the original material and trust the reader to access it somehow and read it before continuing, but that’s a bit utopian…” (bxg 2013)

      Sure it’s a little drastic to copy say 2 or 3 paragraphs with a quote mark around it, but there is also a place for “for a review of the current state of plagiarism blogging see Gelman (2013)” and if they can’t access it, too bad, at least they know where it came from.

  8. Andrew: What’s with this obsession with Edward Wegman? You never lose an opportunity to go after him. In that American Scientist piece you seem to devote waaay too much ink to Wegman, who was kind of an aside to your primary story.

    • Rahul:

      I’m not obsessed with Wegman but I won’t apologize for the fact that I find his case interesting. See the last paragraph here for some discussion of why.

  9. 1) George Mason University excused plagiarism in introductory sections of the Wegman Report, although the text was from areas outside their expertise (paleoclimate, social networks analysis), and parts of the former were even changed to invert expert conclusions. This wasn’t reuse of in-field boilerplate or standard definitions, but the sort of copying that tries to claim credibility for later arguments.

    2) I know of another (not to be named) school that basically said that plagiarism and falsification in introductory sections categorically did not count, only that in sections that reported new research. I don’t know how many other schools have adopted this definition.

    3) For in-field boilerplate, like providing standard definitions, I’m happy if someone writes “we adopt standard definitions from (citation, with section or page #s)” or “adapting the discussion from X,” and if really copying word for word, copies it and quotes it. If the issue was handled well elsewhere, I’m happy to see even a big chunk of text quoted, among other things because if I have seen it before, then I know it is the same and I can just skip it.
    This is like legal agreements, such as nondisclosure agreements. I’ve often wished there were a few standard forms, with fill-in-the-blanks, cross-outs, adding … rather than pages of text that have to be read to make sure they are actually vanilla.

    • Citations are meant for two purposes: to give attribution and to give background reference for people who want to dig deeper. I like (3), where we might say “Greene (1993) provides an excellent introduction to this topic. In the remainder of this section, we adopt his terminology and notation, and present background material on this topic.” I think if you do that, then no plagiarism can be claimed, even if the high-school definition (five or more words taken from a source without quoting) occurs.

      • This is somewhat of a grey zone, apparently, and some people might quibble about the details. I prefer quotes at some point, but I’d certainly never file a complaint against that “Greene …”, as this style of wording, with a clear, explicit citation *at the beginning*, where it can’t be missed, is clear evidence the author is trying to do the right thing, whether or not they go far enough for everyone. I’d be less happy, if there was a vague citation at the end of the text, a style I’ve seen, especially when falsification is mixed into the plagiarism, as per this example, i.e., 1)

  10. You mention my soapbox issue: scientists who collect data and then, because “knowledge is power,” refuse to share the data or neglect to be clear on their methods, lest someone prove them wrong. Unlike plagiarism, which may well happen near the bottom of the heap or mid-heap, data-hoarding is more often done by those who already have a reputation, in a kind of appeal to (self-)authority.

    • I don’t know about that. The examples in our essay, and those in Andrew’s P.S., all involve people with strong reputations. Their dismissal of the charges (and the way their peers defend them) certainly relies on appeals to authority.

      Plagiarism is like data hoarding in the sense that the author claims to know something but does not reveal the basis on which she knows it. (Like when Kearns Goodwin failed to cite McTaggart, which was actually a kind of conspiracy to keep readers in the dark.)

  11. Thanks for a nice article; I agree that destruction (or worse: deliberate distortion) of information provenance is a serious intellectual crime. Somehow scientific work is not so much about communicating results, but more communicating the basis for results.

    Can I get a definition (or statement of scope) of the “Zombies” tag? It makes me laugh!
