Taking Data Journalism Seriously

This is a bit of a followup to our recent review of “Everybody Lies.”

While writing the review I searched the blog for mentions of Seth Stephens-Davidowitz, and I came across this post from last year, concerning a claim made by author J. D. Vance that “the middle part of America is more religious than the South.” This was a claim that stunned me, given that I’d seen some of the statistics on the topic, and it turned out that Vance had been mistaken: he’d used some unadjusted numbers which were not directly comparable when looking at different regions of the country. It was an interesting statistical example, and also interesting in that claims made in data journalism, just like claims made in academic research, can get all sorts of uncritical publicity. People just trust the numbers, which makes sense in that it takes some combination of effort, subject-matter knowledge, and technical expertise to dig deeper and figure out what’s really going on.

How should we think about data journalism, an endeavor which might be characterized as “informal social science”?

Data journalism is a thing, it’s out there, and maybe it needs to be evaluated by the same standards as we evaluate published scholarly research. For example, this exercise in noise mining—a study on college basketball that appeared in the New York Times—is as bad as this Psychological Science paper on sports team performance. And then there’s data journalism done by academic researchers on holiday, as it were; wacky things like this. When I do data journalism I think it’s of the same high quality as my published work (except that it’s more likely to have some mistakes because it gets posted right away and hasn’t had the benefit of reviews), but I get the impression that other academics have different standards for newspaper articles and blog posts than for scholarly articles. One thing I like about Stephens-Davidowitz’s book is that it mixes results from different sources without privileging PPNAS or whatever.

Anyway, I don’t currently have any big picture regarding data journalism. I just think it’s important; it’s different from the sorts of social science research done in academia, business, and government; and we should be taking it seriously.

P.S. According to Wikipedia, J. D. Vance (author of the mistaken quote above about religiosity) is an “author and venture capitalist,” which connects us to another theme, that of silly statistics from clueless rich guys, of which my favorite remains this credulity-straining graph of “percentage of slaves or serfs in the world” from rich person Peter Diamandis. Wealthy people have no monopoly on foolishness, of course. But when a rich guy does believe passionately in some error, he might well have the connections to promulgate it widely. Henry Ford and Ron Unz come to mind.


  1. Mike Spagat says:

    Standards for data journalism are a great topic that seems, as far as I can see, to have hardly been addressed.

    I would suggest one principle which, surprisingly, doesn’t seem to be followed. When you write journalistic articles about a dataset you should have the dataset you are writing about.

    I have a recent experience of requesting some surveys of public opinion in Iraq that the BBC had co-sponsored and written a number of articles about. I requested the data from the BBC and was told that the BBC does not have and never did have these datasets. They had co-sponsored the surveys with ABC News and had just written what ABC told them to write.

    (ABC news ignored me when I wrote to them.)

    Somehow I believe that the BBC applies higher standards to non-data news stories. Sure, the BBC might report that, according to stories in the Washington Post which they can’t verify, Trump gave away secrets to the Russians. But they wouldn’t say that the BBC has learned that Trump gave away secrets to the Russians unless they could line up reliable sources verifying this claim.

    But with data they are OK with publishing tables from a “BBC Survey” without being in a position to check on the tables if they are questioned.

    That was just a short version of a very long story that I’ll be writing up more fully on my blog soon but it seems relevant to your post.

    Look at it this way. Andrew is wondering about a gap between standards for academic articles and standards for data journalism articles. But there is also a gap between standards for regular journalism and standards for data journalism. It seems that, at least at the BBC, there is no expectation that they should be able to justify data journalism articles if challenged.

    Reading this through I realize that I was harsher on the BBC than I was on ABC and this may be unfair. At least the BBC responded to me. ABC just ignored my data request.

  2. John Taylor says:

    Andrew, sidetracking a little bit (although slightly relevant for the discussion): Pinker’s book about violence is in the news again. However, as you are certainly aware, Taleb says (with his usual language, but now also in a paper) that the argument is wrong. Do you have a take on this discussion?

  3. Dan N says:

    As someone who considers himself a practitioner of “data journalism”, could you clarify what you consider to be data journalism? It’s worth pointing out that it’s a controversial term even within the community of technical/stats/programming-minded journalists. From David Leonhardt, who was the first editor of NYT’s The Upshot (i.e. the replacement for 538):

    > And you know what the formal name for 1,000 anecdotes is, right? A statistic. “Statistics” and “data” are really just a plural form of “fact.”…Data journalism, ultimately, has the same aim as “quote journalism” and “anecdote journalism.” They all aspire to be “fact journalism” or, more eloquently, journalism.

    Nate Silver, often mistaken as the Lord Emperor of Data Journalism, also dislikes the term, preferring “empirical journalism” instead:

    (As a sidenote, usually when people bash Nate Silver and data journalism writ large, they’re conflating what he and 538 do with “forecasting” journalism. I think Silver can defend himself on that charge.)

    “Data journalism” is a dumb phrase because it is not to journalism what “data science” is to science. Data science is understood to be a field focused on scientific methods and data practices as applied to science. The existence of the phrase “data science” does not imply that there is science being done without data. But with data journalism, many “traditional” journalists think of it as journalism that uses data — not journalism *about* data.

    The more old-fashioned term for the work of “empirical journalism” that Silver refers to is “computer-assisted reporting”, which sounds hokey but is far more specific than data journalism, as it implies a set of reporting/knowledge problems that require computational solutions (i.e. not “computer-assisted”, as in, “I need to use a word processor to write my story”). This CAR journalism has been going on for decades and still powers Pulitzer stories today:

    • Andrew says:


      To answer your question about what is data journalism, see here, where I call it “quantitative journalism,” which is perhaps a better name. I am indeed talking about journalism that uses data, not journalism about data. (And I’m neither bashing Nate Silver nor considering him as a Lord Emperor, so on both counts I think you’re talking with someone who’s not me.)

      Also, this is slightly a different topic, but (a) I’m not a big fan of the term “data science,” and (b) despite what you seem to be implying, yes there is science being done without data. Consider, for example, the theory of relativity, which does not use data and was motivated only somewhat by data; it’s my understanding that the larger motivation was inconsistencies in existing theory.

      • David Chorlian says:

        “theory of relativity … it’s my understanding that the larger motivation was inconsistencies in existing theory.”
        Just a slight change, but the key phrase in the first paragraph of Einstein’s 1905 paper is “… asymmetries which do not appear to be inherent in the phenomena.” This is remedied in the second paragraph by the principle (of relativity): “the phenomena of electrodynamics as well as of mechanics possess no properties corresponding to the idea of absolute rest.” Once the consequences of this are realized, the asymmetries disappear, leaving electromagnetic theory intact.

  4. Rahul says:

    >>>But when a rich guy does believe passionately in some error, he might well have the connections to promulgate it widely. Henry Ford and Ron Unz come to mind.<<<

    Isn't that like saying that the greatest schools are typically small schools? Well, the crappiest ones are too probably.

    • Jonathan (another one) says:

      Yeah… If only there were some statistical technique that would allow one to partially pool very small high variance data sets (billionaires) with somewhat larger and less variable datasets (millionaires) and even larger datasets (the middle class) maybe we could learn something…
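The technique being alluded to here is partial pooling, as in multilevel modeling: estimates for small, noisy groups get shrunk toward the overall mean more strongly than estimates for large groups. As an illustration only—the group names and the variance settings below are invented for the example, not taken from any dataset in the discussion—here is a minimal empirical-Bayes sketch in Python:

```python
def partial_pool(groups, sigma2=1.0, tau2=1.0):
    """Shrink each group's sample mean toward the grand mean.

    groups: dict mapping group name -> list of observations
    sigma2: assumed within-group variance
    tau2:   assumed between-group variance
    """
    all_obs = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_obs) / len(all_obs)
    pooled = {}
    for name, ys in groups.items():
        n = len(ys)
        ybar = sum(ys) / n
        # precision-weighted compromise between the group mean and the
        # grand mean: small n -> small weight w -> more shrinkage
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)
        pooled[name] = w * ybar + (1 - w) * grand_mean
    return pooled

# Invented data: one very small high-variance group and two larger ones.
estimates = partial_pool({
    "billionaires": [5.0],
    "millionaires": [1.0, 2.0, 3.0],
    "middle_class": [0.0, 1.0, 1.0, 2.0, 1.0],
})
```

The single-observation group’s raw mean of 5.0 gets pulled substantially toward the grand mean, while the five-observation group barely moves—which is exactly the point of the joke: partial pooling keeps a handful of extreme billionaires from dominating the estimate.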

  5. Alberto says:

    “Henry Ford and Ron Unz come to mind.”

    Another guy comes to mind. He sits in the White House.

    Perhaps the halo effect plays a part, too. If you made a lot of money, you clearly understand your business. So people believe you understand the world in general. But knowing how to make money in, say, construction does not give you any particular insight about, say, the trade deficit.

  6. Trevor Butterworth says:

    We ran a panel on data journalism at last year’s JSM. Mark Hansen, Alberto Cairo, Carl Bialik, Rebecca Goldin and Regina Nuzzo presented. Regina made the following points: Data journalists need to A. Formally involve statisticians. B. Develop a code of ethics. C. Have peer review and transparency. There is also a need for stats-journalism research collaborations, and data journalism should become an academic specialization.
