The new quantitative journalism

The first of the breed was Bill James.

But now we have a bunch: Felix Salmon, Nate Silver, Amanda Cox, Carl Bialik, . . . .

I put them in a different category than traditional science journalists such as Malcolm Gladwell, Gina Kolata, and Stephen Dubner, who are invested in the “scientist as hero” story, or modern science journalists such as Susan Perry, Julie Reymeyer, Ed Yong, and Regina Nuzzo, who engage in old-school reporting of science but with a more inquisitive, skeptical bent.

OK, so here’s the real question I want to ask: why is this all happening now, during a time when the economy of journalism is collapsing? Why were there no skeptical, investigative, quantitative journalists decades ago? Great journalism is not new, but quantitative journalism seems like a new development. Why wasn’t it a thing 30, 40, or 50 years ago? Is it just that statistical skills have become more widespread, which (a) made it more likely that some journalists would have access to these tools, and (b) created a broad audience for this sort of material? I don’t know.

30 thoughts on “The new quantitative journalism”

  1. The fact that web publishing and graphical design software, together with statistical computing platforms, make it easy to create and disseminate compelling ways to visualize quantitative data. Bill James’ newsletters had tables of numbers and an audience of sports-stats idiot savants willing to wade through them. What has changed is that technology enables supply to meet the demand from a large audience who want direct access to the insights that can be gleaned from a quantitative understanding of the world, but lack the ability/patience to extract them from tables of numbers or blurry static charts.

    • I’m a huge Bill James fan, but years before I was reading Bill James, I was reading Dan Seligman’s “Keeping Up” column in Fortune, which was a major magazine in the 1970s. This was pretty much the forerunner of the statistical / social science blog.

      • Something that’s very important for data journalists is to figure out what would get you in trouble if you were honest about it. For example, during their baseball statistics careers, both Bill James and Nate Silver shied away from analyzing the biggest baseball statistics phenomenon of their era, steroids.

        In contrast, Daniel Seligman waded right into the most important social science issues of the 1970s-1990s, such as crime rates and IQ. He wrote an excellent introduction to the IQ controversy, “A Question of Intelligence,” in the early 1990s but had a terrible problem getting it published and reviewed fairly. And Seligman was a Manhattan magazine journalism insider.

  2. I have two theories:
    – more widespread access to data, particularly data in the public interest, and in many cases guaranteed by freedom of information or open access requirements, means that the boring bit of this work (collecting the data) is easier than it used to be.
    – the rise of scientism and “evidence-based” policy discussions in the technocratic class means this work is more prestigious than it used to be, to the extent that these people are becoming public intellectuals, for better or worse.

  3. I’d say it has to do with the increased availability of datasets to base quantitative stories on. 40 years ago it’d probably have been a lot harder to keep the cadence of publishing, simply because a lot more effort had to go into gathering data for any given story.

  4. Hi, Andrew. I’ve been a “data journalist,” variously defined, for roughly the period during which your list of modern luminaries gained prominence (approx. the last six years). Here’s an answer which I think most of my colleagues would largely agree with:

    There is actually a robust history of quantitative journalism. This ranges from 19th century experiments with news graphics to mid-century investigations by leaders such as Philip Meyer. However, it has rapidly accelerated over the last decade, precisely *because* of the media industry’s financial calamity. Here’s a rough timeline:

    * The internet begins to eat media.
    * The traditional revenue model collapses.
    * Desperate and willing to try anything, legacy media organizations decide to try technology, in a lot of different forms. (Digital ads, subscriptions, ebooks, apps, blogs, “hyperlocal”, new story forms, etc., etc.)
    * From some of those efforts emerge the first technology teams in newsrooms. Many of these were formed in large part by journalism outsiders who got pulled into the news industry because they wanted to use technology to do something meaningful. In some cases the people who were building the CMS just started doing journalism, because it turns out that’s a lot more fun.
    * In a sort of halting fast-stumble, journalists and technologists crash into each other and realize they actually have a hell of a lot to offer each other. (This process is ongoing.)
    * The technology injection shows no sign of saving journalism in any of the predicted ways.
    * But the side benefit of infusing newsrooms with a dose of technology-minded folks remains.

    Of course, eventually this would have happened anyway. The tools for doing quantitative analysis are simpler and more accessible than ever before. However, without the failure of the traditional business model I think it’s unlikely newspapers would have had the wherewithal to so quickly hire so many folks who didn’t fit their expectations for what a “reporter” is.

    • Interesting! The same is probably true of X other sorts of journalism/entertainment that have gained prominence with the decline of newspapers/traditional broadcast television. Andrew’s question just as easily might have been “why has this genre of Norwegian women whispering into microphones taken off?” or “Why are listicles so popular?”

      Lower barriers to entry + fragmented audience means we get more of _everything_.

      • Dan Carlin, popular historian and author of the “Hardcore History” podcast, did a nice TEDx talk last year that follows this sentiment. He provides an interesting perspective from his background in broadcast journalism.

        You can probably find it by googling “The New Media’s coming of age, Dan Carlin” or something thereabouts.

  5. > Why were there no skeptical, investigative, quantitative journalists decades ago?
    The editors, being in firm control of the media platforms, did not want them?

  6. I think that the advent of CAR (computer-assisted reporting) in the late 70s and early 80s basically mirrors the profusion of inexpensive computing hardware and software.

    But I also think that there are many examples of quantitative journalism over the last half century. For example:

    * Philip Meyer won a Pulitzer in 1968 for quantitative survey analysis of people who rioted in the 1967 riots in Detroit.
    * Bill Dedman won a Pulitzer in 1989 for “The Color of Money,” a ~25 story series on racially biased lending in Atlanta that relied almost entirely on a database of loans he analyzed.
    * Steve Doig won a Pulitzer in 1992 for an engineering analysis / survey / campaign finance investigation of home builders in Miami after Hurricane Andrew that was, again, almost entirely database-backed.
    * Sarah Cohen won a Pulitzer in 2002 at the Washington Post for analyzing a series of documents about DC family services.
    * In 2009, PolitiFact won a Pulitzer for both a series of highly quantitative stories and a web-facing database.

    And those are just the ones that won Pulitzers; I’m probably missing several.

  7. I think Nate Silver’s original fivethirtyeight blogs of 2008 deserve a lot of credit. But this may simply have been the Nirvana moment of quantitative journalism, that is, the first breakout success with lots of mainstream attention being paid. Nate himself was responding to the Meat Puppets in the more naturally statistical field of sports journalism. The popularity of the World Series of Poker in the early Aughts is one representative example, Moneyball another. But after Nate “called” 49 out of 50 states, I think folks really started paying attention. Nevermind that the vast majority of journalists clearly have limited understanding of the concept of probability if they think a forecast with odds on the order of 2:1 is equivalent to “calling” for a win.

    • Interesting.

      The first mention I came across of the failure to pool political polls being curious, if not outright dumb, was in Jeff Rosenthal’s 2005 popular book Struck by Lightning: The Curious World of Probabilities – see http://probability.ca/sbl/ and the section on public opinion polls and margins of error.

      Now, Sam Wang at Princeton started aggregating US Presidential polls using probabilistic methods in 2004, and in 2012 he apparently correctly predicted the presidential vote outcome in 49 of 50 states and even the popular vote outcome of Barack Obama’s 51.1% to Mitt Romney’s 48.9% – see https://en.wikipedia.org/wiki/Sam_Wang_(neuroscientist)

      I think it’s clear Nate got 80 to 95% of the publicity for calling 49 out of 50 states.

      OK, so here’s the real question I want to ask: why (how) did Nate get all the publicity?
      (Not in any way suggesting that was unfair, but it somehow happened.)

      • > OK, so here’s the real question I want to ask: why (how) did Nate get all the publicity?

        I’m going to extend the music metaphor. Considering the audience on this blog, your question is akin to a bunch of audiophiles musing about why Nevermind was the chart-topper of 1991 and a cultural touchstone, and not the Pixies’ Doolittle of 1989.

        Using that metaphor makes it clear the question is somewhat unanswerable, or maybe the answer is somewhat ineffable. But we can definitely point to some trends. The web and the immediacy of information are obvious big ones here. But at the same time the medium is not the message (all apologies to any closet McLuhanians here). There is also clearly a reciprocal relationship, in that Barack Obama’s campaign, which Nate got famous for covering, was itself famous for its own internal embrace of data and data science.

        One of the compelling aspects of Nate’s old blogs was his frank discussion of the model, combined with some very good graphics. But I think the thing that was most compelling to me back in the day was something perfectly tuned to the moment and the media, and also something very Bayesian: model updating. Nate had a coherent framework for including new information and updating posterior probabilities that was presented very attractively. This made for a refresh-happy narrative that was more compelling than taking the latest horse-race snapshot of the polls. It kept you coming back. It wasn’t just a single story which used elements of data and statistical modelling; it was a binge-worthy touchstone to check every morning (and at each coffee break). It was serialized in a way that made for a better overall narrative of the campaign than the more staccato news cycle of gaffes and controversies.

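        To make the model-updating idea concrete, here is a toy sketch in Python of a normal-normal Bayesian update of a candidate’s vote share as each new poll arrives. The prior, the polls, and their standard errors are invented for illustration; Nate’s actual model was of course far richer than this.

          # Toy sketch: each day's posterior becomes the next day's prior as new polls arrive.
          prior_mean, prior_var = 0.50, 0.03 ** 2              # assumed prior: 50% +/- 3 points
          polls = [(0.52, 0.02), (0.51, 0.015), (0.53, 0.02)]  # (poll mean, standard error), made up

          for poll_mean, se in polls:
              poll_var = se ** 2
              # standard normal-normal conjugate update: a precision-weighted average
              post_var = 1 / (1 / prior_var + 1 / poll_var)
              post_mean = post_var * (prior_mean / prior_var + poll_mean / poll_var)
              prior_mean, prior_var = post_mean, post_var
              print(f"after poll: {post_mean:.3f} +/- {post_var ** 0.5:.3f}")

        Each refresh tightens the posterior a little, which is exactly the keep-coming-back dynamic described above.
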
        Then again, maybe it just smelled like teen spirit.

        • Thanks.

          When I was doing an MBA many years ago someone asked why some people make a lot more money than others in marketing – the senior prof’s answer was that they were “just fat, balding, cigar smoking old men who seemed to be able to hit upon what sells”.

          However, “presented very attractively” makes sense, as Nate somehow got much more publicity.

          Also, a dissenting prof argued that it was unknown how to judge real talent in the area of marketing, and so those who could project a sense of having talent (perhaps from past accidental successes) were being very highly overpaid.

      • Keith:

        Political scientists have been aggregating polls forever, certainly well before 2005, and we’ve been frustrated for a long time that journalists were just reporting one poll at a time. Nate definitely gets some credit for creating a market for something that we’d been doing for a while but not in such a focused and careful way.
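
        As a toy illustration of why aggregating beats reporting one poll at a time, here is a minimal inverse-variance-weighted average in Python (made-up numbers, and far simpler than anything a political scientist would actually publish):

          # Precision-weighted average of several polls; the combined standard error
          # is smaller than any single poll's.
          polls = [(0.52, 0.03), (0.49, 0.025), (0.51, 0.02)]  # (estimate, standard error), made up

          weights = [1 / se ** 2 for _, se in polls]           # inverse-variance weights
          combined = sum(w * est for (est, _), w in zip(polls, weights)) / sum(weights)
          combined_se = (1 / sum(weights)) ** 0.5

          print(f"aggregate: {combined:.3f} +/- {combined_se:.3f}")  # s.e. about 0.014 here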

  8. I attribute it in part to the confluence of the legacy of Harry Roberts (U of Ch), cheap data and computing software, and corporate advertising for the “enterprise” software packages:
    – Harry Roberts’ campaign to be more authentic about teaching statistics in college (as well as your own): don’t act like card questions are “easy” since they are essentially combinatorics, which isn’t easy. Don’t teach algebra, teach statistical thinking, etc. I think this has had a cumulative, though slow, impact on students’ and then the public’s comfort with quantitative summaries.
    – data is available on the internet for free and cheap software provides easier and more informative graphics than clip-art retreads
    – Big software giants advertise “predictive analytics” like it is a branch of clairvoyance and the general public has picked this up.
    Finally, J schools are worried about their (and their industry’s) future, so they are obsessive about spotting trends and getting on them quickly. The J school at our university has, in the past three years, hired four ‘data scientist’ faculty and acquired the SABRmetrics group and brought them all in-house as well as partnering with other colleges on text-search type projects, jointly-taught quantitative classes, etc.

  9. I agree this is a Web-related phenomenon… without the Internet this sort of journalism wouldn’t have arisen so rapidly, if at all. Traditionally, writers like Yong and Nuzzo would have taken two decades to ‘pay their dues’ and ascend through the ranks & hierarchy, but with persistence and social media, quality can now quickly find a significant following.

  10. Most of the comments focus on the supply side, but to go into more detail on the demand side: there are a lot more college graduates in the population than 50 years ago, and a lot more of them have had at least minimal exposure to statistics, because the various social sciences, probably even all the sciences, have become more statistical. That creates more of an audience, which can support more suppliers like Gladwell, Silver, etc.

  11. As recently as twenty years ago it was very hard to get data. Thirty years ago, even if you had data it was hard to analyze them and make nice graphics. It’s now possible to do in a week what would have taken several months to do previously. I think a quantitative journalist can be more than 10x as productive, per person-hour, as they could 35 years ago. For a conventional journalist maybe it’s a factor of 2, if that.

    • Bill James is an excellent example of this. His early work was painstakingly performed without computers, as far as I could tell. Hard to believe there’d be very many people who could combine (a) this sort of approach to data; (b) a genuinely curious mind, as opposed to someone who wanted to “prove” something; and (c) a gift for explaining the result to someone untrained. The computing world and the availability of Internet data fixed (a), and while (b) and (c) are still in short supply, the relaxation of the (a) constraint should have greatly increased the supply side of the market.

  12. What exactly is the new data journalism? Newspapers have been publishing opinion polling for over a century, and as Chris notes above, various other forms of data have been in the news even longer.

    Like Chris, I think the “new” data journalism (which mostly consists of journalists analyzing data collected by others) is a response to declining news revenue, though I imagine the pathway is rather different.

    Data journalism is dirt cheap – all it costs is the salary of the data journalist. You don’t see very many news organizations putting substantial resources into actual data collection (e.g., despite the boom in poll aggregation, the number of polls sponsored by news organizations is in decline). Traditional journalism is a very expensive and time-consuming proposition, and the news organization pays for both the “collection” (sending reporters to events, doing interviews, filing public records requests) and the analysis. In data journalism, other organizations shoulder all of the collection burden (just look at all the polls released freely online; government data; sports data usually produced by the leagues), so the news organization only has to pay for the analysis.

  13. Would echo Chris and Jeremy’s comments, and would add that this area of journalism has been mostly avoided by journalism education since it appeared, with a few notable exceptions. Universities mostly don’t teach it, those who do rely mainly on adjuncts, and the leadership of newsrooms has been – until very recently – almost universally populated by people with little understanding or regard for the skills required.

    So yes, skills are more widespread, financial incentives make it easier and cheaper to experiment, and the Web makes for a great canvas to work on. But newsrooms found it, early on, an expensive and exotic genre of reporting, and most chose not to invest in it while they were still earning plenty of money doing what they had done for years.

  14. I think most everyone has already touched on the likely explanations; I just want to point out that there’s a really bright future for this quant journalism that is in many ways just beginning. Today The Upshot already shares its source code on GitHub; in some ways journalism is moving toward replicable research far faster than academia.

    The cost-driven race to the bottom in journalism has really taken a toll on the credibility and reliability of the media in the eyes of the public. Publishing source code and clean links to data sources gives unprecedented transparency into the journalistic process. Personally, I hope that this sort of attitude infects the sourcing policy of the main newsroom. Given the cheapness of storage and bandwidth, there seems no excuse for news outlets not to provide full transcripts of events and interviews behind a link, rather than constantly providing small quotes stripped of context.

    Not sure how much interest there is here, but wanted to link an article about an ongoing project (The Gamma) looking to build tools in support of open, reproducible and interactive data-driven journalism. http://tomasp.net/blog/2016/thegamma-olympic-medalists/

    I hate linking things in comments, because it sounds like shilling, so I just wanted to clarify that as far as I know, nothing is for sale here, it’s just a project funded by DNI. “The Digital News Initiative (DNI) is a collaboration between Google and news publishers in Europe to support high quality journalism and encourage a more sustainable news ecosystem through technology and innovation.”

  15. 1) Ever read some of Bill James’ early work? It was typed. I have a wonderful paper on bootstrapping in which the equations were written out by hand, though they were eventually typeset in a journal. That was a) hard and b) hard to disseminate. It’s like the difference between Strat-o-Matic and using MLB’s actual pitch and hit speed/curve/location, etc. data to figure out UZR et al. I remember when a company I worked at developed a spreadsheet for evaluating future yield projections given various GNMA and real estate portfolios; it took 45 minutes to load on an x286, then loaded in a few minutes on an x386, and now would be trivial to model, to load, to graph, etc., partly because of processor power, memory management, etc. but also because of APIs and apps. I can do stuff like that on my freaking phone.

    2) You get attention. Nate Silver has become a brand name, and the words “Nate Silver” are used to describe a style and treated as a standard of excellence (though I think it’s pretty clear other poll analysis has performed better). Most people want attention.

    3) For the industry, the main way of selling news has always been that you have a special take or an exclusive or you have it first, and in a connected world you aren’t going to have it first for long – and it will be copied in a minute, usually without attribution. Now you can present cheap exclusive material which isn’t as easy to use without attribution. (As a note, when Twitter started, news outlets bent over backwards to say x had it first, even if that was 10 minutes faster. And they found themselves misused as people released all sorts of fake stories, or stories that were likely to be true but hadn’t happened quite yet, so they began to ignore who had it first by 10 minutes, and now you really don’t see that anymore.)

  16. A lot of plausible contributing factors mentioned here. Another is perhaps at a meta-level: that several factors (technology, larger population, economic fluctuations, changing job markets, …) have led to more and more people following non-traditional job paths, and even creating their own jobs — which is what Silver essentially did.

  17. When it comes to quantitative journalism in the medical/health field, I once again recommend

    healthnewsreview.org

    For example, the latest posting discusses the “astroturf” aspect of the EpiPen price gouging. This latest posting also contains this quantitative statement about osteoporosis medications:

    “The HealthDay story subheadline reads: ‘Abaloparatide appears to reduce fractures better than the current drug Forteo, researchers say.'”

    “But in fact the researchers never said any such thing and explicitly warned against making such a comparison in the study itself.”

    “Comparison of abaloparatide vs teriparatide [Forteo] for the primary efficacy end point was not part of the study objectives because the study would have required a sample size of approximately 22 000 per treatment group to provide 90% power to detect the treatment difference between abaloparatide (observed rate, 0.58%) and teriparatide (observed rate, 0.84%) based on our study results.”
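
    For anyone who wants to sanity-check that quoted sample-size figure, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The 0.58% and 0.84% rates and the 90% power come from the quote; the two-sided alpha of 0.05 is an assumption on my part.

      # Rough per-group sample size to detect 0.58% vs 0.84% fracture rates
      # with 90% power and an assumed two-sided alpha of 0.05.
      from statistics import NormalDist

      p1, p2 = 0.0058, 0.0084
      z_a = NormalDist().inv_cdf(1 - 0.05 / 2)   # ~1.96
      z_b = NormalDist().inv_cdf(0.90)           # ~1.28
      p_bar = (p1 + p2) / 2

      n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5 +
            z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2) / (p1 - p2) ** 2
      print(round(n))                            # comes out around 22,000 per group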
