Do we trust these data on political news consumption?

Mark Palko writes:

The Monkey Cage just retweeted this, but some of the numbers look funny.

“This” is a paper, “The Myth of Partisan Selective Exposure: A Portrait of the Online Political News Audience,” by Jacob Nelson and James Webster, who write:

We explore observed online audience behavior data to present a portrait of the actual online political news audience. We find that this audience frequently navigates to news sites from Facebook, and that it congregates among a few popular, well-known political news sites. We also find that political news sites comprise ideologically diverse audiences, and that they share audiences with nearly all smaller, more ideologically extreme outlets. Our results call into question the strength of the so-called red/blue divide in actual web use.

But Palko is skeptical, in particular pointing to the graph below:

Palko writes:

Yes, the claims are somewhat counterintuitive, but that wouldn’t bother me so much if the rest of the data didn’t look so screwy. In particular, the average total minutes per visitor per month graph strongly suggests some kind of math or data collection error. It’s not just the size of the spikes at Drudge and Google News search (though those are troubling); it’s their placement.

If we were seeing huge spikes for sites with relatively small but loyal readerships (particularly with longform content), that would be understandable, but this graph shows pretty much the opposite. Large, heavy traffic sites inevitably have a significant portion of casual users and one-time drop-ins. These ought to drag down the average. This makes me suspicious of the Drudge Report numbers (Drudge has huge traffic) and even more so of Google News (surely there are a lot of people who very occasionally spend 30 seconds to a minute looking for stories on some topic).

Drilling down a bit further, the fifth highest average time goes to Bloomberg, another high traffic site that must have a significant portion of casual readers. Compare that to Breitbart, a site known for fanatical followers. I would expect the average Breitbart reader to have a much higher monthly total than the average Bloomberg reader. Likewise the Hill, Mother Jones, Vox and the New Yorker.

I exchanged some tweets with one of the authors. He conceded that the Drudge numbers look strange, but said they had great confidence in their data, and offered to discuss the paper further. He hasn’t replied to any of my tweets since then.

I feel like I’m missing the obvious here, but these numbers seem to run completely counter to what I’d expect.
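
To make the arithmetic behind Palko’s point concrete, here’s a toy calculation (all numbers invented, nothing from the paper) showing how casual drop-ins drag down a site’s average total minutes per visitor per month:

```python
# Toy illustration: on a high-traffic site, casual drop-ins dominate
# the visitor count and drag the average way down. All numbers are
# invented for illustration only.

def avg_minutes_per_visitor(n_loyal, loyal_min, n_casual, casual_min):
    """Average total minutes per visitor per month for a two-group audience."""
    return (n_loyal * loyal_min + n_casual * casual_min) / (n_loyal + n_casual)

# Small site with a devoted readership: loyal readers dominate.
small_site = avg_minutes_per_visitor(100_000, 120, 20_000, 2)       # ~100 min

# Huge site where one-time drop-ins swamp the same loyal core.
big_site = avg_minutes_per_visitor(1_000_000, 120, 20_000_000, 2)   # ~7.6 min

print(f"small loyal site: {small_site:.1f} min/visitor/month")
print(f"big mixed site:   {big_site:.1f} min/visitor/month")
```

The loyal core reads just as much in both cases; the big site’s average is more than an order of magnitude lower simply because drop-ins dominate the denominator. That is why huge spikes at high-traffic sites such as Drudge and Google News are surprising.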

I think the snappy nature of twitter makes it difficult to have a good discussion about something such as data quality, so it’s better to have the discussion here on the blog.

In discussing this example, let me emphasize that I’ve not looked at these data at all, and Palko also wanted to say that he does not have expertise here; he’s just a casual observer doing a quick plausibility check of the data.

So I’m interested to hear the story, and I hope that a clarification of the quality of these data can be helpful in moving forward our understanding of the important topic of the consumption of political news.

6 thoughts on “Do we trust these data on political news consumption?”

  1. Looking at the paper, I noticed a couple of other weird things.

    In figure 1, there are more bars than website titles listed at the bottom. In addition, it appears to be missing major news outlets discussed in the paper, like the New York Times.

    Also, figure 2 shows no variation from Buzzfeed to Breitbart, which seems a little suspicious. MSNBC has 36% conservative to Breitbart’s 30%. In fact, they have more liberals on Breitbart than on around a third of the sites, and the same share as the New York Times. In addition, they classify only about 15% of news consumers and internet users as liberal, which seems a little strange. These baseline numbers seem to be off by quite a bit, as they have about half of people as moderate, as opposed to a more typical breakdown.

    The Drudge Report numbers don’t seem to pass a smell test in figure 3. They suggest that a Drudge Report visitor spends 5 hours on the website, which, assuming it is per day or per visit, is really weird, as discussed above. They fail to define what period the minutes in figure 3 are measured over, making the numbers somewhat meaningless.

    Figure 5 has the same issue as figure 1: it does not have labels on every bar, and it is missing some common news sites discussed in the paper and in figure 2, like the New York Times.

    Figure 6 has the same issues as figures 1 and 5.

    Also, figure 7 does not reveal the source of news audiences for 73% of links to news sites.

    There appear to be significant errors in the dataset used in this paper or in its presentation. Maybe someone smarter than me can figure out what went wrong here.

  2. As an engineer who has worked on large websites for longer than comScore has been around, I feel uncomfortable with Nelson and Webster’s use of comScore data. comScore data just isn’t very good. It often diverges from a website’s own logging data, sometimes dramatically and inexplicably. A company I worked for even developed its own much smaller (thousand-person) panel, similar to comScore’s, to try to figure out what was going on. Internal logs could be reconciled with the internal panel, but comScore’s panel data didn’t match anything.

    That doesn’t mean the comScore data is useless. It’s fine if you want to target ads or compare the traffic of two similar sites. If Nelson and Webster feel they processed the data correctly, I have no reason to doubt them. But if they have “great confidence” that the comScore data reflects reality, their confidence is misplaced.

    That said, the data look fine to me. Google News and Drudge are both “front pages” to news sites, and people keep them open for long stretches of time. They have very few “article pages” hosted on their own site. The audience of a site is a union of the visitors who use the site as a front page that is left open, and the visitors who come from social media and search engines to read article pages.

    Most sites in the graph have an audience dominated by the latter, visitors who come to read single articles and leave. GNews and Drudge have an audience dominated by the former, visitors who leave them open for long periods of time.

    This raises the question: why are GNews and Drudge plotted in the same graph as the others, destroying the scale?

    But there’s a deeper question too: since these statistics aggregate very dissimilar behaviors (front page vs. single article), and the statistics probably can’t tell the difference between reading the sports section and the politics section, how much should we believe the paper’s contention that “partisan selective exposure is not occurring” online?

    Incidentally, I think the least credible graph in the paper is Figure 7. Facebook+Google+Twitter, plus navigating to the site directly, only account for 25% of all origins. What, pray tell, is the other 75%?

    Having decided to use comScore data, Nelson and Webster didn’t make substantial errors that I can see. Yet this sort of aggregate data can’t really refute studies showing differences in individual-level data. I wish some researchers would create their own panel, with a little browser plugin that tracks click behavior (with consent from their subjects, of course). It wouldn’t be too difficult, it would overcome the self-reporting and limited-choice criticisms Nelson and Webster have of previous work, and it would avoid the comScore data quality issues and the loss of information from aggregation and averaging.
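
    A minimal sketch (hypothetical log format, made-up outlet names and data) of the kind of individual-level tabulation such a panel would enable, which aggregate audience-overlap numbers can’t recover:

    ```python
    # Sketch of individual-level analysis from a DIY click-tracking panel.
    # The log format, outlet names, and ideology labels are all hypothetical.
    from collections import defaultdict

    # Hypothetical click log: (panelist_id, site, ideology_of_site)
    clicks = [
        ("u1", "outletA", "liberal"),
        ("u1", "outletA", "liberal"),
        ("u1", "outletB", "conservative"),
        ("u2", "outletB", "conservative"),
        ("u2", "outletC", "conservative"),
    ]

    # Per-panelist share of visits going to conservative outlets -- the
    # individual-level quantity that site-level averages can't recover.
    counts = defaultdict(lambda: {"conservative": 0, "total": 0})
    for user, site, ideology in clicks:
        counts[user]["total"] += 1
        if ideology == "conservative":
            counts[user]["conservative"] += 1

    for user, c in sorted(counts.items()):
        print(f"{user}: {c['conservative'] / c['total']:.0%} conservative visits")
    ```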

    • They don’t actually have “navigate directly”; they have “log on,” which I guess means you came from a login page. Other routes are probably clickbait, email, advertising, bookmarks, and links from other sites. Some (many?) people probably just have those tabs open automatically when they open their browsers.

  3. Their Figure 1 could use some improvement — some large fraction of the bars aren’t labeled. It’s a shame, because that seems like a good place to get a handle on the data quality question, or part of it, anyway.

  4. I saw this presented at a conference and, while I don’t know much about comScore and the researchers seemed sober and thoughtful about it, I got the feeling that they too saw the data collection process on comScore’s end as something of a black box. It’s one of those things…they knew that maybe some other data would contradict comScore’s, but comScore’s is the data they have. I have some concerns about the quality of the data, but in some sense I don’t hate that they’ve shared it, since it lets us all ask these questions and decide what to make of it.

  5. Is it me, or does the writing fairly consistently confuse “percent of conservatives visiting this site” with “percent of site visitors who are conservative”?
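
    To spell out the difference with invented numbers (a toy population, nothing from the paper):

    ```python
    # Two different conditional probabilities that are easy to conflate.
    # All numbers invented for illustration.
    n_conservatives = 1000     # conservatives in the population
    n_cons_visitors = 300      # conservatives who visit a given site
    n_other_visitors = 100     # non-conservative visitors to the same site

    # "Percent of conservatives visiting this site": P(visit | conservative)
    p_visit_given_cons = n_cons_visitors / n_conservatives                       # 0.30

    # "Percent of site visitors who are conservative": P(conservative | visit)
    p_cons_given_visit = n_cons_visitors / (n_cons_visitors + n_other_visitors)  # 0.75

    print(f"P(visit | conservative) = {p_visit_given_cons:.0%}")   # 30%
    print(f"P(conservative | visit) = {p_cons_given_visit:.0%}")   # 75%
    ```

    Same site, same data, two very different numbers; sliding between them would make a site look far more (or less) ideologically mixed than it really is.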
