Fabio Rojas points me to this excellently-titled working paper by Joseph DiGrazia, Karissa McKelvey, Johan Bollen, and himself:

Is social media a valid indicator of political behavior? We answer this question using a random sample of 537,231,508 tweets from August 1 to November 1, 2010 and data from 406 competitive U.S. congressional elections provided by the Federal Election Commission. Our results show that the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election. This finding persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition. With over 500 million active users in 2012, Twitter now represents a new frontier for the study of human behavior. This research provides a framework for incorporating this emerging medium into the computational social science toolkit.

One charming thing about this paper—and I know this is going to sound patronizing but I don’t mean it to be—is that the authors (or, at least, whichever subset of the authors did the statistical work) are amateurs. They analyze the outcome in terms of total votes rather than vote proportion, even while coding the predictor as a proportion. They present regression coefficients to 7 significant figures. They report that they have data from two different election cycles but present only one in the paper (though they do have the other in their blog post).

But that’s all ok. They pulled out some interesting data. And, as I often say, *the most important aspect of a statistical analysis is not what you do with the data, it’s what data you use*.

**Tweets and votes**

As to the result itself, I’m not quite sure what to do with it. Here’s the key graph:

More tweets, more votes, indeed.

~~Of course most congressional elections are predictable. But the elections that are between 40-60 and 60-40, maybe not so much. So let’s look at the data there . . . Not such a strong pattern (and for the 2012 data in the 40-60% range it looks even worse; any correlation is swamped by the noise). That’s fine, and it’s not unexpected, it’s not a criticism of the paper but it indicates that the real gain in this analysis is not for predicting votes.~~

I’m not so convinced that tweets will be so useful in predicting votes—most congressional elections are predictable, but perhaps the prediction tool could be more relevant in low-information or multicandidate elections where prediction is not so easy.

Instead, it might make sense to flip it around and predict twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness. Also, while you’re at it, re-express vote share as vote proportion. And scale the size of each dot to the total number of tweets for the two candidates in the election.
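To make the suggestion concrete, here is a minimal sketch of the flipped analysis on simulated data (the numbers and the noise level are made up for illustration; this is not the DiGrazia et al. dataset): regress tweet share on vote proportion, weighting each race by its total tweet volume, which is the numerical analogue of scaling the dots.

```python
import random

random.seed(1)

# Hypothetical races: Republican vote proportion, Republican tweet share,
# and total tweets for the two candidates.  All simulated.
races = []
for _ in range(200):
    vote_prop = random.uniform(0.2, 0.8)
    total_tweets = random.randint(50, 5000)
    # Assume tweet share tracks vote proportion, plus noise.
    tweet_share = min(max(vote_prop + random.gauss(0, 0.1), 0.0), 1.0)
    races.append((vote_prop, tweet_share, total_tweets))

# Flipped regression: tweet share (y) on vote proportion (x),
# weighting each race by its total tweet count (w).
sw = sum(w for _, _, w in races)
mx = sum(w * x for x, _, w in races) / sw
my = sum(w * y for _, y, w in races) / sw
sxy = sum(w * (x - mx) * (y - my) for x, y, w in races)
sxx = sum(w * (x - mx) ** 2 for x, _, w in races)
slope = sxy / sxx
intercept = my - slope * mx
print(f"tweet_share is about {intercept:.2f} + {slope:.2f} * vote_prop")
```

The interesting quantity here is not the fitted line but the scatter around it: how much tweet shares vary for races of a given closeness.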

Move away from trying to predict votes and move toward trying to understand tweets. DiGrazia et al. write, “the models show that social media matters . . .” No, not at all. They find a correlation between candidate popularity and social media mentions. No-name and fringe candidates get fewer mentions (on average) than competitive and dominant candidates. That’s fine, you can go with that.

Again, I fear the above sounds patronizing, but I don’t mean it to be. You gotta start somewhere, and you’re nowhere without the data. As someone who was (originally) an outsider to the field of political science, I do think that researchers coming from other fields can offer valuable perspectives.

**Sharing the data**

What I want to know is, is this dataset publicly available? What would really make this go viral is if DiGrazia et al. post the data online. Then everyone will take a hack at it, and each of those people will cite them.

There’s been a lot of talk about reproducible research lately. In this case, they have a perfect incentive to make the data public: it will help them out, it will help out the research project, and it will be a great inspiration for others to follow in their footsteps. Releasing data as a publicity intensifier: that’s my new idea.

P.S. In the first version of this post I included a graph showing votes given tweet shares between 40% and 60%. I intended this to illustrate the difficulty of predicting close elections, but my graph really missed the point, because the x-axis represented close elections in tweet shares, not in votes. So I crossed that part out. If nothing else, I’ve demonstrated the difficulty of thinking about this sort of analysis!

This is great, thanks for covering the piece! This piece is short and quite new, with a lot more work to do. We are currently submitting this work to PLOS ONE, which has the plus of being open access with quick publication.

The results are roughly the same for vote share, and we saw no meaningful difference between the two models. We are still going through reviews and will definitely keep your criticisms in mind for the final version; hopefully the political science people will be happier for it.

It’s interesting that you suggest predicting tweets instead of votes. Using votes to predict tweets from the past seems unintuitive — but maybe not for a stats person who does this sort of thing all the time. Our audience is much broader, and we wanted to make the result as intuitive as possible. I think this is a really cool approach, though, and maybe we should try it out in the future. I think scaling by # of tweets is a good idea.

As for the data, my group at CNetS is very protective of it. Our gardenhose access can be pulled at any time, and offering raw tweet text with a popular paper is an easy way to get attention. We might release just the counts per candidate in those three months, though.

Like I said, thanks for covering this, and let one of us know if you have any other comments, Qs, or suggestions.

Karissa:

You write, “Using votes to predict tweets from the past seems unintuitive.” But think of the vote outcome not as a future event but as a measure of the candidate’s support in the population. That support is there during the campaign; it just happens to be measured by something (vote) that happens after the campaign is over. To put it mathematically, if x is election outcome, y is tweets, and z is support during the campaign, your tweet model is y given z, and your vote model is x given z. To a first approximation, x equals z, so you can just directly model y given x.
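A toy simulation makes the approximation tangible (all numbers are invented for illustration): generate latent support z, let votes x measure z almost exactly, let tweets y track z more noisily, and compare the regression of y on x with the regression of y on z.

```python
import random

random.seed(2)

# z = latent support during the campaign, x = vote outcome (a nearly
# noiseless post-campaign measurement of z), y = tweet share (driven by z).
n_races = 500
z = [random.uniform(0.3, 0.7) for _ in range(n_races)]
x = [zi + random.gauss(0, 0.01) for zi in z]   # votes measure z almost exactly
y = [zi + random.gauss(0, 0.05) for zi in z]   # tweets track z more noisily

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# The two slopes come out nearly identical: modeling y given x
# is a good stand-in for modeling y given z.
print(slope(x, y), slope(z, y))
```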

Hi Andrew,

Thanks for covering our paper. A few brief responses to your concerns about our analysis:

“They analyze the outcome in terms of total votes rather than vote proportion, even while coding the predictor as a proportion.”

We have actually run the models using both vote margin and vote proportion. Substantively the results are the same, but we liked the interpretation for the former better.

“They present regression coefficients to 7 significant figures.”

Fair enough, but the paper linked to is just a working paper.

“They report that they have data from two different election cycles but present only one in the paper (but they do have the other in their blog post).”

The figure on the blog is from a newer version of the paper than the one linked to on SSRN.

In what way is vote margin more intuitive or interpretable than vote proportion? Vote margin depends so heavily on context — seeing that plot, I’m not entirely sure what to make of it. I appreciate that you normalize the margin measure, but if anything that makes it less intuitive when we get to the marginal effects of tweets.

In other words, models and (especially) figures should make complicated things simpler. Looking at the effect on vote margin, I’m stuck trying to figure out precisely what that would mean given how different districts are. Proportions, on the other hand… that’s much easier, at first glance, to interpret.

Just me, though…

Karissa, from what [veeeery little] I’ve looked into the topic, while Twitter frowns upon publishing whole tweets (which would be the most useful), it should be alright to publish collections of tweet IDs for others to use to look them up. It would be totally awesome of y’all if you could publish those, at least, though, of course, be prudent and ask around if others have had any run-ins with Twitter because of this.

Tweet IDs might work, but there is the chance of people deleting tweets, or accounts, since time of collection. We will definitely consider that, though. Thanks!

What’s the reason to analyse proportion of vote; rather than vote margin?

A natural model for the votes in an election is that there’s a fixed number of voters, each of whom votes for a certain candidate with a certain probability, as determined by whatever covariate you are interested in. Modelling the vote margin seems kinda unreasonable, because in elections with small numbers of voters you may end up predicting vote margins that are very large relative to the size of the electorate.

A binomial GLM framework seems exactly appropriate for this sort of data, and I’m surprised it wasn’t used.

Fhnuzoag,

I think it makes sense to consider the tweets as an outcome. Think of the vote in the election not as a future event but as a measure of the candidate’s support in the population. That support is there during the campaign; it just happens to be measured by something (vote) that happens after the campaign is over. To put it mathematically, if x is election outcome, y is tweets, and z is support during the campaign, your tweet model is y given z, and your vote model is x given z. To a first approximation, x equals z, so you can just directly model y given x.

Also, no glm needed. The number of votes is so large that it can essentially be considered continuous.
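To see why, note that a binomial vote proportion has standard deviation sqrt(p(1-p)/n), which is tiny at congressional-district scale (the turnout figure below is a made-up round number):

```python
import math

# Standard deviation of a binomial vote proportion with n voters.
# For n on the order of a congressional district's turnout, it is
# about a tenth of a percentage point -- effectively continuous.
n, p = 200_000, 0.5
sd = math.sqrt(p * (1 - p) / n)
print(f"sd of vote proportion: {sd:.5f}")
```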

People who don’t use Twitter are more likely to be old and thus conservative. But they are also more likely, if they do post, to post in bulk using many fake accounts. I’m not sure if these two opposite biases balance out.