6 December 2016

Can we predict electoral outcomes from Wikipedia traffic?

2016 presidential candidate Donald Trump in a residential backyard near Jordan Creek Parkway and Cody Drive in West Des Moines, Iowa, with lights and security cameras. Image by Tony Webster (Flickr).

2016 presidential candidate Donald Trump in a residential backyard near Jordan Creek Parkway and Cody Drive in West Des Moines, Iowa, with lights and security cameras. Image by Tony Webster (Flickr).

As digital technologies become increasingly integrated into the fabric of social life their ability to generate large amounts of information about the opinions and activities of the population increases. The opportunities in this area are enormous: predictions based on socially generated data are much cheaper than conventional opinion polling, offer the potential to avoid classic biases inherent in asking people to report their opinions and behaviour, and can deliver results much quicker and be updated more rapidly.

In their article published in EPJ Data Science, Taha Yasseri and Jonathan Bright develop a theoretically informed prediction of election results from socially generated data combined with an understanding of the social processes through which the data are generated. They can thereby explore the predictive power of socially generated data while enhancing theory about the relationship between socially generated data and real world outcomes. Their particular focus is on the readership statistics of politically relevant Wikipedia articles (such as those of individual political parties) in the time period just before an election.

By applying these methods to a variety of different European countries in the context of the 2009 and 2014 European Parliament elections they firstly show that the relative change in number of page views to the general Wikipedia page on the election can offer a reasonable estimate of the relative change in election turnout at the country level. This supports the idea that increases in online information seeking at election time are driven by voters who are considering voting.

Second, they show that a theoretically informed model based on previous national results, Wikipedia page views, news media mentions, and basic information about the political party in question can offer a good prediction of the overall vote share of the party in question. Third, they present a model for predicting change in vote share (i.e., voters swinging towards and away from a party), showing that Wikipedia page-view data provide an important increase in predictive power in this context.

This relationship is exaggerated in the case of newer parties — consistent with the idea that voters don’t seek information uniformly about all parties at election time. Rather, they behave like ‘cognitive misers’, being more likely to seek information on new political parties with which they do not have previous experience and being more likely to seek information only when they are actually changing the way they vote.

In contrast, there was no evidence of a ‘media effect’: there was little correlation between news media mentions and overall Wikipedia traffic patterns. Indeed, the news media and Wikipedia appeared to be biased towards different things: with the news favouring incumbent parties, and Wikipedia favouring new ones.

Read the full article: Yasseri, T. and Bright, J. (2016) Wikipedia traffic data and electoral prediction: towards theoretically informed models. EPJ Data Science. 5 (1).

We caught up with the authors to explore the implications of the work.

Ed: Wikipedia represents a vast amount of not just content, but also user behaviour data. How did you access the page view stats — but also: is anyone building dynamic visualisations of Wikipedia data in real time?

Taha and Jonathan: Wikipedia makes its page view data available for free (in the same way as it makes all of its information available!). You can find the data here, along with some visualisations

Ed: Why did you use Wikipedia data to examine election prediction rather than (the I suppose the more fashionable) Twitter? How do they compare as data sources?

Taha and Jonathan: One of the big problems with using Twitter to predict things like elections is that contributing on social media is a very public thing and people are quite conscious of this. For example, some parties are seen as unfashionable so people might not make their voting choice explicit. Hence overall social media might seem to be saying one thing whereas actually people are thinking another.

By contrast, looking for information online on a website like Wikipedia is an essentially private activity so there aren’t these social biases. In other words, on Wikipedia we can directly have access to transactional data on what people do, rather than what they say or prefer to say.

Ed: How did these results and findings compare with the social media analysis done as part of our UK General Election 2015 Election Night Data Hack? (long title..)

Taha and Jonathan: The GE2015 data hack looked at individual politicians. We found that having a Wikipedia page is becoming increasingly important — over 40% of Labour and Conservative Party candidates had an individual Wikipedia page. We also found that this was highly correlated with Twitter presence — being more active on one network also made you more likely to be active on the other one. And we found some initial evidence that social media reaction was correlated with votes, though there is a lot more work to do here!

Ed: Can you see digital social data analysis replacing (or maybe just complementing) opinion polling in any meaningful way? And what problems would need to be addressed before that happened: e.g. around representative sampling, data cleaning, and weeding out bots?

Taha and Jonathan: Most political pundits are starting to look at a range of indicators of popularity — for example, not just voting intention, but also ratings of leadership competence, economic performance, etc. We can see good potential for social data to become part of this range of popularity indicator. However we don’t think it will replace polling just yet; the use of social media is limited to certain demographics. Also, the data collected from social media are often very shallow, not allowing for validation. In the case of Wikipedia, for example, we only know how many times each page is viewed, but we don’t know by how many people and from where.

Ed: You do a lot of research with Wikipedia data — has that made you reflect on your own use of Wikipedia?

Taha and Jonathan: It’s interesting to think about this activity of getting direct information about politicians — it’s essentially a new activity, something you couldn’t do in the pre-digital age. I know that I personally [Jonathan] use it to find out things about politicians and political parties — it would be interesting to know more about why other people are using it as well. This could have a lot of impacts. One thing Wikipedia has is a really long memory, in a way that other means of getting information on politicians (such as newspapers) perhaps don’t. We could start to see this type of thing becoming more important in electoral politics.

[Taha] .. since my research has been mostly focused on Wikipedia edit wars between human and bot editors, I have naturally become more cautious about the information I find on Wikipedia. When it comes to sensitive topics, sach as politics, Wikipedia is a good point to start, but not a great point to end the search!

Taha Yasseri and Jonathan Bright were talking to blog editor David Sutcliffe.

Note: This article gives the views of the authors, and not the position of the Policy and Internet Blog, nor of the Oxford Internet Institute.