Jonathan Bright, Oxford Internet Institute
Taha Yasseri, Oxford Internet Institute
The potential offered by socially generated big data for the social sciences is by now well documented. The "social web" (by which we mean platforms such as Twitter, Facebook, Google and Wikipedia) offers transactional data on a vast range of human activities, delivered at a scale and speed without precedent. Yet though the possibilities for research are significant, sceptical voices are also growing. Chief among the concerns raised is the extent to which socially generated data are representative of the population at large.
Nowhere is this problem more apparent than in the application of big data to the field of election prediction. The promise of using platforms such as Twitter for such predictions is seductive: polling could be updated every minute, in response to every political event (not to mention being targeted at hard-to-reach populations or elections where little polling has been commissioned). But the initial excitement has cooled, with a range of authors arguing that social web predictions perform little better than those offered by random chance, and significantly worse than their traditional opinion poll counterparts. A variety of sampling biases are thought to cause this problem: those using social media platforms are not representative of the population as a whole, and those actively contributing to social media are not even representative of social media users (for example, approximately 40% of active Twitter users never actually tweet).
However, despite the wealth of research in this area, as yet there have been few attempts to articulate a theory of how the generation of social big data might relate to (and hence help predict) underlying electoral processes. Rather, the majority of studies have focussed simply on the comparison of any metrics at hand (such as the number of Google searches for an individual candidate) with election results. It is inevitable, we argue, that such methods are found wanting.
This article aims to address this deficit. It makes two principal contributions. First, we develop a theory of the relationship between public behaviour on the social web and electoral politics, with particular reference to information seeking activities on Google and Wikipedia. Information seeking, we argue, is likely to differ between political systems (for example, countries with historically higher turnout are likely to generate wider information seeking). We assess the potential impact of a range of systemic level variables, including the electoral rules in operation, the structure of democracy (presidential versus parliamentary) and the number of linguistic groups in the country.
Information seeking is also likely to be altered by the context of individual elections. For example, elections with high turnout or which are hotly contested are more likely to generate information seeking than those which are poorly attended or which are a foregone conclusion. Here we assess a range of potentially relevant criteria, including the likelihood that the election will involve a transfer of power, and the presence of new or insurgent parties.
Our second contribution is to apply this theory of information seeking behaviour to the example of the European Parliament elections of May 2014. These elections, which are held in a broadly similar fashion simultaneously in 28 different countries, present a unique opportunity to test some of the propositions made by our theory of information seeking. We harvest data on both Google search volume and Wikipedia page views for the purposes of our study, comparing the performance of major political actors at election time with the amount of information seeking activity they generate in the build-up to the election. As a first step, we examine correlations between pairs of data streams in order to locate the best predictors; we then use a multivariate non-linear regression model based on the social data, with the aim of predicting turnout and party/candidate votes.
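The first of these two steps, the pairwise correlation screen, can be illustrated with a minimal sketch. It is not the implementation used in the study: the stream names, the choice of the Pearson coefficient, the helper functions and the numbers are all assumptions made for illustration, and the values are synthetic rather than drawn from the election data. The subsequent non-linear regression step is not covered.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(streams):
    """Correlation for every unordered pair of named data streams."""
    return {
        (a, b): pearson(streams[a], streams[b])
        for a, b in combinations(streams, 2)
    }


def rank_predictors(streams, target):
    """Rank every non-target stream by |r| with the target, strongest first."""
    scores = {
        name: pearson(values, streams[target])
        for name, values in streams.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Synthetic, made-up values for five hypothetical parties (NOT real data).
# The stream names are assumptions chosen to echo the sources named above.
streams = {
    "google_search_volume": [10.0, 20.0, 30.0, 40.0, 50.0],
    "wikipedia_page_views": [5.0, 1.0, 4.0, 2.0, 3.0],
    "vote_share": [12.0, 18.0, 33.0, 41.0, 48.0],
}

all_pairs = pairwise_correlations(streams)
ranking = rank_predictors(streams, target="vote_share")

for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by the absolute value of the coefficient treats a strongly negative relationship as equally informative as a strongly positive one; whether that is appropriate for a given data stream is a judgement the sketch does not settle.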
Overall, our paper contributes to the study of the application of big data by establishing more clearly the situations where predictions made from data on the social web are likely to mirror real world outcomes. It also generates new insight into the relationship between information seeking activities and electoral politics, and hence contributes to our understanding of how electoral politics functions in the digital age.