Could Counterfactuals Explain Algorithmic Decisions Without Opening the Black Box?
https://ensr.oii.ox.ac.uk/could-counterfactuals-explain-algorithmic-decisions-without-opening-the-black-box/ (15 January 2018)

The EU General Data Protection Regulation (GDPR) has sparked much discussion about the “right to explanation” for the algorithm-supported decisions made about us in our everyday lives. While there’s an obvious need for transparency in the automated decisions that are increasingly being made in areas like policing, education, healthcare and recruitment, explaining how these complex algorithmic decision-making systems arrive at any particular decision is a technically challenging problem—to put it mildly.

In their article “Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR”, forthcoming in the Harvard Journal of Law & Technology, Sandra Wachter, Brent Mittelstadt, and Chris Russell present the concept of “unconditional counterfactual explanations” as a novel type of explanation of automated decisions that could address many of these challenges. Counterfactual explanations describe the minimum conditions that would have led to an alternative decision (e.g. a bank loan being approved), without the need to describe the full logic of the algorithm.

Relying on counterfactual explanations as a means to help us act rather than merely to understand could help us gauge the scope and impact of automated decisions in our lives. They might also help bridge the gap between the interests of data subjects and data controllers, which might otherwise be a barrier to a legally binding right to explanation.

We caught up with the authors to explore the role of algorithms in our everyday lives, and how a “right to explanation” for decisions might be achievable in practice:

Ed: There’s a lot of discussion about algorithmic “black boxes” — where decisions are made about us, using data and algorithms about which we (and perhaps the operator) have no direct understanding. How prevalent are these systems?

Sandra: Basically, every decision that can be made by a human can now be made by an algorithm. Which can be a good thing. Algorithms (when we talk about artificial intelligence) are very good at spotting patterns and correlations that even experienced humans might miss, for example in predicting disease. They are also very cost efficient—they don’t get tired, and they don’t need holidays. This could help to cut costs, for example in healthcare.

Algorithms are also certainly more consistent than humans in making decisions. We have the famous example of judges varying the severity of their judgements depending on whether or not they’ve had lunch. That wouldn’t happen with an algorithm. That’s not to say algorithms are always going to make better decisions: but they do make more consistent ones. If the decision is bad, it’ll be distributed equally, but still be bad. Of course, in a certain way humans are also black boxes—we don’t understand what humans do either. But you can at least try to understand an algorithm: it can’t lie, for example.

Brent: In principle, any sector involving human decision-making could be prone to decision-making by algorithms. In practice, we already see algorithmic systems either making automated decisions or producing recommendations for human decision-makers in online search, advertising, shopping, medicine, criminal justice, etc. The information you consume online, the products you are recommended when shopping, the friends and contacts you are encouraged to engage with, even assessments of your likelihood to commit a crime in the immediate and long-term future—all of these tasks can currently be affected by algorithmic decision-making.

Ed: I can see that algorithmic decision-making could be faster and better than human decisions in many situations. Are there downsides?

Sandra: Simple algorithms that follow a basic decision tree (with parameters decided by people) can be easily understood. But we’re now also using much more complex systems like neural nets that act in a very unpredictable way, and that’s the problem. The system is also starting to become autonomous, rather than being under the full control of the operator. You will see the output, but not necessarily why it got there. This also happens with humans, of course: I could be told by a recruiter that my failure to land a job had nothing to do with my gender (even if it did); an algorithm, however, would not intentionally lie. But of course the algorithm might be biased against me if it’s trained on biased data—thereby reproducing the biases of our world.

We have seen that the COMPAS algorithm used by US judges to calculate the probability of re-offending when making sentencing and parole decisions is a major source of discrimination. Data provenance is massively important, and probably one of the reasons why we have biased decisions. We don’t necessarily know where the data comes from, and whether it’s accurate, complete, biased, etc. We need to have lots of standards in place to ensure that the data set is unbiased. Only then can the algorithm produce nondiscriminatory results.

A more fundamental problem with predictions is that you might never know what would have happened—as you’re just dealing with probabilities; with correlations in a population, rather than with causalities. Another problem is that algorithms might produce correct decisions, but not necessarily fair ones. We’ve been wrestling with the concept of fairness for centuries, without consensus. But lack of fairness is certainly something the system won’t correct itself—that’s something that society must correct.

Brent: The biases and inequalities that exist in the real world and in real people can easily be transferred to algorithmic systems. Humans training learning systems can inadvertently or purposefully embed biases into the model, for example through labelling content as ‘offensive’ or ‘inoffensive’ based on personal taste. Once learned, these biases can spread at scale, exacerbating existing inequalities. Eliminating these biases can be very difficult, hence we currently see much research done on the measurement of fairness or detection of discrimination in algorithmic systems.

These systems can also be very difficult—if not impossible—to understand, for experts as well as the general public. We might traditionally expect to be able to question the reasoning of a human decision-maker, even if imperfectly, but the rationale of many complex algorithmic systems can be highly inaccessible to people affected by their decisions. These potential risks aren’t necessarily reasons to forego algorithmic decision-making altogether; rather, they can be seen as potential effects to be mitigated through other means (e.g. a loan programme weighted towards historically disadvantaged communities), or at least to be weighed against the potential benefits when choosing whether or not to adopt a system.

Ed: So it sounds like many algorithmic decisions could be too complex to “explain” to someone, even if a right to explanation became law. But you propose “counterfactual explanations” as an alternative— i.e. explaining to the subject what would have to change (e.g. about a job application) for a different decision to be arrived at. How does this simplify things?

Brent: So rather than trying to explain the entire rationale of a highly complex decision-making process, counterfactuals allow us to provide simple statements about what would have needed to be different about an individual’s situation to get a different, preferred outcome. You basically work from the outcome: you say “I am here; what is the minimum I need to do to get there?” By providing simple statements that are generally meaningful, and that reveal a small bit of the rationale of a decision, the individual has grounds to change their situation or contest the decision, regardless of their technical expertise. Understanding even a bit of how a decision is made is better than being told “sorry, you wouldn’t understand”—at least in terms of fostering trust in the system.

Sandra: And the nice thing about counterfactuals is that they work with highly complex systems, like neural nets. They don’t explain why something happened, but they explain what happened. And three things people might want to know are:

(1) What happened: why did I not get the loan (or get refused parole, etc.)?

(2) Information so I can contest the decision if I think it’s inaccurate or unfair.

(3) Even if the decision was accurate and fair, tell me what I can do to improve my chances in the future.

Machine learning and neural nets make use of so much information that individuals really have no oversight of what is being processed, so it’s much easier to give someone an explanation of the key variables that affected the decision. With the counterfactual idea of a “close possible world” you give an indication of the minimal changes required to get what you actually want.
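To make this concrete, the search for a counterfactual can be framed as an optimisation problem: find the point closest to the original case for which the model returns the preferred outcome. The sketch below is purely illustrative (a toy logistic-regression loan model with made-up income and debt features, not the authors' implementation), but it shows the general shape of that search.

```python
# Toy illustration of a counterfactual search (not the paper's code):
# find the smallest change to a refused applicant's features that makes
# a trained classifier approve the loan.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: [income_k, debt_k] -> approved (1) / refused (0)
X = rng.normal([30, 10], [8, 5], size=(500, 2))
y = (X[:, 0] - 0.8 * X[:, 1] > 25).astype(int)
clf = LogisticRegression().fit(X, y)

x0 = np.array([24.0, 12.0])   # an applicant the model refuses

def objective(x, lam=10.0):
    # Penalise falling short of the approval threshold, plus the distance
    # from the applicant's actual situation ("close possible world").
    p_approve = clf.predict_proba(x.reshape(1, -1))[0, 1]
    shortfall = max(0.0, 0.5 - p_approve)
    return lam * shortfall ** 2 + np.abs(x - x0).sum()

res = minimize(objective, x0, method="Nelder-Mead")
print("Original applicant:   ", x0)
print("Counterfactual found: ", res.x.round(2))  # minimal change for approval
```

A real system would need careful choices of distance measure and constraints on which features can plausibly change; the point here is only the "work from the outcome" logic Brent describes.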

Ed: So would a series of counterfactuals (e.g. “over 18”, “no prior convictions”, “no debt”) essentially define a space within which a certain decision is likely to be reached? This decision space could presumably be graphed quite easily, to help people understand what factors will likely be important in reaching a decision?

Brent: This would only work for highly simplistic, linear models, which are not normally the type that confound human capacities for understanding. The complex systems that we refer to as ‘black boxes’ are highly dimensional and involve a multitude of (probabilistic) dependencies between variables that can’t be graphed simply. It may be the case that if I were aged between 35 and 40 with an income of £30,000, I would not get a loan. But I could be told that if I had an income of £35,000, I would have gotten the loan. I may then assume that an income over £35,000 guarantees me a loan in the future. But it may turn out that I would be refused a loan with an income above £40,000 because of a change in tax bracket. Non-linear relationships of this type can make it misleading to graph decision spaces. For simple linear models, such a graph may be a very good idea; for black box systems it could, in fact, be highly misleading.

Chris: As Brent says, we’re concerned with understanding complicated algorithms that don’t just use hard cut-offs based on binary features. To use your example, maybe a little bit of debt is acceptable, but it would increase your risk of default slightly, so the amount of money you need to earn would go up. Or maybe certain past convictions also only increase your risk of defaulting slightly, and can be compensated for with a higher salary. It’s not at all obvious how you could graph these complicated interdependencies over many variables together. This is why we settled on counterfactuals as a way to give people a direct and easy-to-understand path to move from the decision they got now to a more favourable one at a later date.

Ed: But could a counterfactual approach just end up kicking the can down the road, if we know “how” a particular decision was reached, but not “why” the algorithm was weighted in such a way to produce that decision?

Brent: It depends what we mean by “why”. If this is “why” in the sense of why the system was designed this way, to consider this type of data for this task, then we should be asking these questions while these systems are designed and deployed. Counterfactuals address decisions that have already been made, but they can still reveal uncomfortable knowledge about a system’s design and functionality. So they can certainly inform “why” questions.

Sandra: Just to echo Brent, we don’t want to imply that asking the “why” is unimportant—I think it’s very important, and interpretability as a field has to be pursued, particularly if we’re using algorithms in highly sensitive areas. Even if we have the “what”, the “why” question is still necessary to ensure the safety of those systems.

Chris: And anyone who’s talked to a three-year-old knows there is an endless stream of “why” questions that can be asked. But already, counterfactuals provide a major step forward in answering why, compared to previous approaches that were concerned with providing approximate descriptions of how algorithms make decisions—but not the “why” or the external facts leading to that decision. I think when judging the strength of an explanation, you also have to look at questions like “How easy is this to understand?” and “How does this help the person I’m explaining things to?” For me, counterfactuals are a more immediately useful explanation than something which explains where the weights came from. Even if you did know, what could you do with that information?

Ed: I guess the question of algorithmic decision making in society involves a hugely complex intersection of industry, research, and policy making? Are we in control of things?

Sandra: Artificial intelligence (and the technology supporting it) is an area where many sectors are now trying to work together, including in the crucial areas of fairness, transparency and accountability of algorithmic decision-making. I feel at the moment we see a very multi-stakeholder approach, and I hope that continues in the future. We can see for example that industry is very concerned with it—the Partnership on AI is addressing these topics and trying to come up with a set of industry guidelines, recognising the responsibilities inherent in producing these systems. There are also lots of data scientists (e.g. at the OII and the Turing Institute) working on these questions. Policy-makers around the world (e.g. UK, EU, US, China) are preparing their countries for the AI future, so it’s on everybody’s mind at the moment. It’s an extremely important topic.

Law and ethics obviously have an important role to play. The opacity and unpredictability of AI, and its potentially discriminatory nature, require that we think about the legal and ethical implications very early on. That starts with educating the coding community, and ensuring diversity. At the same time, it’s important to have an interdisciplinary approach. At the moment we’re focusing a bit too much on the STEM subjects; there’s a lot of funding going to those areas (which makes sense, obviously), but the social sciences are currently a bit neglected despite the major role they play in recognising things like discrimination and bias, which you might not recognise from just looking at code.

Brent: Yes—and we’ll need much greater interaction and collaboration between these sectors to stay ‘in control’ of things, so to speak. Policy always has a tendency to lag behind technological developments; the challenge here is to stay close enough to the curve to prevent major issues from arising. The potential for algorithms to transform society is massive, so ensuring a quicker and more reflexive relationship between these sectors than normal is absolutely critical.

Read the full article: Sandra Wachter, Brent Mittelstadt, Chris Russell (2018) Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Harvard Journal of Law & Technology (Forthcoming).

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.


Sandra Wachter, Brent Mittelstadt and Chris Russell were talking to blog editor David Sutcliffe.

Can we predict electoral outcomes from Wikipedia traffic?
https://ensr.oii.ox.ac.uk/can-we-predict-electoral-outcomes-from-wikipedia-traffic/ (6 December 2016)

As digital technologies become increasingly integrated into the fabric of social life, their ability to generate large amounts of information about the opinions and activities of the population increases. The opportunities in this area are enormous: predictions based on socially generated data are much cheaper than conventional opinion polling, offer the potential to avoid classic biases inherent in asking people to report their opinions and behaviour, and can deliver results much more quickly and be updated more rapidly.

In their article published in EPJ Data Science, Taha Yasseri and Jonathan Bright develop a theoretically informed prediction of election results from socially generated data combined with an understanding of the social processes through which the data are generated. They can thereby explore the predictive power of socially generated data while enhancing theory about the relationship between socially generated data and real world outcomes. Their particular focus is on the readership statistics of politically relevant Wikipedia articles (such as those of individual political parties) in the time period just before an election.

By applying these methods to a variety of different European countries in the context of the 2009 and 2014 European Parliament elections they firstly show that the relative change in number of page views to the general Wikipedia page on the election can offer a reasonable estimate of the relative change in election turnout at the country level. This supports the idea that increases in online information seeking at election time are driven by voters who are considering voting.

Second, they show that a theoretically informed model based on previous national results, Wikipedia page views, news media mentions, and basic information about the political party in question can offer a good prediction of the overall vote share of the party in question. Third, they present a model for predicting change in vote share (i.e., voters swinging towards and away from a party), showing that Wikipedia page-view data provide an important increase in predictive power in this context.

This relationship is exaggerated in the case of newer parties — consistent with the idea that voters don’t seek information uniformly about all parties at election time. Rather, they behave like ‘cognitive misers’, being more likely to seek information on new political parties with which they do not have previous experience and being more likely to seek information only when they are actually changing the way they vote.

In contrast, there was no evidence of a ‘media effect’: there was little correlation between news media mentions and overall Wikipedia traffic patterns. Indeed, the news media and Wikipedia appeared to be biased towards different things: with the news favouring incumbent parties, and Wikipedia favouring new ones.
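Schematically, the kind of model described above (combining previous results, Wikipedia page-view changes, media mentions and party characteristics) can be sketched as a simple regression. The code below uses simulated data and is purely illustrative; it is not the specification used in the paper.

```python
# Schematic sketch with simulated data (not the paper's model): predict a
# party's vote share from previous share, relative change in Wikipedia
# page views, news mentions, and whether the party is new.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 60  # hypothetical party-country observations

prev_share = rng.uniform(0, 40, n)        # % at the previous election
view_change = rng.normal(0, 1, n)         # relative change in article page views
news_mentions = rng.poisson(20, n)        # media mentions before the election
new_party = rng.integers(0, 2, n)         # 1 if the party is new

X = np.column_stack([prev_share, view_change, news_mentions, new_party])
# Simulated outcome: page views matter more for new parties, as the study suggests
vote_share = prev_share + 2 * view_change * (1 + new_party) + rng.normal(0, 2, n)

model = LinearRegression().fit(X, vote_share)
print(dict(zip(["prev_share", "view_change", "news", "new_party"], model.coef_.round(2))))
```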

Read the full article: Yasseri, T. and Bright, J. (2016) Wikipedia traffic data and electoral prediction: towards theoretically informed models. EPJ Data Science. 5 (1).

We caught up with the authors to explore the implications of the work.

Ed: Wikipedia represents a vast amount of not just content, but also user behaviour data. How did you access the page view stats — but also: is anyone building dynamic visualisations of Wikipedia data in real time?

Taha and Jonathan: Wikipedia makes its page view data available for free (in the same way as it makes all of its information available!). You can find the data online, along with some visualisations.
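For anyone wanting to try this themselves, per-article counts can now be pulled from the Wikimedia REST pageviews API (note that this endpoint only covers recent years; studies of earlier elections, like this one, rely on the raw page-view dumps). A minimal example, with an illustrative article title and date range:

```python
# Minimal example of fetching daily page-view counts from the Wikimedia
# REST API. The article title and date range are illustrative.
import requests

article = "European_Parliament"
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    f"en.wikipedia/all-access/user/{article}/daily/2016050100/2016053100"
)
resp = requests.get(url, headers={"User-Agent": "pageview-demo/0.1 (example)"})
resp.raise_for_status()
for item in resp.json()["items"]:
    print(item["timestamp"], item["views"])
```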

Ed: Why did you use Wikipedia data to examine election prediction rather than (I suppose the more fashionable) Twitter? How do they compare as data sources?

Taha and Jonathan: One of the big problems with using Twitter to predict things like elections is that contributing on social media is a very public thing and people are quite conscious of this. For example, some parties are seen as unfashionable so people might not make their voting choice explicit. Hence overall social media might seem to be saying one thing whereas actually people are thinking another.

By contrast, looking for information online on a website like Wikipedia is an essentially private activity, so there aren’t these social biases. In other words, on Wikipedia we have direct access to transactional data on what people do, rather than what they say or prefer to say.

Ed: How did these results and findings compare with the social media analysis done as part of our UK General Election 2015 Election Night Data Hack? (long title..)

Taha and Jonathan: The GE2015 data hack looked at individual politicians. We found that having a Wikipedia page is becoming increasingly important — over 40% of Labour and Conservative Party candidates had an individual Wikipedia page. We also found that this was highly correlated with Twitter presence — being more active on one network also made you more likely to be active on the other one. And we found some initial evidence that social media reaction was correlated with votes, though there is a lot more work to do here!

Ed: Can you see digital social data analysis replacing (or maybe just complementing) opinion polling in any meaningful way? And what problems would need to be addressed before that happened: e.g. around representative sampling, data cleaning, and weeding out bots?

Taha and Jonathan: Most political pundits are starting to look at a range of indicators of popularity — for example, not just voting intention, but also ratings of leadership competence, economic performance, etc. We can see good potential for social data to become part of this range of popularity indicators. However we don’t think it will replace polling just yet; the use of social media is limited to certain demographics. Also, the data collected from social media are often very shallow, not allowing for validation. In the case of Wikipedia, for example, we only know how many times each page is viewed, but we don’t know by how many people and from where.

Ed: You do a lot of research with Wikipedia data — has that made you reflect on your own use of Wikipedia?

Taha and Jonathan: It’s interesting to think about this activity of getting direct information about politicians — it’s essentially a new activity, something you couldn’t do in the pre-digital age. I know that I personally [Jonathan] use it to find out things about politicians and political parties — it would be interesting to know more about why other people are using it as well. This could have a lot of impacts. One thing Wikipedia has is a really long memory, in a way that other means of getting information on politicians (such as newspapers) perhaps don’t. We could start to see this type of thing becoming more important in electoral politics.

[Taha]: Since my research has been mostly focused on Wikipedia edit wars between human and bot editors, I have naturally become more cautious about the information I find on Wikipedia. When it comes to sensitive topics, such as politics, Wikipedia is a good point to start, but not a great point to end the search!


Taha Yasseri and Jonathan Bright were talking to blog editor David Sutcliffe.

Topic modelling content from the “Everyday Sexism” project: what’s it all about?
https://ensr.oii.ox.ac.uk/topic-modelling-content-from-the-everyday-sexism-project-whats-it-all-about/ (3 March 2016)

We recently announced the start of an exciting new research project that will involve the use of topic modelling to understand the patterns in stories submitted to the Everyday Sexism website. Here, we briefly explain our text analysis approach, “topic modelling”.

At its very core, topic modelling is a technique that seeks to automatically discover the topics contained within a group of documents. ‘Documents’ in this context could refer to text items as lengthy as individual books, or as short as sentences within a paragraph. Let’s take the idea of sentences-as-documents as an example:

  • Document 1: I like to eat kippers for breakfast.
  • Document 2: I love all animals, but kittens are the cutest.
  • Document 3: My kitten eats kippers too.

Assuming that each sentence contains a mixture of different topics (and that a ‘topic’ can be understood as a collection of words (of any part of speech) that have different probabilities of appearance in passages discussing the topic), how does the topic modelling algorithm ‘discover’ the topics within these sentences?

The algorithm is initiated by setting the number of topics that it needs to extract. Of course, it is hard to guess this number without having some insight into the topics, but one can think of it as a resolution tuning parameter: the smaller the number of topics, the more general the bag of words in each topic will be, and the looser the connections between them.

The algorithm loops through all of the words in each document, assigning every word to one of our topics in a temporary and semi-random manner. This initial assignment is arbitrary, and it is easy to show that different initializations lead to the same results in the long run. Once each word has been assigned a temporary topic, the algorithm then re-iterates through each word in each document to update the topic assignment using two criteria: 1) How prevalent is the word in question across topics? And 2) How prevalent are the topics in the document?

To quantify these two criteria, the algorithm calculates the likelihood of the observed words appearing in each document, given the current assignment of words to topics and of topics to documents.

Of course words can appear in different topics and more than one topic can appear in a document. But the iterative algorithm seeks to maximize the self-consistency of the assignment by maximizing the likelihood of the observed word-document statistics. 

We can illustrate this process and its outcome by going back to our example. A topic modelling approach might use the process above to discover the following topics across our documents:

  • Document 1: I like to eat kippers for breakfast. [100% Topic A]
  • Document 2: I love all animals, but kittens are the cutest. [100% Topic B]
  • Document 3: My kitten eats kippers too. [67% Topic A, 33% Topic B]

Topic modelling defines each topic as a so-called ‘bag of words’, but it is the researcher’s responsibility to decide upon an appropriate label for each topic based on their understanding of language and context. Going back to our example, the algorithm might classify the food-related words (“eat”, “kippers”, “breakfast”) under Topic A, which we could then label as ‘food’ based on our understanding of what the words mean. Similarly, the animal-related words (“animals”, “kittens”, “kitten”) might be classified under a separate topic, Topic B, which we could label ‘animals’. In this simple example the word “eat” has appeared in a sentence dominated by Topic A, but also in a sentence with some association to Topic B. Therefore it can also be seen as a connector of the two topics. Of course animals eat too, and they like food!
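For readers who want to see this in practice, the snippet below fits a two-topic model to the three example sentences using scikit-learn's off-the-shelf LDA implementation. This is a toy illustration rather than the pipeline used in the project, and with a corpus this tiny the topics it recovers are only indicative.

```python
# Toy two-topic model over the three example sentences using scikit-learn's
# LDA implementation. With so little text the result is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I like to eat kippers for breakfast.",
    "I love all animals, but kittens are the cutest.",
    "My kitten eats kippers too.",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)      # per-document topic mixtures

words = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {k}:", top_words)
print(doc_topics.round(2))                  # each row sums to roughly 1
```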

We are going to use a similar approach to first extract the main topics reflected in the reports submitted to the Everyday Sexism Project website, and then to examine the relations between the sexism-related topics and concepts based on the overlap between the bags of words of each topic. Finally, we can also look into the co-appearance of topics in the same document. In this way we aim to draw a linguistic picture of the more than 100,000 submitted reports.

As ever, be sure to check back for further updates on our progress!

How big data is breathing new life into the smart cities concept
https://ensr.oii.ox.ac.uk/how-big-data-is-breathing-new-life-into-the-smart-cities-concept/ (23 July 2015)

“Big data” is a growing area of interest for public policy makers: for example, it was highlighted in UK Chancellor George Osborne’s recent budget speech as a major means of improving efficiency in public service delivery. While big data can apply to government at every level, the majority of innovation is currently being driven by local government, especially cities, who perhaps have greater flexibility and room to experiment and who are constantly on a drive to improve service delivery without increasing budgets.

Work on big data for cities is increasingly incorporated under the rubric of “smart cities”. The smart city is an old(ish) idea: give urban policymakers real-time information on a whole variety of indicators about their city (from traffic and pollution to park usage and waste bin collection) and they will be able to improve decision making and optimise service delivery. But the initial vision, which mostly centred around adding sensors and RFID tags to objects around the city so that they would be able to communicate, has thus far remained unrealised (big up-front investment needs and the requirements of IPv6 are perhaps the most obvious reasons for this).

The rise of big data – large, heterogeneous datasets generated by the increasing digitisation of social life – has however breathed new life into the smart cities concept. If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all? Instead, policymakers can simply extract what they need from a sea of data which is already around them. And indeed, data from mobile phone operators has already been used for traffic optimisation, Oyster card data has been used to plan London Underground service interruptions, sewage data has been used to estimate population levels … the examples go on.

However, at the moment these examples remain largely anecdotal, driven forward by a few cities rather than adopted worldwide. The big data driven smart city faces considerable challenges if it is to become a default means of policymaking rather than a conversation piece. Getting access to the right data; correcting for biases and inaccuracies (not everyone has a GPS, phone, or expresses themselves on social media); and communicating it all to executives remain key concerns. Furthermore, especially in a context of tight budgets, most local governments cannot afford to experiment with new techniques which may not pay off instantly.

This is the context of two current OII projects in the smart cities field: UrbanData2Decide (2014-2016) and NEXUS (2015-2017). UrbanData2Decide joins together a consortium of European universities, each working with a local city partner, to explore how local government problems can be resolved with urban-generated data. In Oxford, we are looking at how open mapping data can be used to estimate alcohol availability; how website analytics can be used to estimate service disruption; and how internal administrative data and social media data can be used to estimate population levels. The best concepts will be built into an application which allows decision makers to access them in real time.

NEXUS builds on this work. A collaborative partnership with BT, it will look at how social media data and some internal BT data can be used to estimate people movement and traffic patterns around the city, joining these data into network visualisations which are then displayed to policymakers in a data visualisation application. Both projects fill an important gap by allowing city officials to experiment with data driven solutions, providing proof of concepts and showing what works and what doesn’t. Increasing academic-government partnerships in this way has real potential to drive forward the field and turn the smart city vision into a reality.


OII Research Fellow Jonathan Bright is a political scientist specialising in computational and ‘big data’ approaches to the social sciences. His major interest concerns studying how people get information about the political process, and how this is changing in the internet era.

Digital Disconnect: Parties, Pollsters and Political Analysis in #GE2015
https://ensr.oii.ox.ac.uk/digital-disconnect-parties-pollsters-and-political-analysis-in-ge2015/ (11 May 2015)

[Image: The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election. See more photos from the OII’s election night party, or read about the data hack.]

[Figure: Counts of public Facebook posts mentioning any of the party leaders’ surnames. Data generated by social media can be used to understand political behaviour and institutions on an ongoing basis.]

‘Congratulations to my friend @Messina2012 on his role in the resounding Conservative victory in Britain’ tweeted David Axelrod, campaign advisor to Miliband, to his former colleague Jim Messina, Cameron’s strategy adviser, on May 8th. The former was Obama’s communications director and the latter campaign manager of Obama’s 2012 campaign. Along with other consultants and advisors, and large-scale data management platforms from Obama’s hugely successful digital campaigns, the Conservatives and Labour used an arsenal of social media and digital tools to interact with voters throughout the campaign, as did all the parties competing for seats in the 2015 election.

The parties ran very different kinds of digital campaigns. The Conservatives used advanced data science techniques borrowed from the US campaigns to understand how their policy announcements were being received and to target groups of individuals. They spent ten times as much as Labour on Facebook, using ads targeted at Facebook users according to their activities on the platform, geo-location and demographics. This was a top-down strategy that involved working out what was happening on social media and responding with targeted advertising, particularly for marginal seats. It was supplemented by the mainstream media, such as the Telegraph for example, which contacted its database of readers and subscribers to services such as Telegraph Money, urging them to vote Conservative. As Andrew Cooper tweeted after the election, ‘Big data, micro-targeting and social media campaigns just thrashed “5 million conversations” and “community organizing”’.

He has a point. Labour took a different approach to social media. Widely acknowledged to have the most boots on the real ground, knocking on doors, they took a similar ‘ground war’ approach to social media in local campaigns. Our own analysis at the Oxford Internet Institute shows that of the 450K tweets sent by candidates of the six largest parties in the month leading up to the general election, Labour party candidates sent over 120,000 while the Conservatives sent only 80,000, no more than the Greens and not much more than UKIP. But the greater number of Labour tweets was no more productive in terms of impact (measured in terms of mentions generated — and indeed the final result).

Both parties’ campaigns were tightly controlled. Ostensibly, Labour generated far more bottom-up activity from supporters using social media, through memes like #votecameronout, #milibrand (responding to Miliband’s interview with Russell Brand), and what Miliband himself termed the most unlikely cult of the 21st century in his resignation speech, #milifandom, none of which came directly from Central Office. These produced peaks of activity on Twitter that at some points exceeded even discussion of the election itself on the semi-official #GE2015 used by the parties, as the figure below shows. But the party remained aloof from these conversations, fearful of mainstream media mockery.

The Brand interview was agreed to out of desperation and can have made little difference to the vote (partly because Brand endorsed Miliband only after the deadline for voter registration: young voters suddenly overcome by an enthusiasm for participatory democracy after Brand’s public volte-face on the utility of voting will have remained disenfranchised). But engaging with the swathes of young people who spend increasing amounts of their time on social media is a strategy for engagement that all parties ought to consider. YouTubers like PewDiePie have tens of millions of subscribers and billions of video views — their videos may seem unbelievably silly to many, but it is here that a good chunk of the next generation of voters are to be found.

[Figure: Use of emergent hashtags on Twitter during the 2015 General Election. Volumes are estimates based on a 10% sample with the exception of #ge2015, which reflects the exact value. All data from Datasift.]

Only one of the leaders had a presence on social media that managed anything like the personal touch and universal reach that Obama achieved in 2008 and 2012 based on sustained engagement with social media — Nicola Sturgeon. The SNP’s use of social media, developed in last September’s referendum on Scottish independence, had spawned a whole army of digital activists. All SNP candidates started the campaign with a Twitter account. When we look at the 650 local campaigns waged across the country, by far the most productive in the sense of generating mentions was the SNP: 100 tweets from SNP local candidates generated 10 times more mentions (1,000) than 100 tweets from (for example) the Liberal Democrats.

Scottish Labour’s failure to engage with the Scottish people in this kind of way illustrates how difficult it is to suddenly develop relationships on social media — followers on all platforms are built up over years, not in the short space of a campaign. In strong contrast, advertising on these platforms as the Conservatives did is instantaneous, and based on the data science understanding (through advertising algorithms) of the platform itself. It doesn’t require huge databases of supporters — it doesn’t build up relationships between the party and supporters — indeed, they may remain anonymous to the party. It’s quick, dirty and effective.

The pollsters’ terrible night

So neither of the two largest parties really did anything with social media, or the huge databases of interactions that their platforms will have generated, to build long-running engagement with the electorate. The campaigns were disconnected from their supporters, from their grass roots.

But the differing use of social media by the parties could offer a clue as to why the opinion polls throughout the campaign got it so wrong, underestimating the Conservative lead by an average of five per cent. The social media data that may be gathered from this or any campaign is a valuable source of information about what the parties are doing, how they are being received, and what people are thinking or talking about in this important space — where so many people spend so much of their time. Of course, it is difficult to read from the outside; Andrew Cooper labeled the Conservatives’ campaign of big data to identify undecided voters, and micro-targeting on social media, as ‘silent and invisible’, and it seems to have been so to the polls.

Many voters were undecided until the last minute, or decided not to vote, which is impossible to predict with polls (bar the exit poll) — but possibly observable on social media, such as the spikes in attention to UKIP on Wikipedia towards the end of the campaign, which may have signaled their impressive share of the vote. As Jim Messina put it to MSNBC News, following up on his May 8th tweet that UK (and US) polling was ‘completely broken’: ‘people communicate in different ways now’, arguing that the Miliband campaign had tried to go back to the 1970s.

Surveys – such as polls — give a (hopefully) representative picture of what people think they might do. Social media data provide an (unrepresentative) picture of what people really said or did. Long-running opinion surveys (such as the Ipsos MORI Issues Index) can monitor the hopes and fears of the electorate in between elections, but attention tends to focus on the huge barrage of opinion polls at election time – which are geared entirely at predicting the election result, and which do not contribute to more general understanding of voters. In contrast, social media are a good way to track rapid bursts in mobilization or support, which reflect immediately on social media platforms – and could also be developed to illustrate more long running trends, such as unpopular policies or failing services.

As opinion surveys face more and more challenges, there is surely good reason to supplement them with social media data, which reflect what people are really thinking on an ongoing basis — more like a video than the irregular snapshots taken by polls. As leading pollster João Francisco Meira, director of Vox Populi in Brazil (which is doing innovative work in using social media data to understand public opinion), put it in conversation with one of the authors in April: ‘we have spent so long trying to hear what people are saying — now they are crying out to be heard, every day’. It is a question of pollsters working out how to listen.

Political big data

Analysts of political behaviour — academics as well as pollsters — need to pay attention to these data. At the OII we gathered large quantities of data from Facebook, Twitter, Wikipedia and YouTube in the lead-up to the election campaign, including mentions of all candidates (as did Demos’s Centre for the Analysis of Social Media). Using these data we will be able, for example, to work out the relationship between local social media campaigns and the parties’ share of the vote, as well as to model the relationship between social media presence and turnout.

We can already see that the story of the local campaigns varied enormously — while at the start of the campaign some candidates were probably requesting new passwords for their rusty Twitter accounts, others already had an ongoing relationship with their constituents (or potential constituents), which they could build on during the campaign. One of the candidates for the Labour party leadership, Chuka Umunna, joined Twitter in April 2009 and now has 100K followers, which will be useful in the forthcoming leadership contest.

Election results inject data into a research field that lacks ‘big data’. Data-hungry political scientists will analyse these data in every way imaginable for the next five years. But data in between elections, for example relating to democratic or civic engagement or political mobilization, have traditionally been woefully scarce in our discipline. Analysis of the social media campaigns in #GE2015 will start to provide a foundation to understand patterns and trends in voting behaviour, particularly when linked to other sources of data, such as the actual constituency-level voting results and even discredited polls — which may yet yield insight, even having failed to achieve their predictive aims. As the OII’s Jonathan Bright and Taha Yasseri have argued, we need ‘a theory-informed model to drive social media predictions, that is based on an understanding of how the data is generated and hence enables us to correct for certain biases’.

A political data science

Parties, pollsters and political analysts should all be thinking about these digital disconnects in #GE2015, rather than burying them with their hopes for this election. As I argued in a previous post, let’s use data generated by social media to understand political behaviour and institutions on an ongoing basis. Let’s find a way of incorporating social media analysis into polling models, for example by linking survey datasets to big data of this kind. The more such activity moves beyond the election campaign itself, the more useful social media data will be in tracking the underlying trends and patterns in political behavior.

And for the parties, these kinds of ways of understanding and interacting with voters need to be institutionalized in party structures, from top to bottom. On 8th May, the VP of a policy think-tank tweeted to both Axelrod and Messina ‘Gentlemen, welcome back to America. Let’s win the next one on this side of the pond’. The UK parties are on their own now. We must hope they use the time to build an ongoing dialogue with citizens and voters, learning from the success of the new online interest group barons, such as 38 Degrees and Avaaz, by treating all internet contacts as ‘members’ and interacting with them on a regular basis. Don’t wait until 2020!


Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics, investigating political behaviour, digital government and government-citizen interactions in the age of the internet, social media and big data. She has published over a hundred books, articles and major research reports in this area, including Political Turbulence: How Social Media Shape Collective Action (with Peter John, Scott Hale and Taha Yasseri, 2015).

Scott A. Hale is a Data Scientist at the OII. He develops and applies techniques from computer science to research questions in the social sciences. He is particularly interested in the area of human-computer interaction and the spread of information between speakers of different languages online and the roles of bilingual Internet users. He is also interested in collective action and politics more generally.

Two years after the NYT’s ‘Year of the MOOC’: how much do we actually know about them?
https://ensr.oii.ox.ac.uk/two-years-after-the-nyts-year-of-the-mooc-how-much-do-we-actually-know-about-them/ (13 November 2014)

[Figure: Timeline of the development of MOOCs and open education, from: Yuan, Li, and Stephen Powell. MOOCs and Open Education: Implications for Higher Education White Paper. University of Bolton: CETIS, 2013.]

Ed: Does research on MOOCs differ in any way from existing research on online learning?

Rebecca: Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning. Many of the trends we’ve observed (the prevalence of forum lurking; community formation; etc.) have been studied previously and are supported by earlier findings. That said, the combination of scale, global-reach, duration, and “semi-synchronicity” of MOOCs have made them different enough to inspire this work. In particular, the optional nature of participation among a global-body of lifelong learners for a short burst of time (e.g. a few weeks) is a relatively new learning environment that, despite theoretical ties to existing educational research, poses a new set of challenges and opportunities.

Ed: The MOOC forum networks you modelled seemed to be less efficient at spreading information than randomly generated networks. Do you think this inefficiency is due to structural constraints of the system (or just because inefficiency is not selected against); or is there something deeper happening here, maybe saying something about the nature of learning, and networked interaction?

Rebecca: First off, it’s important to not confuse the structural “inefficiency” of communication with some inherent learning “inefficiency”. The inefficiency in the sub-forums is a matter of information diffusion—i.e., because there are communities that form in the discussion spaces, these communities tend to “trap” knowledge and information instead of promoting the spread of these ideas to a vast array of learners. This information diffusion inefficiency is not necessarily a bad thing, however. It’s a natural human tendency to form communities, and there is much education research that says learning in small groups can be much more beneficial / effective than large-scale learning. The important point that our work hopes to make is that the existence and nature of these communities seems to be influenced by the types of topics that are being discussed (and vice versa)—and that educators may be able to cultivate more isolated or inclusive network dynamics in these course settings by carefully selecting and presenting these different discussion topics to learners.
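The intuition that tightly knit communities slow down information diffusion can be illustrated with a small network experiment. The sketch below is a toy illustration (not the analysis from the paper): it compares a clustered "caveman" network with a degree-preserving random rewiring of the same network.

```python
# Toy illustration (not the paper's analysis): communities "trap" information,
# so a clustered network tends to be less efficient at spreading it than a
# random network with the same degree sequence.
import networkx as nx

G = nx.connected_caveman_graph(5, 10)     # five dense communities, loosely linked

R = G.copy()
nx.double_edge_swap(R, nswap=100, max_tries=10000, seed=0)  # keep degrees, break communities

print("Clustered network efficiency:", round(nx.global_efficiency(G), 3))
print("Rewired network efficiency:  ", round(nx.global_efficiency(R), 3))
```

The rewired version typically shows noticeably higher global efficiency, mirroring the finding that the observed forum networks diffuse information less efficiently than comparable random networks.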

Ed: Drawing on surveys and learning outcomes you could categorise four ‘learner types’, who tend to behave differently in the network. Could the network be made more efficient by streaming groups by learning objective, or by type of interaction (eg learning / feedback / social)?

Rebecca: Given our network vulnerability analysis, it appears that discussions that focus on problems or issues grounded in real-life examples (e.g. those that relate to case studies of real companies and the analyses learners post about them) tend to promote more inclusive engagement and efficient information diffusion. Given that certain types of learners participate in these discussions, one could argue that forming groups around learning preferences and objectives could promote more efficient communications. Still, it’s important to be aware of the potential drawbacks, namely that encouraging like-minded or similar people to interact mainly with one another could limit the “learning through diverse exposures” that these massive-scale settings are well-suited to promote.

Ed: In the classroom, the teacher can encourage participation and discussion if it flags: are there mechanisms to trigger or seed interaction if the levels of network activity fall below a certain threshold? How much real-time monitoring tends to occur in these systems?

Rebecca: Yes, it appears that educators may be able to influence or achieve certain types of network patterns. While each MOOC is different (some course staff members tend to be much more engaged than others, learners may have different motivations, etc.), on the whole, there isn’t much real-time monitoring in MOOCs, and MOOC platforms are still in early days where there is little to no automated monitoring or feedback (beyond static analytics dashboards for instructors).

Ed: Does learner participation in these forums improve outcomes? Do the most central users in the interaction network perform better? And do they tend to interact with other very central people?

Rebecca: While we can’t infer causation, we found that when compared to the entire course, a significantly higher percentage of high achievers were also forum participants. The more likely explanation for this is that those who are committed to completing the course and performing well also tend to use the forums—but the plurality of forum participants (44% in one of the courses we analyzed) are actually those that “fail” by traditional marks (receive below 50% in the course). Indeed, many central users tend to be those that are simply auditing the course or who are interested in communicating with others without any intention of completing course assignments. These central users tend to communicate with other central users, but also, with those whose participation is much sparser / “on the fringes”.

Ed: Slightly facetiously: you can identify ‘central’ individuals in the network who spark and sustain interaction. Can you also find people who basically cause interaction to die? Who will cause the network to fall apart? And could you start to predict the strength of a network based on the profiles and proportions of the individuals who make it up?

Rebecca: It is certainly possible to explore further how different people influence these dynamics. One way this can be achieved is by exploring the temporal dynamics at play—e.g., by visualizing the communication network at any point in time and creating network “snapshots” every hour or day, or perhaps with every new participant, to observe how the trends and structures evolve. While this method still doesn’t allow us to identify the exact influence of any given individual’s participation (since there are so many other confounding factors, for example how far into the course it is, or people’s schedules and lives outside of the MOOC), it may provide some insight into their roles. We could of course define some quantitative measure(s) of “network strength” based on learner profiles, but caution against overarching or broad claims would be essential given these confounding forces.

Ed: The majority of my own interactions are mediated by a keyboard: which is actually a pretty inefficient way of communicating, and certainly a terrible way of arguing through a complex point. Is there any sense from MOOCs that text-based communication might be a barrier to some forms of interaction, or learning?

Rebecca: This is an excellent observation. Given the global student body, varying levels of comfort in English (and written language more broadly), differing preferences for communication, etc., there is much reason to believe that a lack of participation could result from a lack of comfort with the keyboard (or written communication more generally). Indeed, in the MOOCs we’ve studied, many learners have attempted to meet up on Google Hangouts or other non-text based media to form and sustain study groups, suggesting that many learners seek to use alternative technologies to interact with others and achieve their learning objectives.

Ed: Based on this data and analysis, are there any obvious design points that might improve interaction efficiency and learning outcomes in these platforms?

Rebecca: As I have mentioned already, open-ended questions that focus on real-life case studies tend to promote the least vulnerable and most “efficient” discussions, which may be of interest to practitioners looking to cultivate these sorts of environments. More broadly, the lack of sustained participation in the forums suggests that there are a number of “forces of disengagement” at play, one of them being that the sheer amount of content being generated in the discussion spaces (one course had over 2,700 threads and 15,600 posts) could be contributing to a sense of “content overload” and helplessness for learners. Designing platforms that help mitigate this problem will be fundamental to the vitality and effectiveness of these learning spaces in the future.

Ed: I suppose there is an inherent tension between making the online environment very smooth and seductive, and the process of learning; which is often difficult and frustrating: the very opposite experience aimed for (eg) by games designers. How do MOOCs deal with this tension? (And how much gamification is common to these systems, if any?)

Rebecca: To date, gamification seems to have been sparse in most MOOCs, although there are some interesting experiments in the works. Indeed, one study (Anderson et al., 2014) used a randomized controlled trial to add badges (that indicate student engagement levels) next to the names of learners in MOOC discussion spaces in order to determine if and how this affects further engagement. Coursera has also started to publicly display badges next to the names of learners who have signed up for the paid Signature Track of a specific course (presumably, to signal which learners are “more serious” about completing the course than others). As these platforms become more social (and perhaps career advancement-oriented), it’s quite possible that gamification will become more popular. This gamification may not ease the process of learning or make it more comfortable, but rather offer additional opportunities to mitigate the challenges of massive-scale anonymity and lack of information about peers, in order to facilitate more social learning.

Ed: How much of this work is applicable to other online environments that involve thousands of people exploring and interacting together: for example deliberation, crowd production and interactive gaming, which certainly involve quantifiable interactions and a degree of negotiation and learning?

Rebecca: Since MOOCs are so loosely structured and could largely be considered “informal” learning spaces, we believe the engagement dynamics we’ve found could apply to a number of other large-scale informal learning/interactive spaces online. Similar crowd-like structures can be found in a variety of policy and practice settings.

Ed: This project has adopted a mixed methods approach: what have you gained by this, and how common is it in the field?

Rebecca: Combining computational network analysis and machine learning with qualitative content analysis and in-depth interviews has been one of the greatest strengths of this work, and a great learning opportunity for the research team. Often in empirical research, it is important to validate findings across a variety of methods to ensure that they’re robust. Given the complexity of human subjects, we knew computational methods could only go so far in revealing underlying trends; and given the scale of the dataset, we knew there were patterns that qualitative analysis alone would not enable us to detect. A mixed-methods approach enabled us to simultaneously and robustly address these dimensions. MOOC research to date has been quite interdisciplinary, bringing together computer scientists, educationists, psychologists, statisticians, and a number of other areas of expertise into a single domain. The interdisciplinarity of research in this field is arguably one of the most exciting indicators of what the future might hold.

Ed: As well as the network analysis, you also carried out interviews with MOOC participants. What did you learn from them that wasn’t obvious from the digital trace data?

Rebecca: The interviews were essential to this investigation. In addition to confirming the trends identified by our computational explorations (which revealed the what of the underlying dynamics at play), the interviews revealed much of the why. In particular, we learned people’s motivations for participating in (or disengaging from) the discussion forums, which provided an important backdrop for subsequent quantitative (and qualitative) investigations. We have also learned a lot more about people’s experiences of learning, the strategies they employ to support their learning, and issues around power and inequality in MOOCs.

Ed: You handcoded more than 6000 forum posts in one of the MOOCs you investigated. What findings did this yield? How would you characterise the learning and interaction you observed through this content analysis?

Rebecca: The qualitative content analysis of over 6,500 posts revealed several key insights. For one, we confirmed (as the network analysis suggested) that most discussion is insignificant “noise”—people looking to introduce themselves or have short-lived discussions about topics that are beyond the scope of the course. In a few instances, however, we discovered distinct patterns (and sometimes cycles) of knowledge construction that can occur within a specific discussion thread. In some cases, we found that discussion threads grew so long (running to hundreds of posts) that topics were repeated or earlier posts disregarded, because new participants didn’t read and/or consider them before adding their own replies.

Ed: How are you planning to extend this work?

Rebecca: As mentioned already, feelings of helplessness resulting from sheer “content overload” in the discussion forums appear to be a key force of disengagement. To that end, as we now have a preliminary understanding of communication dynamics and learner tendencies within these sorts of learning environments, we now hope to leverage this background knowledge to develop new methods for promoting engagement and the fulfilment of individual learning objectives in these settings—in particular, by trying to mitigate the “content overload” issues in some way. Stay tuned for updates 🙂

References

Anderson, A., Huttenlocher, D., Kleinberg, J. and Leskovec, J. (2014) Engaging with Massive Open Online Courses. In: WWW ’14: Proceedings of the 23rd International World Wide Web Conference, Seoul, Korea. New York: ACM.

Read the full paper: Gillani, N., Yasseri, T., Eynon, R., and Hjorth, I. (2014) Structural limitations of learning in a crowd – communication vulnerability and information diffusion in MOOCs. Scientific Reports 4.


Rebecca Eynon was talking to blog editor David Sutcliffe.

Rebecca Eynon holds a joint academic post between the Oxford Internet Institute (OII) and the Department of Education at the University of Oxford. Her research focuses on education, learning and inequalities, and she has carried out projects in a range of settings (higher education, schools and the home) and life stages (childhood, adolescence and late adulthood).

What are the limitations of learning at scale? Investigating information diffusion and network vulnerability in MOOCs https://ensr.oii.ox.ac.uk/what-are-the-limitations-of-learning-at-scale-investigating-information-diffusion-and-network-vulnerability-in-moocs/ Tue, 21 Oct 2014 11:48:51 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2796 Millions of people worldwide are currently enrolled in courses provided on large-scale learning platforms (aka ‘MOOCs’), typically collaborating in online discussion forums with thousands of peers. Current learning theory emphasizes the importance of this group interaction for cognition. However, while a lot is known about the mechanics of group learning in smaller and traditionally organized online classrooms, fewer studies have examined participant interactions when learning “at scale”. Some studies have used clickstream data to trace participant behaviour; even predicting dropouts based on their engagement patterns. However, many questions remain about the characteristics of group interactions in these courses, highlighting the need to understand whether — and how — MOOCs allow for deep and meaningful learning by facilitating significant interactions.

But what constitutes a “significant” learning interaction? In large-scale MOOC forums, with socio-culturally diverse learners with different motivations for participating, this is a non-trivial problem. MOOCs are best defined as “non-formal” learning spaces, where learners pick and choose how (and if) they interact. This kind of group membership, together with the short-term nature of these courses, means that relatively weak inter-personal relationships are likely. Many of the tens of thousands of interactions in the forum may have little relevance to the learning process. So can we actually define the underlying network of significant interactions? Only once we have done this can we explore firstly how information flows through the forums, and secondly the robustness of those interaction networks: in short, the effectiveness of the platform design for supporting group learning at scale.

To explore these questions, we analysed data from 167,000 students registered on two business MOOCs offered on the Coursera platform. Almost 8000 students contributed around 30,000 discussion posts over the six weeks of the courses; almost 30,000 students viewed at least one discussion thread, totalling 321,769 discussion thread views. We first modelled these communications as a social network, with nodes representing students who posted in the discussion forums, and edges (ie links) indicating co-participation in at least one discussion thread. Of course, not all links will be equally important: many exchanges will be trivial (‘hello’, ‘thanks’ etc.). Our task, then, was to derive a “true” network of meaningful student interactions (ie iterative, consistent dialogue) by filtering out those links generated by random encounters (Figure 1; see also full paper for methodology).

Figure 1. Comparison of observed (a; ‘all interactions’) and filtered (b; ‘significant interactions’) communication networks for a MOOC forum. Filtering affects network properties such as modularity score (ie degree of clustering). Colours correspond to the automatically detected interest communities.
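To make the filtering step concrete, here is a minimal Python sketch (using the networkx library) that builds a co-participation network from invented forum data and applies a crude stand-in for the filtering, keeping only pairs who co-post in at least two threads. The thread and student names and the threshold are illustrative assumptions; the paper itself uses a statistical filtering procedure rather than a simple cut-off.

```python
# Sketch: build an observed co-participation network and a crudely
# "filtered" network of repeated interactions (illustrative only).
from collections import defaultdict
from itertools import combinations

import networkx as nx

# Hypothetical (thread_id, student_id) pairs extracted from forum data.
posts = [
    ("t1", "alice"), ("t1", "bob"), ("t1", "carol"),
    ("t2", "alice"), ("t2", "bob"),
    ("t3", "alice"), ("t3", "bob"), ("t3", "dave"),
]

threads = defaultdict(set)
for thread_id, student in posts:
    threads[thread_id].add(student)

# Count how many threads each pair of students shares.
co_participation = defaultdict(int)
for members in threads.values():
    for a, b in combinations(sorted(members), 2):
        co_participation[(a, b)] += 1

# Observed network: any co-participation creates an edge.
G_all = nx.Graph(list(co_participation))

# "Significant" network: keep only pairs who shared at least two threads
# (a stand-in for the paper's statistical filtering).
G_sig = nx.Graph([pair for pair, n in co_participation.items() if n >= 2])

print(G_all.number_of_edges(), "observed edges;", G_sig.number_of_edges(), "significant edges")
```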
One feature of networks that has been studied in many disciplines is their vulnerability to fragmentation when nodes are removed (the Internet, for example, emerged from US Army research aiming to develop a disruption-resistant network for critical communications). While we aren’t interested in the effect of a missile strike on MOOC exchanges, from an educational perspective it is still useful to ask which “critical set” of learners is mostly responsible for information flow in a communication network — and what would happen to online discussions if these learners were removed. To our knowledge, this is the first time the vulnerability of communication networks has been explored in an educational setting.

Network vulnerability is interesting because it indicates how integrated and inclusive the communication flow is. Discussion forums with fleeting participation will have only a very few vocal participants: removing these people from the network will markedly reduce the information flow between the other participants — as the network falls apart, it simply becomes more difficult for information to travel across it via linked nodes. Conversely, forums that encourage repeated engagement and in-depth discussion among participants will have a larger ‘critical set’, with discussion distributed across a wide range of learners.

To understand the structure of group communication in the two courses, we looked at how quickly our modelled communication network fell apart when: (a) the most central nodes were iteratively disconnected (Figure 2; blue), compared with when (b) nodes were removed at random (ie the ‘neutral’ case; green). In the random case, the network degrades evenly, as expected. When we selectively remove the most central nodes, however, we see rapid disintegration: indicating the presence of individuals who are acting as important ‘bridges’ across the network. In other words, the network of student interactions is not random: it has structure.
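The sketch below shows the general shape of such a robustness experiment on a toy clustered graph, tracking the size of the largest connected component as nodes are removed either by betweenness centrality or at random. The centrality measure, the toy network and the stopping rule are illustrative assumptions, not necessarily the set-up used in the paper.

```python
# Sketch: compare network degradation under targeted vs random node removal.
import random
import networkx as nx

def giant_component_fraction(G):
    """Fraction of remaining nodes in the largest connected component."""
    if G.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(G), key=len)
    return len(largest) / G.number_of_nodes()

def degradation_curve(G, targeted=True, seed=42):
    """Giant-component fraction after each successive node removal."""
    H = G.copy()
    rng = random.Random(seed)
    curve = [giant_component_fraction(H)]
    while H.number_of_nodes() > 1:
        if targeted:
            # Remove the current best 'bridge' (highest betweenness centrality).
            node = max(nx.betweenness_centrality(H).items(), key=lambda kv: kv[1])[0]
        else:
            node = rng.choice(list(H.nodes()))
        H.remove_node(node)
        curve.append(giant_component_fraction(H))
    return curve

G = nx.connected_caveman_graph(4, 6)   # toy stand-in for a clustered forum network
print("targeted:", [round(x, 2) for x in degradation_curve(G, targeted=True)[:6]])
print("random:  ", [round(x, 2) for x in degradation_curve(G, targeted=False)[:6]])
```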

Figure 2. Rapid network degradation results from removal of central nodes (blue). This indicates the presence of individuals acting as ‘bridges’ between sub-groups. Removing these bridges results in rapid degradation of the overall network. Removal of random nodes (green) results in a more gradual degradation.

Of course, the structure of participant interactions will reflect the purpose and design of the particular forum. We can see from Figure 3 that different forums in the courses have different vulnerability thresholds. Forums with high levels of iterative dialogue and knowledge construction — with learners sharing ideas and insights about weekly questions, strategic analyses, or course outcomes — are the least vulnerable to degradation. A relatively high proportion of nodes have to be removed before the network falls apart (rightmost blue line). Forums where most individuals post once to introduce themselves and then move their discussions to other platforms (such as Facebook) or cease engagement altogether tend to be more vulnerable to degradation (leftmost blue line). The different vulnerability thresholds suggest that different topics (and forum functions) promote different levels of forum engagement. Certainly, asking students open-ended questions tended to encourage significant discussions, leading to greater engagement and knowledge construction as they read analyses posted by their peers and commented with additional insights or critiques.

Figure 3 – Network vulnerabilities of different course forums.

Understanding something about the vulnerability of a communication or interaction network is important, because it will tend to affect how information spreads across it. To investigate this, we simulated an information diffusion model similar to that used to model social contagion. Although simplistic, the SI model (‘susceptible-infected’) is very useful in analyzing topological and temporal effects on networked communication systems. While the model doesn’t account for things like decaying interest over time or peer influence, it allows us to compare the efficiency of different network topologies.

We compared our (real-data) network model with a randomised network in order to see how well information would flow if the community structures we observed in Figure 2 did not exist. Figure 4 shows the number of ‘infected’ (or ‘reached’) nodes over time for both the real (solid lines) and randomised networks (dashed lines). In all the forums, we can see that information actually spreads faster in the randomised networks. This is explained by the existence of local community structures in the real-world networks: networks with dense clusters of nodes (i.e. a clumpy network) will result in slower diffusion than a network with a more even distribution of communication, where participants do not tend to favour discussions with a limited cohort of their peers.
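A bare-bones version of that comparison might look as follows; the infection probability, the toy clustered network and the degree-preserving edge swap used for the randomisation are illustrative assumptions rather than the study's actual parameters.

```python
# Sketch: SI spreading on a clustered network vs a degree-preserving randomisation.
import random
import networkx as nx

def si_half_coverage_time(G, beta=0.2, seed=0, max_steps=10_000):
    """Steps needed for an SI process to reach half of the nodes (None if never)."""
    rng = random.Random(seed)
    infected = {rng.choice(list(G.nodes()))}
    target = G.number_of_nodes() / 2
    for t in range(1, max_steps + 1):
        newly = {v for u in infected for v in G.neighbors(u)
                 if v not in infected and rng.random() < beta}
        infected |= newly
        if len(infected) >= target:
            return t
    return None

G_real = nx.connected_caveman_graph(5, 6)   # clumpy toy network
G_rand = G_real.copy()
# Shuffle links while preserving node degrees and connectivity.
nx.connected_double_edge_swap(G_rand, nswap=5 * G_rand.number_of_edges())

print("clustered network:", si_half_coverage_time(G_real))
print("randomised network:", si_half_coverage_time(G_rand))
```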

Figure 4 (a) shows the percentage of infected nodes vs. simulation time for different networks. The solid lines show the results for the original network and the dashed lines for the random networks. (b) shows the time it took for a simulated “information packet” to come into contact with half the network’s nodes.

Overall, these results reveal an important characteristic of student discussion in MOOCs: when it comes to significant communication between learners, there are simply too many discussion topics and too much heterogeneity (ie clumpiness) to result in truly global-scale discussion. Instead, most information exchange, and by extension, any knowledge construction in the discussion forums occurs in small, short-lived groups: with information “trapped” in small learner groups. This finding is important as it highlights structural limitations that may impact the ability of MOOCs to facilitate communication amongst learners that look to learn “in the crowd”.

These insights into the communication dynamics motivate a number of important questions about how social learning can be better supported, and facilitated, in MOOCs. They certainly suggest the need to leverage intelligent machine learning algorithms to support the needs of crowd-based learners; for example, in detecting different types of discussion and patterns of engagement during the runtime of a course to help students identify and engage in conversations that promote individualized learning. Without such interventions the current structural limitations of social learning in MOOCs may prevent the realization of a truly global classroom.

The next post addresses qualitative content analysis and how machine-learning community detection schemes can be used to infer latent learner communities from the content of forum posts.

Read the full paper: Gillani, N., Yasseri, T., Eynon, R., and Hjorth, I. (2014) Structural limitations of learning in a crowd – communication vulnerability and information diffusion in MOOCs. Scientific Reports 4.


Rebecca Eynon holds a joint academic post between the Oxford Internet Institute (OII) and the Department of Education at the University of Oxford. Her research focuses on education, learning and inequalities, and she has carried out projects in a range of settings (higher education, schools and the home) and life stages (childhood, adolescence and late adulthood).

The life and death of political news: using online data to measure the impact of the audience agenda https://ensr.oii.ox.ac.uk/the-life-and-death-of-political-news-using-online-data-to-measure-the-impact-of-the-audience-agenda/ Tue, 09 Sep 2014 07:04:47 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2879
Image of the Telegraph’s state of the art “hub and spoke” newsroom layout by David Sim.
The political agenda has always been shaped by what the news media decide to publish — through their ability to broadcast to large, loyal audiences in a sustained manner, news editors have the ability to shape ‘political reality’ by deciding what is important to report. Traditionally, journalists pass a pool of potential stories to their editors; editors then choose which stories to publish. However, with the increasing importance of online news, editors must now decide not only what to publish and where, but how long it should remain prominent and visible to the audience on the front page of the news website.

The question of how much influence the audience has in these decisions has always been ambiguous. While in theory we might expect journalists to be attentive to readers, journalism has also been characterized as a profession with a “deliberate…ignorance of audience wants” (Anderson, 2011b). This ‘anti-populism’ is still often portrayed as an important journalistic virtue, in the context of telling people what they need to hear, rather than what they want to hear. Recently, however, attention has been turning to the potential impact that online audience metrics are having on journalism’s “deliberate ignorance”. Online publishing provides a huge amount of information to editors about visitor numbers, visit frequency, and what visitors choose to read and how long they spend reading it. Online editors now have detailed information about what articles are popular almost as soon as they are published, with these statistics frequently displayed prominently in the newsroom.

The rise of audience metrics has created concern both within the journalistic profession and academia, as part of a broader set of concerns about the way journalism is changing online. Many have expressed concern about a ‘culture of click’, whereby important but unexciting stories make way for more attention grabbing pieces, and editorial judgments are overridden by traffic statistics. At a time when media business models are under great strain, the incentives to follow the audience are obvious, particularly when business models increasingly rely on revenue from online traffic and advertising. The consequences for the broader agenda-setting function of the news media could be significant: more prolific or earlier readers might play a disproportionate role in helping to select content; particular social classes or groupings that read news online less frequently might find their issues being subtly shifted down the agenda.

The extent to which such a populist influence exists has attracted little empirical research. Many ethnographic studies have shown that audience metrics are being captured in online newsrooms, with anecdotal evidence for the importance of traffic statistics on an article’s lifetime (Anderson 2011b, MacGregor, 2007). However, many editors have emphasised that popularity is not a major determining factor (MacGregor, 2007), and that news values remain significant in terms of placement of news articles.

In order to assess the possible influence of audience metrics on decisions made by political news editors, we undertook a systematic, large-scale study of the relationship between readership statistics and article lifetime. We examined the news cycles of five major UK news outlets (the BBC, the Daily Telegraph, the Guardian, the Daily Mail and the Mirror) over a period of six weeks, capturing their front pages every 15 minutes, resulting in over 20,000 front-page captures and more than 40,000 individual articles. We measured article readership by capturing information from the BBC’s “most read” list of news articles (twelve percent of the articles were featured at some point on the ‘most read’ list, with a median time to achieving this status of two hours, and an average article life of 15 hours on the front page). Using the Cox Proportional Hazards model (which allows us to quantify the impact of an article’s appearance on the ‘most read’ list on its chance of survival), we asked whether an article’s being listed in a ‘most read’ column affected the length of time it remained on the front page.
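For readers curious about what that modelling step looks like in practice, here is a hedged sketch using the Python lifelines library. The toy rows and the column names (hours_on_front_page, removed, most_read, is_political) are invented for illustration and are not the study's dataset or covariates.

```python
# Sketch: Cox Proportional Hazards model of article 'survival' on the front page.
import pandas as pd
from lifelines import CoxPHFitter  # pip install lifelines

articles = pd.DataFrame({
    "hours_on_front_page": [5, 15, 27, 8, 40, 12],
    "removed": [1, 1, 1, 1, 0, 1],       # 0 = still on the page when capture ended (censored)
    "most_read": [0, 1, 0, 1, 1, 0],     # ever appeared on the 'most read' list
    "is_political": [1, 1, 0, 0, 1, 0],
})

cph = CoxPHFitter()
cph.fit(articles, duration_col="hours_on_front_page", event_col="removed")
cph.print_summary()  # a hazard ratio below 1 for 'most_read' implies a longer front-page life
```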

We found that ‘most read’ articles had, on average, a 26% lower chance of being removed from the front page than equivalent articles which were not on the most read list, providing support for the idea that online editors are influenced by readership statistics. In addition to assessing the general impact of readership statistics, we also wanted to see whether this effect differs between ‘political’ and ‘entertainment’ news. Research on participatory journalism has suggested that online editors might be more willing to allow audience participation in areas of soft news such as entertainment, arts, sports, etc. We find a small amount of evidence for this claim, though the difference between the two categories was very slight.

Finally, we wanted to assess whether there is a ‘quality’ / ‘tabloid’ split. Part of the definition of tabloid-style journalism lies precisely in its willingness to follow the demands of its audience. However, we found the audience ‘effect’ (surprisingly) to be most obvious in the quality papers. For tabloids, ‘most read’ status actually had a slightly negative effect on article lifetime. We wouldn’t argue that tabloid editors actively reject the wishes of their audience; however, we can say that these editors are no more likely to follow their audience than the typical ‘quality’ editor, and in fact may be less so. We do not have a clear explanation for this difference, though we could speculate that, as tabloid publications are already more tuned in to the wishes of their audience, the appearance of readership statistics makes less practical difference to the overall product. However, it may also simply be the case that the online environment is slowly producing new journalistic practices for which the tabloid / quality distinction will be less useful.

So on the basis of our study, we can say that high-traffic articles do in fact spend longer in the spotlight than ones that attract less readership: audience readership does have a measurable impact on the lifespan of political news. The audience is no longer the unknown quantity it was in offline journalism: it appears to have a clear impact on journalistic practice. The question that remains, however, is whether this constitutes evidence of a new ‘populism’ in journalism; or whether it represents (as editors themselves have argued) the simple striking of a balance between audience demands and news values.

Read the full article: Bright, J., and Nicholls, T. (2014) The Life and Death of Political News: Measuring the Impact of the Audience Agenda Using Online Data. Social Science Computer Review 32 (2) 170-181.

References

Anderson, C. W. (2011) Between creative and quantified audiences: Web metrics and changing patterns of newswork in local US newsrooms. Journalism 12 (5) 550-566.

MacGregor, P. (2007) Tracking the Online Audience. Journalism Studies 8 (2) 280-298.


OII Research Fellow Jonathan Bright is a political scientist specialising in computational and ‘big data’ approaches to the social sciences. His major interest concerns studying how people get information about the political process, and how this is changing in the internet era.

Tom Nicholls is a doctoral student at the Oxford Internet Institute. His research interests include the impact of technology on citizen/government relationships, the Internet’s implications for public management and models of electronic public service delivery.

How easy is it to research the Chinese web? https://ensr.oii.ox.ac.uk/how-easy-is-it-to-research-the-chinese-web/ Tue, 18 Feb 2014 11:05:57 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2418
Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediary and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?

Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediaries and tools. All these factors have direct impacts on the quality and usability of data. Government control takes many forms and serves many intentions, and it increasingly extends beyond the websites inside mainland China under Chinese jurisdiction to the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to the “American dream”); it therefore requires case-by-case research to determine the extent and level of government control and intervention. Based on my own research on Chinese user-generated encyclopaedias and on Chinese-language Twitter and Weibo, the expectation is that control and intervention by Beijing will most likely target political and cultural topics, rather than economic or entertainment ones.

This observation is linked to how various forms of government control and intervention are executed, which often requires massive data and human operations to filter, categorise and produce content, operations that are often based on keywords. This is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.

Of course there is an extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu in Hong Kong University manage to scrape data samples from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done for previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).

It follows that the availability of data intermediaries and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ), with annual revenue near 200 million RMB. Hence, in addition to the ongoing considerations around big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).

Given the fact that the government’s control of information on the Chinese Web involves not only the marginalization (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.

Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?

Han-Teng: Identifying geographic information from Chinese Web data, like other Web data, can largely be done by geo-IP (a straightforward IP-to-geographic-location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used by Hong Kong and Taiwan). Again, like the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, the level of openness, and the availability of data intermediaries and tools.
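As a toy illustration of those signals, the sketch below guesses a region from a URL's top-level domain and guesses the script from a handful of hand-picked simplified/traditional character pairs. The character sets and URLs are invented examples; real pipelines would use geo-IP databases and full simplified-to-traditional conversion tables.

```python
# Sketch: crude region and script identification for Chinese Web data.
from urllib.parse import urlparse

TLD_REGION = {".cn": "mainland China", ".hk": "Hong Kong", ".tw": "Taiwan"}

SIMPLIFIED_ONLY = set("国说汉书东门")    # illustrative characters only
TRADITIONAL_ONLY = set("國說漢書東門")   # their traditional counterparts

def region_from_url(url):
    """Map a URL to a region using its top-level domain (very rough)."""
    host = urlparse(url).hostname or ""
    for tld, region in TLD_REGION.items():
        if host.endswith(tld):
            return region
    return "unknown / other"

def script_guess(text):
    """Guess simplified vs traditional script from marker characters."""
    s = sum(ch in SIMPLIFIED_ONLY for ch in text)
    t = sum(ch in TRADITIONAL_ONLY for ch in text)
    if s > t:
        return "simplified (mainland-style)"
    if t > s:
        return "traditional (HK/TW-style)"
    return "undetermined"

print(region_from_url("http://news.example.com.cn/article"))   # mainland China
print(script_guess("中國的網絡環境"))                            # traditional (HK/TW-style)
```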

Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of the Chinese blogosphere with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).

I myself found distinct geographic foci and linguistic preferences in the online citations of Baidu Baike and Chinese Wikipedia (Liao 2008). Other research, based on a sample of traffic data, shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of users’ geographic and linguistic differences (Taneja and Wu 2013). Although I am not fully persuaded by their argument that the Great Firewall has very limited impact on this single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information from Chinese Web data to better understand the dynamics of Chinese online interactions, which are by no means limited to within China or behind the Great Firewall.

Ed: In terms of online monitoring of public opinion, is it possible to identify robots / “50 cent party” — that is, what proportion of the “opinion” actually has a government source?

Han-Teng: There exist research efforts to identify robot comments by analysing the patterns and content of comments, and the relationships between their authors’ profiles and other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach, such as narrative analysis for well-defined propaganda research (for example, pro- and anti-Falun Gong opinions), it might be easier to categorise and visualise the dynamics and then trace the origins of dominant keywords and narratives to identify the sources of the loudest messages. I personally think such research and analytical efforts require both deep technical knowledge and a cultural-political understanding of Chinese Web data, preferably within an integrated mixed-methods research design that incorporates the quantitative and qualitative methods required for the data question at hand.

Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?

Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behaviour mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline is the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties”. Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is thus delivered.

Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially institutions in Hong Kong and Berkeley to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalization of information and/or narratives”. It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalization and prioritisation, treating censorship and data deletion as one mechanism of information marginalization in the age of data abundance and limited attention.

Ed: Has anyone tried to produce a map of censorship: ie mapping absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?

Han-Teng: Mapping censorship has been done through experiments (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here the availability of data intermediaries, such as the WeiboScope in Hong Kong University, and unblocked alternatives, such as Chinese Wikipedia, serve as direct and indirect points of comparison to see what is being, or is most likely to be, deleted. As I am more interested in mapping information marginalization (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extent of information censorship and marginalization. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down the undesired messages and voicing up the desired ones, researchers do not necessarily have to get hold of the deleted content from the websites inside mainland China. They can simply reuse the plentiful Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative studies.

Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?

Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it’s only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese-language texts, although several of the major challenges become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been major actors in, and a major goal for, a multilingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries that are designed to process Chinese texts smoothly. For instance, many analytical software packages and tools depend on or require space characters as word boundaries, a condition that does not apply to Chinese texts.
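The sketch below illustrates the word-boundary problem: whitespace tokenisation returns a Chinese sentence as one unbroken token, whereas a dedicated segmenter splits it into words. jieba is used here purely as an example of an open-source segmenter, not as an endorsement of any particular tool.

```python
# Sketch: whitespace tokenisation vs dedicated Chinese word segmentation.
import jieba  # pip install jieba

sentence = "舆情监测是一个快速发展的行业"   # "public opinion monitoring is a fast-growing industry"

print(sentence.split())       # ['舆情监测是一个快速发展的行业'] (a single unbroken token)
print(jieba.lcut(sentence))   # e.g. ['舆情', '监测', '是', '一个', '快速', '发展', '的', '行业']
```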

In addition, there are technical and interpretative challenges in analysing Chinese text datasets with mixed scripts (e.g. simplified and traditional Chinese) or with other languages mixed in. Mandarin Chinese is not the only language inside China; there are indications that Cantonese and Shanghainese have a significant presence. Minority languages such as Tibetan, Mongolian and Uyghur are also still used on official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell “Chinese stories” in many of the world’s major languages, sometimes in direct competition with their political opponents, such as Falun Gong.

These areas of the “Chinese Web” data remain unexplored territory for systematic research, which will require more tools and methods that are similar to the toolkits of multi-lingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the “Chinese Web”, but are rather a general challenge to the “Web” that is becoming increasingly multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as Weiboscope and others, for a better future of the Chinese Web data ecology.

References

Bamman, D., O’Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3-5).

Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.

Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.

Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.

Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy, Georgetown University.

King, G., Pan, J., & Roberts, M. (2012). How censorship in China allows government criticism but silences collective expression. In APSA 2012 Annual Meeting Paper.

Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.

Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.

Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799

Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.

Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0

Taneja, H., & Wu, A. X. (2013). How Does the Great Firewall of China Affect Online User Behavior? Isolated “Internets” as Culturally Defined Markets on the WWW. Presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

Zhu, T., Bronk, C., & Wallach, D. S. (2011). An Analysis of Chinese Search Engine Filtering. arXiv:1107.3794.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2012). Tracking and Quantifying Censorship on a Chinese Microblogging Site. arXiv:1211.6166.


Han-Teng was talking to blog editor David Sutcliffe.

Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.

Mapping collective public opinion in the Russian blogosphere https://ensr.oii.ox.ac.uk/mapping-collective-public-opinion-in-the-russian-blogosphere/ Mon, 10 Feb 2014 11:30:05 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2372
Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs – an amount far beyond the capacities of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence.

One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.

Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behaviour and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.

Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.

There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie by traditional content analysis), and the other involves rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal’s 2.8 million blog accounts, for example, generate 90,000 posts daily.

For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.
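As a minimal illustration of this kind of automated "discovery", the sketch below fits a two-topic LDA model to four invented snippets using scikit-learn; the documents, topic number and preprocessing are placeholders, and a real blog-post pipeline would need language-specific tokenisation and far more text.

```python
# Sketch: topic 'discovery' with latent Dirichlet allocation (LDA).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "election fraud protest moscow street rally",
    "protest rally police arrest election observers",
    "winter holidays travel prices flights hotels",
    "hotel booking travel winter snow holidays",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top_terms}")   # clusters of lexically similar words, labelled later by researchers
```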

The computer science literature is flooded with methodological papers on automatic analysis of big textual data. While these methods can’t entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and are the only means to monitor agendas and attitudes across multiple sources, over long periods and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their recentness, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists, and to our knowledge, topic modeling has not previously been applied for the extraction of agendas from blogs in any social science research.

The natural extension of automated topic or issue extraction involves sentiment mining and analysis; as Gonzalez-Bailon, Kaltenbrunner, and Banches (2012) have pointed out, public opinion doesn’t just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by dividing sentences into positive and negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.
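A bare-bones dictionary-based scorer of the kind described above might look like the following; the word lists are invented stand-ins for a human-coded emotive dictionary, and production systems also handle negation, morphology and context.

```python
# Sketch: dictionary-based sentiment scoring (toy word lists).
POSITIVE = {"honest", "fair", "hope", "support"}
NEGATIVE = {"fraud", "corrupt", "protest", "angry"}

def sentiment_score(text):
    """Return a score between -1 (negative) and +1 (positive)."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)

print(sentiment_score("angry crowds protest election fraud"))   # -1.0
print(sentiment_score("hope for fair and honest elections"))    #  1.0
```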

Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.

Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.

Read the full paper: Olessia Koltsova and Sergei Koltcov (2013) Mapping the public agenda with topic modeling: The case of the Russian livejournal. Policy and Internet 5 (2) 207–227.

Also read on this blog: Can text mining help handle the data deluge in public policy analysis? by Aude Bicquelet.

References

González-Bailón, S., A. Kaltenbrunner, and R.E. Banches. 2012. “Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions,” Human Communication Research 38 (2): 121–43.

Technological innovation and disruption was a big theme of the WEF 2014 in Davos: but where was government? https://ensr.oii.ox.ac.uk/technological-innovation-disruption-was-big-theme-wef-2014-davos-but-where-was-government/ Thu, 30 Jan 2014 11:23:09 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2536
The World Economic Forum engages business, political, academic and other leaders of society to shape global, regional and industry agendas. Image by World Economic Forum.

Last week, I was at the World Economic Forum in Davos, the first time that the Oxford Internet Institute has been represented there. Being closeted in a Swiss ski resort with 2,500 of the great, the good and the super-rich provided me with a good chance to see what the global elite are thinking about technological change and its role in ‘The Reshaping of the World: Consequences for Society, Politics and Business’, the stated focus of the WEF Annual Meeting in 2014.

What follows are those impressions that relate to public policy and the internet, and reflect only my own experience there. Outside the official programme there are whole hierarchies of breakfasts, lunches, dinners and other events, most of which a newcomer to Davos finds difficult to discover, and some of which require one to be at least a president of a small to medium-sized state — or Matt Damon.

There was much talk of hyperconnectivity, spirals of innovation, S-curves and exponential growth of technological diffusion, digitalization and disruption. As you might expect, the pace of these was emphasized most by those participants from the technology industry. The future of work in the face of leaps forward in robotics was a key theme, drawing on the new book by Eric Brynjolfsson and Andrew McAfee, The Second Machine Age: Work, Progress and Prosperity in a Time of Brilliant Technologies, which is just out in the US. There were several sessions on digital health and the eventual fruition of decades of pilots in telehealth (a banned term now, apparently), as applications based on mobile technologies start to be used more widely. Indeed, all delegates were presented with a ‘Jawbone’ bracelet which tracks the wearer’s exercise and sleep patterns (7,801 steps so far today). And of course there was much talk about the possibilities afforded by big data, if not quite as much as I expected.

The University of Oxford was represented in an ‘Ideas Lab’, convened by the Oxford Martin School on Data, Machines and the Human Factor. This format involves each presenter talking for five minutes in front of their 15 selected images rolling at 20 seconds each, with no control over the timing (described by the designer of the format before the session as ‘waterboarding for academics’, due to the conciseness and brevity required — and I can vouch for that). It was striking how much synergy there was in the presentations by the health engineer Lionel Tarassenko (talking about developments in digital healthcare in the home), the astrophysicist Chris Lintott (on crowdsourcing of science) and myself talking about collective action and mobilization in the social media age. We were all talking about the new possibilities that the internet and social media afford for citizens to contribute to healthcare, scientific knowledge and political change. Indeed, I was surprised that the topics of collective action and civic engagement, probably not traditional concerns of Davos, attracted widespread interest, including a session on ‘The New Citizen’ with the founders of Avaaz.

Of course there was some discussion of the Snowden revelations of the data crawling activities of the US NSA and UK GCHQ, and the privacy implications. A dinner on ‘the Digital Me’ generated an interesting discussion on privacy in the age of social media, reflecting a growing and welcome (to me anyway) pragmatism with respect to issues often hotly contested. As one participant put it, in an age of imperfect, partial information, we become used to the idea that what we read on Facebook is often, through its relation to the past, irrelevant to the present time and not to be taken into consideration when (for example) considering whether to offer someone a job. The wonderful danah boyd gave some insight from her new book It’s Complicated: the social lives of networked teens, from which emerged a discussion of a ‘taxonomy of privacy’ and the importance of considering the use to which data is put, as opposed to just the possession or collection of the data – although this could be dangerous ground, in the light of the Snowden revelations.

There was more talk of the future than the past. I participated in one dinner discussion of the topic of ‘Rethinking Living’ in 50 years time, a timespan challenged by Google Chairman Eric Schmidt’s argument earlier in the day that five years was an ‘infinite’ amount of time in the current speed of technological innovation. The after dinner discussion was surprisingly fun, and at my table at least we found ourselves drawn back to the past, wondering if the rise of preventative health care and the new localism that connectivity affords might look like a return to the pre-industrial age. When it came to the summing up and drawing out the implications for government, I was struck how most elements of any trajectory of change exposed a growing disconnect between citizens, or business, on the one hand – and government on the other.

This was the one topic that for me was notably absent from WEF 2014; the nature of government in this rapidly changing world, in spite of the three pillars — politics, society, and business — of the theme of the conference noted above. At one lunch convened by McKinsey that was particularly ebullient regarding the ceaseless pace of technological change, I pointed out that government was only at the beginning of the S-curve, or perhaps that such a curve had no relevance for government. Another delegate asked how the assembled audience might help government to manage better here, and another pointed out that globally, we were investing less and less in government at a time when it needed more resources, including far higher remuneration for top officials. But the panellists were less enthusiastic to pick up on these points.

As I have discussed previously on this blog and elsewhere, we are in an era where governments struggle to innovate technologically or to incorporate social media into organizational processes, where digital services lag far behind those of business, where the big data revolution is passing government by (apart from open data, which is not the same thing as big data, see my Guardian blog post on this issue). Pockets of innovation like the UK Government Digital Service push for governmentwide change, but we are still seeing major policy initiatives such as Obama’s healthcare plans in the US or Universal Credit in the UK flounder on technological grounds. Yet there were remarkably few delegates at the WEF representing the executive arm of government, particularly for the UK. So on the relationship between government and citizens in an age of rapid technological change, it was citizens – rather than governments – and, of course, business (given the predominance of CEOs) that received the attention of this high-powered crowd.

At the end of the ‘Rethinking Living’ dinner, a participant from another table said to me that in contrast to the participants from the technology industry, he thought 50 years was a rather short time horizon. As a landscape architect, designing with trees that take 30 years to grow, he had no problem imagining how things would look on this timescale. It occurred to me that there could be an analogy here with government, which likewise could take this kind of timescale to catch up with the technological revolution. But by that time, technology will have moved on and it may be that governments cannot afford this relaxed pace catching up with their citizens and the business world. Perhaps this should be a key theme for future forums.


Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics.

Edit wars! Measuring and mapping society’s most controversial topics https://ensr.oii.ox.ac.uk/edit-wars-measuring-mapping-societys-most-controversial-topics/ Tue, 03 Dec 2013 08:21:43 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2339 Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?

Taha: Yes we did … actually, we have shown that controversy measures based on “controversial” flags are not inclusive at all: although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other’s contributions. We focused specifically on mutual reverts between pairs of editors, and we assigned a maturity score to each editor based on the total volume of their previous contributions. While counting the mutual reverts, we gave more weight to those committed by (or on) editors with higher maturity scores, as a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
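As a rough illustration of that idea (and not the exact published formula), the sketch below finds mutually reverting editor pairs in some invented revert data, weights each pair by the edit count of its less experienced member, and scales the total by the number of editors involved.

```python
# Sketch: a simplified mutual-revert controversy score for one article.
# Hypothetical (reverter, reverted) events and total edit counts ("maturity").
reverts = [("anna", "ben"), ("ben", "anna"), ("anna", "ben"),
           ("carl", "dora"), ("dora", "carl"),
           ("ben", "emma")]            # one-way only, so not a mutual revert
edit_counts = {"anna": 3500, "ben": 900, "carl": 40, "dora": 12, "emma": 5000}

revert_set = set(reverts)
mutual_pairs = {frozenset((a, b)) for (a, b) in revert_set if (b, a) in revert_set}

# Weight each mutually reverting pair by the maturity of its less experienced editor.
weight = sum(min(edit_counts[a], edit_counts[b]) for a, b in map(tuple, mutual_pairs))
editors_involved = {editor for pair in mutual_pairs for editor in pair}
controversy = len(editors_involved) * weight

print(len(mutual_pairs), "mutual-revert pairs, controversy score:", controversy)
```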

Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?

Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. And when you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia. Even if the controversy is detected and flagged in English Wikipedia, it might not be in the smaller language editions. Our model is of course independent of the size and editorial conventions of different language editions.

Ed: Were there any differences in the way conflicts arose / were resolved in the different language versions?

Taha: We found the main differences to lie in the topics of the controversial articles. Although some topics, like religion and politics, are debated globally, there are many topics which are controversial only in a single language edition. This reflects the local preferences and the importance assigned to topics by different editorial communities. The way editorial wars start and, more importantly, fade to consensus also differs between language editions. In some languages moderators intervene very early, while in others the war might go on for a long time without any moderation.

Ed: In general, what were the most controversial topics in each language? And overall?

Taha: Generally, religion, politics, and geographical places like countries and cities (sometimes even villages) are the topics of debate. But each language edition also has its own focus: football in Spanish and Portuguese, animations and TV series in Chinese and Japanese, sex- and gender-related topics in Czech, and science and technology in French Wikipedia are very often behind editing wars.

Ed: What other quantitative studies of this sort of conflict (i.e. over knowledge and points of view) are there?

Taha: My favourite work is one by researchers from Barcelona Media Lab. In their paper Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia, they provide quantitative evidence that editors interested in political topics identify more strongly as Wikipedians than as political activists, even though they try hard to reflect their opinions and political orientations in the articles they contribute to. And I think that’s the key issue here. While there are lots of debates and editorial wars between editors, in the end what really counts for most of them is Wikipedia as a whole project, and the concept of shared knowledge. It might explain how Wikipedia really works despite all the diversity among its editors.

Ed: How would you like to extend this work?

Taha: Of course some of the controversial topics change over time. While Jesus might remain a controversial figure for a long time, I’m sure the article on President (W) Bush will soon reach a consensus and most likely disappear from the list of the most controversial articles. In the current study we examined the aggregated data from the inception of each Wikipedia edition up to March 2010. One possible extension that we are working on now is to study the dynamics of these controversy lists and the positions of topics within them.

Read the full paper: Yasseri, T., Spoerri, A., Graham, M. and Kertész, J. (2014) The most controversial topics in Wikipedia: A multilingual and geographical analysis. In: P.Fichman and N.Hara (eds) Global Wikipedia: International and cross-cultural issues in online collaboration. Scarecrow Press.


Taha was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

The physics of social science: using big data for real-time predictive modelling https://ensr.oii.ox.ac.uk/physics-of-social-science-using-big-data-for-real-time-predictive-modelling/ Thu, 21 Nov 2013 09:49:27 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2320 Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced over decades (or centuries). And this is all due to recent advances in ICTs. Despite the short period for which big data have been available, their use in different sectors, including academia and business, has been significant. However, in many cases the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have rarely been used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data exceeds that of similar box office-prediction models based on Twitter data. This can partially be explained by the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. The edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you can measure on Twitter, which is mainly the reaction of users after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia: the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of a movie among Wikipedia editors, the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.). In the next step we combined these measures into a linear model that assumes the more popular the movie, the larger these values become. However, this model needs both training and calibration. We calibrated the model using IMDb data on the financial success of a set of ‘training’ movies. After calibration, we applied the model to a set of “test” movies and (luckily) saw that it worked very well in predicting their financial success.
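
As an illustration of this kind of calibrated linear model, the sketch below combines a few Wikipedia activity measures and fits them to known box office takings for a training set, then predicts takings for unseen movies. It is not the authors’ code: the feature values and figures are invented, and scikit-learn’s LinearRegression stands in for whatever fitting procedure was actually used.

```python
# Minimal sketch of the kind of model described above (not the authors' code):
# combine Wikipedia activity measures linearly, calibrate the weights on the
# known box office takings of a "training" set of movies, then predict for a
# "test" set. All values below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: number of edits, number of unique editors, page views (per movie)
X_train = np.array([
    [120, 35, 250_000],
    [ 40, 10,  60_000],
    [300, 80, 900_000],
    [ 15,  5,  20_000],
])
# box office takings (USD) for the training movies, e.g. taken from IMDb
y_train = np.array([30e6, 8e6, 95e6, 2e6])

model = LinearRegression().fit(X_train, y_train)   # calibration step

# activity measures for unseen "test" movies
X_test = np.array([
    [200, 60, 500_000],
    [ 25,  8,  35_000],
])
print(model.predict(X_test))   # predicted box office takings
```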

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days, if someone wants to go and watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate with the box office takings so significantly.

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions about the popularity of movies or movements in financial markets; however, to the best of my knowledge, this was the first time that Wikipedia data had been used to feed such models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. Now I’m running two other projects using a similar approach; one to predict election outcomes and the other to do opinion mining about new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information-seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes over time as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However, the challenge would be the calibration. Human societies and social systems — like most complex systems — are non-stationary, which means that any statistical property of the system is subject to abrupt and dramatic changes. That makes it challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive or Bayesian models that modify themselves as the system evolves and more data become available. All of this could be done in real time, and that’s the exciting part of the method.
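
To make the idea of an adaptive model concrete, here is a generic sketch of an online estimator with a “forgetting factor” that discounts older observations, so the calibration drifts along with a non-stationary system. It is purely illustrative and is not the model used in the movie study or the other projects mentioned here.

```python
# Illustrative sketch of adaptive (online) calibration: re-estimate a single
# scaling coefficient between an activity signal and an outcome as new data
# arrive, discounting older observations with a "forgetting factor".
# This is a generic example, not the model used in the studies discussed here.

def adaptive_coefficient(activity, outcomes, forgetting=0.9):
    """Return the sequence of re-calibrated coefficients for outcome ~ c * activity."""
    num, den, coefs = 0.0, 0.0, []
    for x, y in zip(activity, outcomes):
        # discount the accumulated statistics, then add the newest observation
        num = forgetting * num + x * y
        den = forgetting * den + x * x
        coefs.append(num / den if den else 0.0)
    return coefs

# toy data where the true relationship drifts from c = 2 to c = 4 half-way through
activity = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = [2, 4, 6, 8, 20, 24, 28, 32]
print(adaptive_coefficient(activity, outcomes))  # estimates drift from ~2 towards ~4
```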

Ed: As a physicist; what are you learning in a social science department? And what does physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way up to cosmology. However, studying social systems with the tools of the natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes of (and variations between) the elements of the systems under study, outliers, self-awareness, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I have daydreams about the database of search queries from all Internet users worldwide, at the individual level. These data are collected continuously by search engines and could technically be accessed, but due to privacy policies it’s impossible to get hold of them, even if only for research purposes. This is another difference between social systems and natural systems. An atom never gets upset at being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.


Taha Yasseri was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

Five recommendations for maximising the relevance of social science research for public policy-making in the big data era https://ensr.oii.ox.ac.uk/five-recommendations-for-maximising-the-relevance-of-social-science-research-for-public-policy-making-in-the-big-data-era/ https://ensr.oii.ox.ac.uk/five-recommendations-for-maximising-the-relevance-of-social-science-research-for-public-policy-making-in-the-big-data-era/#comments Mon, 04 Nov 2013 10:30:30 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2196 As I discussed in a previous post on the promises and threats of big data for public policy-making, public policy making has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means citizens and governments leave digital traces that can be harvested to generate big data. This increasingly rich data environment poses both promises and threats to policy-makers.

So how can social scientists help policy-makers in this changed environment, ensuring that social science research remains relevant? Social scientists have a good record of policy influence; indeed, in the UK, a better one than other academic fields, including medicine, as recent research from the LSE Public Policy Group has shown. Big data hold major promise for social science, which should enable us to further extend our record in policy research. We have access to a cornucopia of data of a kind which is more like that traditionally associated with so-called ‘hard’ science. Rather than being dependent on surveys, the traditional data staple of empirical social science, social media such as Wikipedia, Twitter, Facebook, and Google Search present us with the opportunity to scrape, generate, analyse and archive comparative data of unprecedented quantity. For example, at the OII over the last four years we have been generating a dataset of all petition signing in the UK and US, which contains the joining rate (updated every hour) for the 30,000 petitions created in the last three years. As a political scientist, I am very excited by this kind of data (up to now, we have had big data like this only for voting, and that only at election time), which will allow us to create a complete ecology of petition signing, one of the more popular acts of political participation in the UK. Likewise, we can look at the entire transaction history of online organizations like Wikipedia, or map the link structure of government’s online presence.

But big data holds threats for social scientists too. The technological challenge is ever present. To generate their own big data, researchers and students must learn to code, and for some that is an alien skill. At the OII we run a course on Digital Social Research that all our postgraduate students can take; but not all social science departments could either provide such a course, or persuade their postgraduate students that they needed it. Ours, who study the social science of the Internet, are obviously predisposed to do so. And big data analysis requires multi-disciplinary expertise. Our research team working on petitions data includes a computer scientist (Scott Hale), a physicist (Taha Yasseri) and a political scientist (myself). I can’t imagine doing this sort of research without such technical expertise, and as a multi-disciplinary department we are (reasonably) free to recruit this type of research faculty. But not all social science departments can promise a research career for computer scientists, or physicists, or any of the other disciplinary specialists that might be needed to tackle big data problems.

Five Recommendations for Social Scientists

So, how can social scientists overcome these challenges, and thereby be in a good position to help policy-makers tackle their own barriers to making the most of the possibilities afforded by big data? Here are five recommendations:

Accept that multi-disciplinary research teams are going to become the norm for social science research, extending beyond social science disciplines into the life sciences, mathematics, physics, and engineering. At Policy and Internet’s 2012 Big Data conference, the keynote speaker Duncan Watts (physicist turned sociologist) called for a ‘dating agency’ for engineers and social scientists – with the former providing the technological expertise, and the latter identifying the important research questions. We need to make sure that forums exist where social scientists and technologists meet and discuss big data research at the earliest stages, so that research projects and programmes incorporate the core competencies of both.

We need to provide the normative and ethical basis for policy decisions in the big data era. That means bringing normative political theorists and philosophers of information into our research teams. The government has committed £65 million to big data research funding, but it seems likely that any successful research proposals will have a strong ethics component embedded in the research programme, rather than an ethics add-on or afterthought.

Training in data science. Many leading US universities are now admitting undergraduates to data science courses, but these courses lack social science input. Of the 20 US master’s courses in big data analytics compiled by Information Week, nearly all came from computer science or informatics departments. Social science research training needs to incorporate coding and analysis skills of the kind these courses provide, but with a social science focus. If we as social scientists leave the training to computer scientists, we will find that the new cadre of data scientists tends to leave out social science concerns or questions.

Bringing policy makers and academic researchers together to tackle the challenges that big data present. Last month the OII and Policy and Internet convened a workshop at Harvard on Responsible Research Agendas for Public Policy in the Big Data Era, which included various leading academic researchers in the government and big data field, and government officials from the Census Bureau, the Federal Reserve Board, the Bureau of Labor Statistics, and the Office of Management and Budget (OMB). The discussions revealed that there is a continual procession of major events on big data in Washington DC (usually with a corporate or scientific research focus) to which US federal officials are invited, but also how few of these are really dedicated to tackling the distinctive issues that face government agencies such as those represented around the table.

Taking forward theoretical development in social science, incorporating big data insights. I recently spoke at the Oxford Analytica Global Horizons conference, at a session on Big Data. One of the few policy-makers (in proportion to corporate representatives) in the audience asked the panel: “Where is the theory?” As social scientists, we need to respond to that question, and fast.


This post is based on discussions at the workshop Responsible Research Agendas for Public Policy in the Era of Big Data, and at the Political Studies Association event Why Universities Matter: How Academic Social Science Contributes to Public Policy Impact, held at the LSE on 26 September 2013.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in e-government and digital era governance and politics, investigating the nature and implications of relationships between governments, citizens and the Internet and related digital technologies in the UK and internationally.

The promises and threats of big data for public policy-making https://ensr.oii.ox.ac.uk/promises-threats-big-data-for-public-policy-making/ https://ensr.oii.ox.ac.uk/promises-threats-big-data-for-public-policy-making/#comments Mon, 28 Oct 2013 15:07:29 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2299 The environment in which public policy is made has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means both citizens and governments leave digital traces that can be harvested to generate big data. Policy-making takes place in an increasingly rich data environment, which poses both promises and threats to policy-makers.

On the promise side, such data offers a chance for policy-making and implementation to be more citizen-focused, taking account of citizens’ needs, preferences and actual experience of public services, as recorded on social media platforms. As citizens express policy opinions on social networking sites such as Twitter and Facebook; rate or rank services or agencies on government applications such as NHS Choices; or enter discussions on the burgeoning range of social enterprise and NGO sites, such as Mumsnet, 38 degrees and patientopinion.org, they generate a whole range of data that government agencies might harvest to good use. Policy-makers also have access to a huge range of data on citizens’ actual behaviour, as recorded digitally whenever citizens interact with government administration or undertake some act of civic engagement, such as signing a petition.

Data mined in this way from social media or administrative operations can also enable government agencies to monitor – and improve – their own performance, for example through usage logs of their own electronic presence or transactions recorded on internal information systems, which are increasingly interlinked. And they can use data from social media for self-improvement, by understanding what people are saying about government, and which policies, services or providers are attracting negative opinions and complaints, enabling identification of a failing school, hospital or contractor, for example. They can solicit such data via their own sites, or those of social enterprises. And they can find out what people are concerned about or looking for, from the Google Search API or Google Trends, which record the search patterns of a huge proportion of internet users.

As for threats, big data is technologically challenging for government, particularly those governments which have always struggled with large-scale information systems and technology projects. The UK government has long been a world leader in this regard and recent events have only consolidated its reputation. Governments have long suffered from information technology skill shortages and the complex skill sets required for big data analytics pose a particularly acute challenge. Even in the corporate sector, over a third of respondents to a recent survey of business technology professionals cited ‘Big data expertise is scarce and expensive’ as their primary concern about using big data software.

And there are particular cultural barriers to government in using social media, with the informal style and blurring of organizational and public-private boundaries which they engender. And gathering data from social media presents legal challenges, as companies like Facebook place barriers to the crawling and scraping of their sites.

More importantly, big data presents new moral and ethical dilemmas to policy makers. For example, it is possible to carry out probabilistic policy-making, where policy is made on the basis of what a small segment of individuals will probably do, rather than what they have done. Predictive policing has had some success, particularly in California, where robberies declined by a quarter after use of the ‘PredPol’ policing software, but it can lead to a “feedback loop of injustice”, as one privacy advocacy group put it, as policing resources are targeted at increasingly small socio-economic groups. What responsibility does the state have to devote disproportionately more – or less – resources to the education of those school pupils who are, probabilistically, almost certain to drop out of secondary education? Such challenges are greater for governments than corporations. We (reasonably) happily trade privacy to allow Tesco and Facebook to use our data on the basis it will improve their products, but if government tries to use social media to understand citizens and improve its own performance, will it be accused of spying on its citizenry in order to quash potential resistance?

And of course there is an image problem for government in this field – discussion of big data and government puts the word ‘big’ dangerously close to the word ‘government’ and that is an unpopular combination. Policy-makers’ responses to Snowden’s revelations of the US Prism and UK Tempora programmes have done nothing to improve this image, with their focus on the use of big data to track down individuals and groups involved in acts of terrorism and criminality – rather than on anything to make policy-making better, or to use the wealth of information that these programmes collect for the public good.

However, policy-makers have no choice but to tackle some of these challenges. Big data has been the hottest trend in the corporate world for some years now, and commentators from IBM to the New Yorker are starting to talk about the big data ‘backlash’. Government has been far slower to recognize the advantages for policy-making and services. But in some policy sectors, big data poses very fundamental questions which call for an answer: how should governments conduct a census, or produce labour statistics, for example, in the age of big data? Policy-makers will need to move fast to beat the backlash.


This post is based on discussions at the Responsible Research Agendas for Public Policy in the Era of Big Data workshop.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics.

Can text mining help handle the data deluge in public policy analysis? https://ensr.oii.ox.ac.uk/can-text-mining-help-handle-data-deluge-public-policy-analysis/ Sun, 27 Oct 2013 12:29:01 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2273 Policy makers today must contend with two inescapable phenomena. On the one hand, there has been a major shift in the policies of governments concerning participatory governance – that is, engaged, collaborative, and community-focused public policy. At the same time, a significant proportion of government activities have now moved online, bringing about “a change to the whole information environment within which government operates” (Margetts 2009, 6).

Indeed, the Internet has become the main medium of interaction between government and citizens, and numerous websites offer opportunities for online democratic participation. The Hansard Society, for instance, regularly runs e-consultations on behalf of UK parliamentary select committees. For example, e-consultations have been run on the Climate Change Bill (2007), the Human Tissue and Embryo Bill (2007), and on domestic violence and forced marriage (2008). Councils and boroughs also regularly invite citizens to take part in online consultations on issues affecting their area. The London Borough of Hammersmith and Fulham, for example, recently asked its residents for their views on Sex Entertainment Venues and Sex Establishment Licensing policy.

However, citizen participation poses certain challenges for the design and analysis of public policy. In particular, governments and organizations must demonstrate that all opinions expressed through participatory exercises have been duly considered and carefully weighted before decisions are reached. One method for partly automating the interpretation of large quantities of online content typically produced by public consultations is text mining. Software products currently available range from those primarily used in qualitative research (integrating functions like tagging, indexing, and classification), to those integrating more quantitative and statistical tools, such as word frequency and cluster analysis (more information on text mining tools can be found at the National Centre for Text Mining).

While these methods have certainly attracted criticism and skepticism in terms of the interpretability of the output, they offer four important advantages for the analyst: namely categorization, data reduction, visualization, and speed.

1. Categorization. When analyzing the results of consultation exercises, analysts and policymakers must make sense of the high volume of disparate responses they receive; text mining supports the structuring of large amounts of this qualitative, discursive data into predefined or naturally occurring categories by storage and retrieval of sentence segments, indexing, and cross-referencing. Analysis of sentence segments from respondents with similar demographics (eg age) or opinions can itself be valuable, for example in the construction of descriptive typologies of respondents.

2. Data Reduction. Data reduction techniques include stemming (reduction of a word to its root form), combining of synonyms, and removal of non-informative “tool” or stop words. Hierarchical classifications, cluster analysis, and correspondence analysis methods allow the further reduction of texts to their structural components, highlighting the distinctive points of view associated with particular groups of respondents (a short code sketch of these reduction steps follows this list).

3. Visualization. Important points and interrelationships are easy to miss when reading by eye, and the rapid generation of visual overviews of responses (eg dendrograms, 3D scatter plots, heat maps, etc.) makes large and complex datasets easier to comprehend in terms of identifying the main points of view and dimensions of a public debate.

4. Speed. Speed depends on whether a special dictionary or vocabulary needs to be compiled for the analysis, and on the amount of coding required. Coding is usually relatively fast and straightforward, and the succinct overview of responses provided by these methods can reduce the time needed to analyse consultation responses.
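
As a concrete illustration of the reduction and categorization steps above, the sketch below stems a handful of invented consultation responses, removes stop words, and clusters them with k-means using NLTK and scikit-learn. It is a generic example, not any of the specific text mining products mentioned.

```python
# Generic sketch of the reduction steps described above (stemming, stop-word
# removal, clustering), using NLTK and scikit-learn. The example responses are
# invented, and this is not any particular consultation-analysis product.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.cluster import KMeans

responses = [
    "Licensing hours should be restricted near schools",
    "Restrict the licensing conditions close to local schools",
    "The council wastes money on consultations",
    "Consultation exercises are a waste of council money",
]

stemmer = PorterStemmer()

def stem_tokens(text):
    # crude tokenisation, stop-word removal, and stemming; real tools use proper tokenisers
    return [stemmer.stem(tok) for tok in text.lower().split()
            if tok not in ENGLISH_STOP_WORDS]

# term weighting on the reduced vocabulary, then k-means clustering of responses
vectoriser = TfidfVectorizer(tokenizer=stem_tokens)
X = vectoriser.fit_transform(responses)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # responses grouped by broad point of view, e.g. [0 0 1 1]
```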

Despite the above advantages of automated approaches to consultation analysis, text mining methods present several limitations. Automatic classification of responses runs the risk of missing or miscategorising distinctive or marginal points of view if sentence segments are too short, or if they rely on a rare vocabulary. Stemming can also generate problems if important semantic variations are overlooked (eg lumping together ‘ill+ness’, ‘ill+defined’, and ‘ill+ustration’). Other issues applicable to public e-consultation analysis include the danger that analysts distance themselves from the data, especially when converting words to numbers. This is quite apart from the issues of inter-coder reliability and data preparation, missing data, and insensitivity to figurative language, meaning and context, which can also result in misclassification when not human-verified.

However, when responding to criticisms of specific tools, we need to remember that different text mining methods are complementary, not mutually exclusive. A single solution to the analysis of qualitative or quantitative data would be very unlikely; and at the very least, exploratory techniques provide a useful first step that could be followed by a theory-testing model, or by triangulation exercises to confirm results obtained by other methods.

Apart from these technical issues, policy makers and analysts employing text mining methods for e-consultation analysis must also consider certain ethical issues in addition to those of informed consent, privacy, and confidentiality. First (of relevance to academics), respondents may not expect to end up as research subjects. They may simply be expecting to participate in a general consultation exercise, interacting exclusively with public officials and not indirectly with an analyst post hoc; much less ending up as a specific, traceable data point.

This has been a particularly delicate issue for healthcare professionals. Sharf (1999, 247) describes various negative experiences of following up online postings: one woman, on being contacted by a researcher seeking consent to gain insights from breast cancer patients about their personal experiences, accused the researcher of behaving voyeuristically and “taking advantage of people in distress.” Statistical interpretation of responses also presents its own issues, particularly if analyses are to be returned or made accessible to respondents.

Respondents might also be confused about or disagree with text mining as a method applied to their answers; indeed, it could be perceived as dehumanizing – reducing personal opinions and arguments to statistical data points. In a public consultation, respondents might feel somewhat betrayed that their views and opinions eventually result in just a dot on a correspondence analysis with no immediate, apparent meaning or import, at least in lay terms. Obviously the consultation organizer needs to outline clearly and precisely how qualitative responses can be collated into a quantifiable account of a sample population’s views.

This is an important point; in order to reduce both technical and ethical risks, researchers should ensure that their methodology combines both qualitative and quantitative analyses. While many text mining techniques provide useful statistical output, the UK Government’s prescribed Code of Practice on public consultation is quite explicit on the topic: “The focus should be on the evidence given by consultees to back up their arguments. Analyzing consultation responses is primarily a qualitative rather than a quantitative exercise” (2008, 12). This suggests that the perennial debate between quantitative and qualitative methodologists needs to be updated and better resolved.

References

Margetts, H. 2009. “The Internet and Public Policy.” Policy & Internet 1 (1).

Sharf, B. 1999. “Beyond Netiquette: The Ethics of Doing Naturalistic Discourse Research on the Internet.” In Doing Internet Research, ed. S. Jones, London: Sage.


Read the full paper: Bicquelet, A., and Weale, A. (2011) Coping with the Cornucopia: Can Text Mining Help Handle the Data Deluge in Public Policy Analysis? Policy & Internet 3 (4).

Dr Aude Bicquelet is a Fellow in LSE’s Department of Methodology. Her main research interests include computer-assisted analysis, Text Mining methods, comparative politics and public policy. She has published a number of journal articles in these areas and is the author of a forthcoming book, “Textual Analysis” (Sage Benchmarks in Social Research Methods, in press).

Can Twitter provide an early warning function for the next pandemic? https://ensr.oii.ox.ac.uk/can-twitter-provide-an-early-warning-function-for-the-next-flu-pandemic/ Mon, 14 Oct 2013 08:00:41 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1241
Communication of risk in any public health emergency is a complex task for healthcare agencies; a task made more challenging when citizens are bombarded with online information. Mexico City, 2009. Image by Eneas.


Ed: Could you briefly outline your study?

Patty: We investigated the role of Twitter during the 2009 swine flu pandemic from two perspectives. Firstly, we demonstrated the role of the social network in detecting an upcoming spike in an epidemic before the official surveillance systems – up to a week earlier in the UK and up to 2-3 weeks earlier in the US – by investigating users who “self-diagnosed” themselves, posting tweets such as “I have flu / swine flu”. Secondly, we illustrated how online resources reporting the WHO declaration of a “pandemic” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3].

Ed: Disease control agencies already routinely follow media sources; are public health agencies  aware of social media as another valuable source of information?

Patty:  Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing  additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies.

Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services.

Ed: How difficult is this data to process? Eg: is this a full sample, processed in real-time?

Patty: No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we accessed data from Twitter using the search API, querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (the massive real-time stream of all public tweets) is made available via the streaming API. The searches have to be performed in real time, as historical Twitter data are normally available only through paid services. Twitter analytics methods are diverse; in our study, we used frequency calculations, developed algorithms for geo-location and automatic spam and duplication detection, and applied time series analysis and cross-correlation with surveillance data [1,2,3].
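
To make that last step concrete, the fragment below cross-correlates a daily series of “self-diagnosis” tweet counts with official case counts at different lags, which is one simple way to see how far a social media signal leads a surveillance signal. The numbers are invented and this is not the pipeline used in the study.

```python
# Illustrative sketch (invented data): correlate a daily series of
# self-diagnosis tweet counts with official surveillance case counts at
# different lags, to see how many days the Twitter signal leads the official one.
import numpy as np

tweets = np.array([5,  9, 20, 45, 80, 60, 30, 15, 10,  8], dtype=float)
cases  = np.array([2,  3,  6, 10, 22, 48, 78, 55, 28, 14], dtype=float)

def lagged_correlation(x, y, lag):
    """Pearson correlation between x shifted 'lag' days earlier and y."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]

for lag in range(0, 4):
    print(lag, round(lagged_correlation(tweets, cases, lag), 2))
# the lag with the highest correlation suggests how far tweets lead reported cases
```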

Ed: What’s the relationship between traditional and social media in terms of diffusion of health information? Do you have a sense that one may be driving the other?

Patty: This is a fundamental question: does media coverage of a certain topic cause a buzz on social media, or does social media discussion cause a media frenzy? This was particularly important to investigate for the 2009 swine flu pandemic, which experienced unprecedented media interest. While it could be assumed that disease cases preceded media coverage, or that media discussion sparked public interest causing Twitter debate, neither proved to be the case in our experiment. On some days media coverage of flu was higher, and on others Twitter discussion was higher; but the peaks seemed synchronized – happening on the same days.

Ed: In terms of communicating accurate information, does the Internet make the job easier or more difficult for health authorities?

Patty: The communication of risk in any public health emergency is a complex task for government and healthcare agencies; this task is made more challenging when citizens are bombarded with online information from a variety of sources that vary in accuracy. It has become even more challenging with the increase in users accessing health-related information on their mobile phones (17% in 2010 and 31% in 2012, according to the US Pew Internet study).

Our findings from analyzing Twitter reaction to online media coverage of the WHO declaration of swine flu as a “pandemic” (stage 6) on 11 June 2009, which unquestionably was the most media-covered event during the 2009 epidemic, indicated that Twitter does favour reputable sources (such as the BBC, which was by far the most popular) but also that bogus information can still leak into the network.

Ed: What differences do you see between traditional and social media, in terms of eg bias / error rate of public health-related information?

Patty: Fully understanding the quality of media coverage of health topics such as the 2009 swine flu pandemic, in terms of bias and medical accuracy, would require a qualitative study (for example, one conducted by Duncan in the EU [4]). However, the main role of social media, and in particular Twitter due to the 140-character limit, is to disseminate media coverage by propagating links rather than creating primary health information about a particular event. In our study around 65% of the tweets analysed contained a link.

Ed: Google flu trends (which monitors user search terms to estimate worldwide flu activity) has been around a couple of years: where is that going? And how useful is it?

Patty: Search companies such as Google have demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick (Google Flu Trends); however, in 2013 the system “drastically overestimated peak flu levels”, as reported by Nature. Most importantly, unlike Twitter, Google search queries remain proprietary and are therefore not useful for research or the construction of non-commercial applications.

Ed: What are implications of social media monitoring for countries that may want to suppress information about potential pandemics?

Patty: Event-based surveillance and the monitoring of social media for epidemic intelligence are of particular importance in countries with sub-optimal surveillance systems and those lacking the capacity for outbreak preparedness and response. Secondly, user-generated information on social media is of particular importance in countries with limited freedom of the press, or those that actively try to suppress information about potential outbreaks.

Ed: Would it be possible with this data to follow spread geographically, ie from point sources, or is population movement too complex to allow this sort of modelling?

Patty: Spatio-temporal modelling is technically possible, as tweets are time-stamped and there is support for geo-tagging. However, not all tweets can be precisely located; early warning systems will improve in accuracy as geo-tagging of user-generated content becomes more widespread. Mathematical modelling of the spread of diseases and of population movements are very topical research challenges (undertaken, for example, by Colizza et al. [5]), but modelling social media user behaviour during health emergencies to provide a robust baseline for early disease detection remains a challenge.

Ed: A strength of monitoring social media is that it follows what people do already (eg search / Tweet / update statuses). Are there any mobile / SNS apps to support collection of epidemic health data? eg a sort of ‘how are you feeling now’ app?

Patty: The strength of early warning systems using social media lies exactly in the ability to piggy-back on existing user behaviour rather than having to recruit participants. However, there is a growing number of participatory surveillance systems that ask users to report their symptoms (web-based systems such as Flusurvey in the UK, and “Flu Near You” in the US, which also exists as a mobile app). While interest in self-reporting systems is growing, challenges include their reliability, user recruitment and long-term retention, and integration with public health services; these remain open research questions for the future. There is also potential for public health services to use social media in two ways – providing information over the networks rather than only collecting user-generated content. Social media could be used to provide evidence-based advice and personalized health information directly to affected citizens where and when they need it, thus effectively engaging them in the active management of their health.

References

[1.] M Szomszor, P Kostkova, C St Louis: Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemics, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, WI-IAT, Vol. 1, pp.320-323.

[2.]  Szomszor, M., Kostkova, P., de Quincey, E. (2010). #swineflu: Twitter Predicts Swine Flu Outbreak in 2009. M Szomszor, P Kostkova (Eds.): ehealth 2010, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 69, pages 18-26, 2011.

[3.] Ed de Quincey, Patty Kostkova Early Warning and Outbreak Detection Using Social Networking Websites: the Potential of Twitter, P Kostkova (Ed.): ehealth 2009, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 27, pages 21-24, 2010.

[4.] B Duncan. How the Media reported the first day of the pandemic H1N1) 2009: Results of EU-wide Media Analysis. Eurosurveillance, Vol 14, Issue 30, July 2009

[5.] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Med 4(1): e13. doi:10.1371/journal.pmed.0040013

Further information on this project and related activities can be found at: BMJ-funded scientific film: http://www.youtube.com/watch?v=_JNogEk-pnM ; Can Twitter predict disease outbreaks? http://www.bmj.com/content/344/bmj.e2353 ; 1st International Workshop on Public Health in the Digital Age: Social Media, Crowdsourcing and Participatory Systems (PHDA 2013): http://www.digitalhealth.ws/ ; Social networks and big data meet public health @ WWW 2013: http://www2013.org/2013/04/25/social-networks-and-big-data-meet-public-health/


Patty Kostkova was talking to blog editor David Sutcliffe.

Dr Patty Kostkova is a Principal Research Associate in eHealth at the Department of Computer Science, University College London (UCL) and held a Research Scientist post at the ISI Foundation in Italy. Until 2012, she was the Head of the City eHealth Research Centre (CeRC) at City University, London, a thriving multidisciplinary research centre with expertise in computer science, information science and public health. In recent years, she was appointed a consultant at WHO responsible for the design and development of information systems for international surveillance.

Researchers who were instrumental in this project include Ed de Quincey, Martin Szomszor and Connie St Louis.

Responsible research agendas for public policy in the era of big data https://ensr.oii.ox.ac.uk/responsible-research-agendas-for-public-policy-in-the-era-of-big-data/ Thu, 19 Sep 2013 15:17:01 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2164 Last week the OII went to Harvard. Against the backdrop of a gathering storm of interest around the potential of computational social science to contribute to the public good, we sought to bring together leading social science academics with senior government agency staff to discuss its public policy potential. Supported by the OII-edited journal Policy and Internet and its owners, the Washington-based Policy Studies Organization (PSO), this one-day workshop facilitated a thought-provoking conversation between leading big data researchers such as David Lazer, Brooke Foucault-Welles and Sandra Gonzalez-Bailon, e-government experts such as Cary Coglianese, Helen Margetts and Jane Fountain, and senior agency staff from US federal bureaus including Labor Statistics, Census, and the Office of Management and Budget.

It’s often difficult to appreciate the impact of research beyond the ivory tower, but what this productive workshop demonstrated is that policy-makers and academics share many similar hopes and challenges in relation to the exploitation of ‘big data’. Our motivations and approaches may differ, but insofar as the youth of the ‘big data’ concept explains the lack of common language and understanding, there is value in mutual exploration of the issues. Although it’s impossible to do justice to the richness of the day’s interactions, some of the most pertinent and interesting conversations arose around the following four issues.

Managing a diversity of data sources. In a world where our capacity to ask important questions often exceeds the availability of data to answer them, many participants spoke of the difficulties of managing a diversity of data sources. For agency staff this issue comes into sharp focus when available administrative data that is supposed to inform policy formulation is either incomplete or inadequate. Consider, for example, the challenge of regulating an economy in a situation of fundamental data asymmetry, where private sector institutions track, record and analyse every transaction, whilst the state only has access to far more basic performance metrics and accounts. Such asymmetric data practices also affect academic research, where once again private sector tech companies such as Google, Facebook and Twitter often offer access only to portions of their data. In both cases participants gave examples of creative solutions using merged or blended data sources, which raise significant methodological and also ethical difficulties which merit further attention. The Berkman Center’s Rob Faris also noted the challenges of combining ‘intentional’ and ‘found’ data, where the former allow far greater certainty about the circumstances of their collection.

Data dictating the questions. If participants expressed the need to expend more effort on getting the most out of available but diverse data sources, several also cautioned against the dangers of letting data availability dictate the questions that could be asked. As we’ve experienced at the OII, for example, the availability of Wikipedia or Twitter data means that questions of unequal digital access (to political resources, knowledge production etc.) can often be addressed through the lens of these applications or platforms. But these data can provide only a snapshot, and large questions of great social or political importance may not easily be answered through such proxy measurements. Similarly, big data may be very helpful in providing insights into policy-relevant patterns or correlations, such as identifying early indicators of seasonal diseases or neighbourhood decline, but seem ill-suited to answer difficult questions regarding, say, the efficacy of small-scale family interventions. Just because the latter are harder to answer using currently vogue-ish tools doesn’t mean we should cease to ask these questions.

Ethics. Concerns about privacy are frequently raised as a significant limitation of the usefulness of big data. Given that with two or more data sets even supposedly anonymous data subjects may be identified, the general consensus seems to be that ‘privacy is dead’. Whilst all participants recognised the importance of public debate around this issue, several academics and policy-makers expressed a desire to get beyond this discussion to a more nuanced consideration of appropriate ethical standards. Accountability and transparency are often held up as more realistic means of protecting citizens’ interests, but one workshop participant also suggested it would be helpful to encourage more public debate about acceptable and unacceptable uses of our data, to determine whether some uses might simply be deemed ‘off-limits’, whilst other uses could be accepted as offering few risks.

Accountability. Following on from this debate about the ethical limits of our uses of big data, discussion exposed the starkly differing standards to which government and academics (to say nothing of industry) are held accountable. As agency officials noted on several occasions it matters less what they actually do with citizens’ data, than what they are perceived to do with it, or even what it’s feared they might do. One of the greatest hurdles to be overcome here concerns the fundamental complexity of big data research, and the sheer difficulty of communicating to the public how it informs policy decisions. Quite apart from the opacity of the algorithms underlying big data analysis, the explicit focus on correlation rather than causation or explanation presents a new challenge for the justification of policy decisions, and consequently, for public acceptance of their legitimacy. As Greg Elin of Gitmachines emphasised, policy decisions are still the result of explicitly normative political discussion, but the justifiability of such decisions may be rendered more difficult given the nature of the evidence employed.

We could not resolve all these issues over the course of the day, but they served as pivot points for honest and productive discussion amongst the group. If nothing else, they demonstrate the value of interaction between academics and policy-makers in a research field where the stakes are set very high. We plan to reconvene in Washington in the spring.

*We are very grateful to the Policy Studies Organization (PSO) and the American Public University for their generous support of this workshop. The workshop “Responsible Research Agendas for Public Policy in the Era of Big Data” was held at the Harvard Faculty Club on 13 September 2013.

Also read: Big Data and Public Policy Workshop by Eric Meyer, workshop attendee and PI of the OII project Accessing and Using Big Data to Advance Social Science Knowledge.


Victoria Nash received her M.Phil in Politics from Magdalen College in 1996, after completing a First Class BA (Hons) Degree in Politics, Philosophy and Economics, before going on to complete a D.Phil in Politics from Nuffield College, Oxford University in 1999. She was a Research Fellow at the Institute of Public Policy Research prior to joining the OII in 2002. As Research and Policy Fellow at the OII, her work seeks to connect OII research with policy and practice, identifying and communicating the broader implications of OII’s research into Internet and technology use.

Predicting elections on Twitter: a different way of thinking about the data https://ensr.oii.ox.ac.uk/predicting-elections-on-twitter-a-different-way-of-thinking-about-the-data/ Sun, 04 Aug 2013 11:43:52 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1498 GOP presidential nominee Mitt Romney
GOP presidential nominee Mitt Romney, centre, waving to crowd, after delivering his acceptance speech on the final night of the 2012 Republican National Convention. Image by NewsHour.

Recently, there has been a lot of interest in the potential of social media as a means to understand public opinion. Driven by an interest in the potential of so-called “big data”, this development has been fuelled by a number of trends. Governments have been keen to create techniques for what they term “horizon scanning”, which broadly means searching for the indications of emerging crises (such as runs on banks or emerging natural disasters) online, and reacting before the problem really develops. Governments around the world are already committing massive resources to developing these techniques. In the private sector, big companies’ interest in brand management has fitted neatly with the potential of social media monitoring. A number of specialised consultancies now claim to be able to monitor and quantify reactions to products, interactions or bad publicity in real time.

It should therefore come as little surprise that, like other research methods before them, these new techniques are now crossing over into the competitive political space. Social media monitoring, which in theory can extract information from tweets and Facebook posts and quantify positive and negative public reactions to people, policies and events, has an obvious utility for politicians seeking office. Broadly, the process works like this: vast datasets relating to an election, often running into millions of items, are gathered from social media sites such as Twitter. These data are then analysed using natural language processing software, which automatically identifies qualities relating to candidates or policies and attributes a positive or negative sentiment to each item. Finally, these sentiments and other properties mined from the text are totalised, to produce an overall figure for public reaction on social media.
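As a rough illustration of that pipeline, the sketch below attaches a toy sentiment score to each tweet text and totalises scores per candidate. Everything in it is a hypothetical stand-in: the keyword lists, the classify_sentiment function and the data format are assumptions for illustration, not the software used by commercial monitoring firms.

```python
from collections import defaultdict

# Hypothetical keyword lists; real systems use trained NLP models, not string matching.
CANDIDATE_KEYWORDS = {
    "romney": ["romney", "mitt"],
    "santorum": ["santorum"],
}

def classify_sentiment(text):
    """Toy stand-in for a sentiment model: returns +1 (positive), 0 (neutral) or -1 (negative)."""
    positive = {"great", "win", "support", "surge"}
    negative = {"bad", "lose", "fail", "gaffe"}
    words = set(text.lower().split())
    return (len(words & positive) > len(words & negative)) - (len(words & negative) > len(words & positive))

def aggregate_sentiment(tweets):
    """Totalise per-candidate sentiment across a stream of tweet texts."""
    totals = defaultdict(int)
    for text in tweets:
        score = classify_sentiment(text)
        lowered = text.lower()
        for candidate, keywords in CANDIDATE_KEYWORDS.items():
            if any(k in lowered for k in keywords):
                totals[candidate] += score
    return dict(totals)

print(aggregate_sentiment(["Great night for Santorum", "Romney will lose Iowa"]))
# {'santorum': 1, 'romney': -1}
```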

These techniques have already been employed by the mainstream media to report on the 2010 British general election (when the country had its first leaders’ debate, an event ripe for this kind of research) and also in the 2012 US presidential election. This growing prominence led my co-author Mike Jensen of the University of Canberra and me to ask: exactly how useful are these techniques for predicting election results? In order to answer this question, we carried out a study on the Republican nomination contest in 2012, focused on the Iowa Caucus and Super Tuesday. Our findings are published in the current issue of Policy and Internet.

There are definite merits to this endeavour. US candidate selection contests are notoriously hard to predict with traditional public opinion measurement methods. This is because of the unusual and unpredictable make-up of the electorate. Voters are likely (to greater or lesser degrees depending on circumstances in a particular contest and election laws in the state concerned) to share a broadly similar outlook, so the electorate is harder for pollsters to model. Turnout can also vary greatly from one cycle to the next, adding an additional layer of unpredictability to the proceedings.

However, as any professional opinion pollster will quickly tell you, there is a big problem with trying to predict elections using social media. The people who use it are simply not like the rest of the population. In the case of the US, research from Pew suggests that only 16 per cent of internet users use Twitter, and while that figure goes up to 27 per cent of those aged 18-29, only 2 per cent of over 65s use the site. The proportion of the electorate that actually votes within those categories, however, is the inverse: over 65s vote at a relatively high rate compared to the 18-29 cohort. Furthermore, given that we know (from research such as Matthew Hindman’s The Myth of Digital Democracy) that only a very small proportion of people online actually create content on politics, those who are commenting on elections become an even more unusual subset of the population.

Thus (and I can say this as someone who does use social media to talk about politics!) we are looking at an unrepresentative sub-set (those interested in politics) of an unrepresentative sub-set (those using social media) of the population. This is hardly a good omen for election prediction, which relies on modelling the voting population as closely as possible. As such, it seems foolish to suggest that a simple aggregation of individual preferences can be equated with voting intentions.

However, in our article we suggest a different way of thinking about social media data, more akin to James Surowiecki’s idea of The Wisdom of Crowds. The idea here is that citizens commenting on social media should not be treated like voters, but rather as commentators, seeking to understand and predict emerging political dynamics. As such, the method we operationalized was more akin to an electoral prediction market, such as the Iowa Electronic Markets, than a traditional opinion poll.

We looked for two things in our dataset: sudden changes in the number of mentions of a particular candidate, and words that indicated momentum for a particular candidate, such as “surge”. Our ultimate finding was that this approach turned out to be a strong predictor. The former measure had a good relationship with Rick Santorum’s sudden surge in the Iowa caucus, although it also tended to disproportionately emphasise many of the less successful candidates, such as Michele Bachmann. The latter method, on the other hand, picked up the Santorum surge without generating false positives, a finding certainly worth further investigation.
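A simplified reconstruction of the two measures might look like the sketch below, which counts daily mentions per candidate, flags sudden jumps relative to the previous day, and separately counts tweets containing momentum words. The data format (date, candidate, text tuples), the threshold and the word list are assumptions for illustration, not the measures exactly as operationalised in the paper.

```python
from collections import Counter

MOMENTUM_WORDS = {"surge", "surging", "momentum"}  # assumed vocabulary

def daily_mentions(records):
    """records: iterable of (date, candidate, text). Returns {candidate: Counter of date -> mentions}."""
    counts = {}
    for date, candidate, _text in records:
        counts.setdefault(candidate, Counter())[date] += 1
    return counts

def spike_days(counts, threshold=2.0):
    """Flag days on which a candidate's mentions are at least `threshold` times the previous day's."""
    spikes = {}
    for candidate, series in counts.items():
        dates = sorted(series)
        spikes[candidate] = [day for prev, day in zip(dates, dates[1:])
                             if series[prev] and series[day] / series[prev] >= threshold]
    return spikes

def momentum_mentions(records):
    """Count tweets per candidate that contain at least one momentum word."""
    totals = Counter()
    for _date, candidate, text in records:
        if MOMENTUM_WORDS & set(text.lower().split()):
            totals[candidate] += 1
    return totals
```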

Our aim in the paper was to present new ways of thinking about election prediction through social media, going beyond the paradigm established by the dominance of opinion polling. Our results indicate that there may be some value in this approach.


Read the full paper: Michael J. Jensen and Nick Anstead (2013) Psephological investigations: Tweets, votes, and unknown unknowns in the Republican nomination process. Policy and Internet 5 (2) 161–182.

Dr Nick Anstead was appointed as a Lecturer in the LSE’s Department of Media and Communication in September 2010, with a focus on Political Communication. His research focuses on the relationship between existing political institutions and new media, covering such topics as the impact of the Internet on politics and government (especially e-campaigning), electoral competition and political campaigns, the history and future development of political parties, and political mobilisation and encouraging participation in civil society.

Dr Michael Jensen is a Research Fellow at the ANZSOG Institute for Governance (ANZSIG), University of Canberra. His research spans the subdisciplines of political communication, social movements, political participation, and political campaigning and elections. In the last few years, he has worked particularly with the analysis of social media data and other digital artefacts, contributing to the emerging field of computational social science.

Seeing like a machine: big data and the challenges of measuring Africa’s informal economies https://ensr.oii.ox.ac.uk/seeing-like-a-machine-big-data-and-the-challenges-of-measuring-africas-informal-economies/ Mon, 22 Jul 2013 12:01:11 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1878
The Juba Archives
State research capacity has been weakened since the 1980s. It is now hoped that the ‘big data’ generated by mobile phone use can shed light on African economic and social issues, but we must pay attention to what new technologies are doing to the bigger research environment. Image by Nicki Kindersley.

As Linnet Taylor’s recent post on this blog has argued, researchers are taking a growing interest in Africa’s big data. Linnet’s excellent post focused on what the profusion of big data might mean for privacy concerns and frameworks for managing personal data. My own research focuses on the implications of big (and open) data for knowledge about Africa; specifically, economic knowledge.

As an introduction, it might be helpful to reflect on the French colonial concepts of l’Afrique utile and l’Afrique inutile (concepts most recently re-invoked by William Reno in 1999 and James Ferguson in 2005). L’Afrique utile, or usable Africa, represented parts of Africa over which private actors felt they could exercise a degree of governance and control, and therefore extract profit. L’Afrique inutile, on the other hand, was the no-go area: places deemed too risky, too opaque and too wild for commercial profit. Until recently, it was difficult to convince multinationals to view Africa as usable and profitable because much economic activity took place in the unaccounted informal economy. With the exception of a few oil, gas and mineral installations and some export commodities like cocoa, cotton, tobacco, rubber, coffee, and tea, multinationals stayed out of the continent. Likewise, within the accounts of national public policy-making institutions, it was only the very narrow formal and recordable parts of the economy that were captured. Just as economists have traditionally excluded unpaid domestic labour from national accounts, most African states only scratched the surface of their populations’ true economic lives.

The mobile phone has undoubtedly changed the way private companies and public bodies view African economies. Firstly, the mobile phone has demonstrated that Africans can be voracious consumers at the bottom of the pyramid (paving the way for the distribution of other low-cost items such as soap, sanitary pads, soft drinks, etc.). While the colonial scramble for Africa focused on what lay in Africa’s lands and landscapes, the new scramble is focused on its people and markets (and workers, as the growing interest in business process outsourcing demonstrates).

Secondly, mobile phones (and other kinds of information and communication technologies) have created new channels of information about Africans and African markets, particularly in the informal sector. In an era when so much of the apparatus for measuring Africa’s economies has been weakened, this kind of data holds enormous potential. One might say that the mobile phone and the internet have made former parts of l’Afrique inutile into l’Afrique utile — open for business, profit, analysis, and perhaps, control.

The ‘scramble for Africa’s data‘ is taking place within a particular historical trajectory of knowledge production. Africa has always been a laboratory for Western scientists and researchers, with local knowledge production often influenced by foreign powers and foreign ideas (think back to the early reliance on primary products for export, to which the entire colonial system of economic measurement and development planning was geared). Within the contemporary context of ever-expanding higher education and dwindling finances for local research, African academics and researchers have been forced to take on more and more consultancies and private contracts.

This ‘extraversion’ of African institutions of higher education has contributed to a re-orientation of the apparatus for academic research towards questions posed from outside. Within state bodies, similar processes are underway. Weakened by corruption, Structural Adjustment Policies (SAP), and pervasive informal economic activity, management of the economy has migrated from state institutions into the better paid offices of NGOs, consultancies and private companies. State capacity to measure and model is presently very weak, and African governments are therefore being encouraged to ‘open’ up their own records to non-state researchers. It is into this research context that big data emerges as a new source of ‘legibility’.

ICTs offer obvious benefits to economic researchers. They have often been heralded as offering potentially more democratic and participatory kinds of ‘legibility’. Their potential partly lies in the way that ICTs activate ‘social networks’ into infrastructures through which external actors can deliver and extract information. This ‘sociability’ makes them particularly suitable for studying informal economic networks. ICTs also offer the potential to modernise existing streams of data collection and broaden intra-institutional coordination, leading to better collaboration and more targeted public policy. In our project on the economic impacts of fibre optic broadband in East Africa, we have seen how institutions such as the Kenya Tea Board and the Rwandan Health Ministry are better integrating their information systems in order to gain a better national picture, and thereby contribute to industrial upgrading in the case of tea or better public services in the case of health. Nevertheless, big data is not accessible to all, and researchers must often prove commercial or strategic value in order to gain access.

Use of ‘big data’ is still a growing field, born within the discipline of computer science. My initial interviews with big data researchers working on Africa indicate they are still figuring out what kinds of questions can be answered with big data and how they might justify themselves and their methodologies to mainstream economics. Big data’s potential for hypothesis-building is somewhat at odds with the tradition of hypothesis-testing in economics. Big data researchers start with the question, ‘Where can this data lead me?’ There is also the question of how restricted access might frame research design. To date, the researchers that have been most successful in gaining access to African big data have worked with private companies, banks and financial institutions. It is therefore the incorporation and integration of poor people into private sector understandings that big data currently seems to offer.

This vision of development fits into a broader trend. Just as Hernando de Soto has argued that development is hampered by the exclusion of poor people from formalised property rights, proponents of microcredit have likewise argued it is the poor’s exclusion from financial institutions that limit their ability to develop self-sustaining enterprises. Researchers are therefore encouraged to use big data to model poor peoples’ actions and credit worthiness to incorporate them into financial systems, thereby transforming them from invisible selves into visible selves.

Critics of microfinance have cautioned that incorporating poor people into globalised structures of finance makes them more vulnerable to state interference in the form of taxes and to debt and international financial crises. It is also unclear what the drift into the private sector might do to wider understandings of poverty. While national measures situate citizens as members of national or collective groups, mobile financial innovations often focus on the individual’s financial records and credit worthiness. It remains to be seen whether this change of focus might move us away from more social definitions for poverty towards more individual or private explanations.

Likewise the flow of digital information across geographical space has the potential to change the nature of collaboration. As Mahmoud Mamdani has cautioned, “The global market tends to relegate Africa to providing raw material (“data”) to outside academics who process it and then re-export their theories back to Africa. Research proposals are increasingly descriptive accounts of data collection and the methods used to collate data, collaboration is reduced to assistance, and there is a general impoverishment of theory and debate”. This problem could potentially be exacerbated by open data initiatives that seek to get more people using publicly collected data. As Morten Jerven writes in his recent book, Poor Numbers, interactions between African data producers and users are currently limited, with users often unable to effectively assess the source and methods used to collect the original data. Nevertheless, such numbers are often taken at face value, with dubious policy recommendations formed as a result. While multiple sources of data (from the public and private sector) can help increase the precision of research and lead to better conclusions, we do not understand how big data (and open data) will impact the overall research environment in Africa.

My next project will examine these issues in relation to economic studies of unemployment in Egypt and financial inclusion in Uganda. The key objectives will be to improve our understanding of how data is being collected, how data is being communicated across groups and within systems, how new models of the economy are being formed, and what these changes are doing to political and economic relationships on the ground. Specifically, the project poses six interrelated questions: Where is economic intelligence and expertise currently located? What is being measured by whom, and how, and why? How do different tools of measurement change the way researchers understand economic truth and construct their models? How does more ‘legibility’ over African economies change power relations? What resistance or critical thinking exists within these new configurations of expertise? How can we combine approaches to assemble a fuller picture of economic understanding? The project will emphasise how economics, as a discipline, does not merely measure external reality, but helps to shape and influence that reality.

How we measure economies matters, particularly in the context of ever increasing evidence-based policy-making and with increasing interest from the private sector in Africa. Measurement often changes and shapes our realities of the external world. As Timothy Mitchell writes: “the practices that form the economy operate, in part, to establish equivalences, contain circulations, identify social actors or agents, make quantities and performances measurable, and designate relations of control and command”. In other words, researchers cannot make sense of an economy without first establishing a research infrastructure through which subjects are measured and incorporated. The particular shape, tools and technologies of that research infrastructure help frame and construct economic models and truth.

Such frames also have political implications, as control over information often strengthens one group over others. Indeed, as James C. Scott’s work Seeing Like a State has shown, the struggle to establish legibility over societies is inherently political. Elites have always attempted to standardise and regularise more marginal groups in an effort to draw them into dominant political and economic orders. However, legibility does not have to be ‘top-down’. Weaker groups suffer most from illegible societies, and can benefit from more legibility. As information and trust become more deeply embedded within stronger ties and within transnational networks of skill and expertise, marginalised ‘out groups’ are particularly disadvantaged.

While James C. Scott’s work highlighted the dangers of a high modernist ‘legibility’, the very absence of legibility can also disempower marginal groups. It is the kind of legibility at stake that is important. While big data offers enormous potential for economists to better understand what is going on in Africa’s informal economies, economic sociologists, anthropologists and historians must remind them how our tools and measurements influence systems of knowledge production and change our understandings and beliefs about the external world. Africa might be becoming ‘more usable’ and ‘more legible,’ but we need to ask, for whom, by whom, and for what purpose?


Dr Laura Mann is a Postdoctoral Researcher at the Oxford Internet Institute, University of Oxford. Her research focuses on the political economy of markets and value chains in Africa. Her current research examines the effects of broadband internet on the tea, tourism and outsourcing value chains of Kenya and Rwanda. From January 2014 she will be based at the African Studies Centre at Leiden University. Read Laura’s blog.

The scramble for Africa’s data https://ensr.oii.ox.ac.uk/the-scramble-for-africas-data/ https://ensr.oii.ox.ac.uk/the-scramble-for-africas-data/#comments Mon, 08 Jul 2013 09:21:02 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1230 Mobile phone advert in Zomba, Malawi
Africa is in the midst of a technological revolution, and the current wave of digitisation has the potential to make the continent’s citizens a rich mine of data. Intersection in Zomba, Malawi. Image by john.duffell.

 

After the last decade’s exponential rise in ICT use, Africa is fast becoming a source of big data. Africans are increasingly emitting digital information with their mobile phone calls, internet use and various forms of digitised transactions, while on a state level e-government starts to become a reality. As Africa goes digital, the challenge for policymakers becomes what the WRR, a Dutch policy organisation, has identified as ‘i-government’: moving from digitisation to managing and curating digital data in ways that keep people’s identities and activities secure.

On one level, this is an important development for African policymakers, given that accurate information on their populations has been notoriously hard to come by and, where it exists, has not been shared. On another, however, it represents a tremendous challenge. The WRR has pointed out the unpreparedness of European governments, who have been digitising for decades, for the age of i-government. How are African policymakers, as relative newcomers to digital data, supposed to respond?

There are two possible scenarios. One is that systems will develop for the release and curation of Africans’ data by corporations and governments, and that it will become possible, in the words of the UN’s Global Pulse initiative, to use it as a ‘public good’ – an invaluable tool for development policies and crisis response. The other is that there will be a new scramble for Africa: a digital resource grab that may have implications as great as the original scramble amongst the colonial powers in the late 19th century.

We know that African data is not only valuable to Africans. The current wave of digitisation has the potential to make the continent’s citizens a rich mine of data about health interventions, human mobility, conflict and violence, technology adoption, communication dynamics and financial behaviour, with the default mode being for this to happen without their consent or involvement, and without ethical and normative frameworks to ensure data protection or to weigh the risks against the benefits. Orange’s recent release of call data from Côte d’Ivoire is an example both of the emerging potential of African digital data, and of the anonymisation and ethical challenges that such releases present.

I have heard various arguments as to why data protection is not a problem for Africans. One is that people in African countries don’t care about their privacy because they live in a ‘collective society’. (Whatever that means.) Another is that they don’t yet have any privacy to protect because they are still disconnected from the kinds of system that make data privacy important. Another more convincing and evidence-based argument is that the ends may justify the means (as made here by the ICRC in a thoughtful post by Patrick Meier about data privacy in crisis situations), and that if significant benefits can be delivered using African big data these outweigh potential or future threats to privacy. The same argument is being made by Global Pulse, a UN initiative which aims to convince corporations to release data on developing countries as a public good for use in devising development interventions.

There are three main questions: what can incentivise African countries’ citizens and policymakers to address privacy in parallel with the collection of massive amounts of personal data, rather than after abuses occur? What are the models that might be useful in devising privacy frameworks for groups with restricted technological access and sophistication? And finally, how can such a system be participatory enough to be relevant to the needs of particular countries or populations?

Regarding the first question, this may be a lost cause. The WRR’s i-government work suggests that only public pressure due to highly publicised breaches of data security may spur policymakers to act. The answer to the second question is being pursued, among others, by John Clippinger and Alex Pentland at MIT (with their work on the social stack); by the World Economic Forum, which is thinking about the kinds of rules that should govern personal data worldwide; by the aforementioned Global Pulse, which has a strong interest in building frameworks which make it safe for corporations to share people’s data; by Microsoft, which is doing some serious thinking about differential privacy for large datasets; by independent researchers such as Patrick Meier, who is looking at how crowdsourced data about crises and human rights abuses should be handled; and by the Oxford Internet Institute’s new M-Data project which is devising privacy guidelines for collecting and using mobile connectivity data.

Regarding the last question, participatory systems will require African country activists, scientists and policymakers to build them. To be relevant, they will also need to be made enforceable, which may be an even greater challenge. Privacy frameworks are only useful if they are made a living part of both governance and citizenship: there must be the institutional power to hold offenders accountable (in this case extremely large and powerful corporations, governments and international institutions), and awareness amongst ordinary people about the existence and use of their data. This, of course, has not really been achieved in developed countries, so doing it in Africa may not exactly be a piece of cake.

Notwithstanding these challenges, the region offers an opportunity to push researchers and policymakers – local and worldwide – to think clearly about the risks and benefits of big data, and to make solutions workable, enforceable and accessible. In terms of data privacy, if it works in Burkina Faso, it will probably work in New York, but the reverse is unlikely to be true. This makes a strong argument for figuring it out in Burkina Faso.

Some may contend that this discussion only points out the massive holes in the governance of technology that prevail in Africa – and in fact a whole other level of problems regarding accountability and power asymmetries. My response: Yes. Absolutely.


Linnet Taylor’s research focuses on social and economic aspects of the diffusion of the internet in Africa, and human mobility as a factor in technology adoption (read her blog). Her doctoral research was on Ghana, where she looked at mobility’s influence on the formation and viability of internet cafes in poor and remote areas, networking amongst Ghanaian technology professionals, and ICT4D policy. At the OII she works on a Sloan Foundation funded project on Accessing and Using Big Data to Advance Social Science Knowledge.

Investigating the structure and connectivity of online global protest networks https://ensr.oii.ox.ac.uk/investigating-the-structure-and-connectivity-of-online-global-protest-networks/ Mon, 10 Jun 2013 12:04:26 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1275 How have online technologies reconfigured collective action? It is often assumed that the rise of social networking tools, accompanied by the mass adoption of mobile devices, have strengthened the impact and broadened the reach of today’s political protests. Enabling massive self-communication allows protesters to write their own interpretation of events – free from a mass media often seen as adversarial – and emerging protests may also benefit from the cheaper, faster transmission of information and more effective mobilization made possible by online tools such as Twitter.

The new networks of political protest, which harness these new online technologies are often described in theoretical terms as being ‘fluid’ and ‘horizontal’, in contrast to the rigid and hierarchical structure of earlier protest organization. Yet such theoretical assumptions have seldom been tested empirically. This new language of networks may be useful as a shorthand to describe protest dynamics, but does it accurately reflect how protest networks mediate communication and coordinate support?

The global protests against austerity and inequality which took place on May 12, 2012 provide an interesting case study to test the structure and strength of a transnational online protest movement. The ‘indignados’ movement emerged as a response to the Spanish government’s politics of austerity in the aftermath of the global financial crisis. The movement flared in May 2011, when hundreds of thousands of protesters marched in Spanish cities, and many set up camps ahead of municipal elections a week later.

These protests contributed to the emergence of the worldwide Occupy movement. After the original plan to occupy New York City’s financial district mobilised thousands of protesters in September 2011, the movement spread to other cities in the US and worldwide, including London and Frankfurt, before winding down as the camp sites were dismantled weeks later. Interest in these movements was revived, however, as the first anniversary of the ‘indignados’ protests approached in May 2012.

To test whether the fluidity, horizontality and connectivity often claimed for online protest networks holds true in reality, tweets referencing these protest movements during May 2012 were collected. These tweets were then classified as relating either to the ‘indignados’ or Occupy movement, using hashtags as a proxy for content. Many tweets, however, contained hashtags relevant to both movements, creating bridges across the two streams of information. The users behind those bridges acted as information ‘brokers’, and are fundamentally important to the global connectivity of the two movements: they joined the two streams of information and their audiences on Twitter. Once all the tweets were classified by content and author, it emerged that around 6.5% of all users posted at least one message relevant to both movements by using hashtags from both sides jointly.
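A minimal sketch of that classification step is given below. It assumes each tweet has been reduced to a (user, set of hashtags) pair and uses two small, hand-picked hashtag sets as proxies for the indignados and Occupy streams; the tag lists and field layout are illustrative assumptions rather than the study’s actual coding scheme.

```python
INDIGNADOS_TAGS = {"#12m15m", "#spanishrevolution"}  # assumed example tags
OCCUPY_TAGS = {"#ows", "#occupywallstreet"}          # assumed example tags

def find_brokers(tweets):
    """tweets: iterable of (user, hashtag_set). Returns the set of brokers and their share of all users."""
    indignados_only, occupy_only, brokers = set(), set(), set()
    for user, tags in tweets:
        in_indignados = bool(tags & INDIGNADOS_TAGS)
        in_occupy = bool(tags & OCCUPY_TAGS)
        if in_indignados and in_occupy:
            brokers.add(user)        # a single message bridging both streams
        elif in_indignados:
            indignados_only.add(user)
        elif in_occupy:
            occupy_only.add(user)
    all_users = indignados_only | occupy_only | brokers
    share = len(brokers) / len(all_users) if all_users else 0.0
    return brokers, share
```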

Analysis of the Twitter data shows that this small minority of ‘brokers’ play an important role connecting users to a network that would otherwise be disconnected. Brokers are significantly more active in the contribution of messages and more visible in the stream of information, being re-tweeted and mentioned more often than other users. The analysis also shows that these brokers play an important role in the global network, by helping to keep the network together and improving global connectivity. In a simulation, the removal of brokers fragmented the network faster than the removal of random users at the same rate.
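The robustness comparison can be approximated with a standard network library. The sketch below uses networkx to remove the broker nodes, and separately an equal number of randomly chosen nodes, from copies of a communication graph, and compares the share of users left in the largest connected component. It is a generic illustration of this kind of simulation under assumed inputs, not the authors’ actual procedure.

```python
import random
import networkx as nx

def surviving_core(graph, nodes_to_remove):
    """Share of the original users left in the largest connected component after removal."""
    g = graph.copy()
    g.remove_nodes_from(nodes_to_remove)
    if g.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(g), key=len)
    return len(largest) / graph.number_of_nodes()

def compare_removals(graph, brokers):
    """Remove the brokers, and separately an equal number of random users, and compare the damage."""
    brokers = list(brokers)
    random_nodes = random.sample(list(graph.nodes()), len(brokers))
    return {
        "after_removing_brokers": surviving_core(graph, brokers),
        "after_removing_random_users": surviving_core(graph, random_nodes),
    }
```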

What does this tell us about global networks of protest? Firstly, it is clear that global networks are more vulnerable and fragile than is often assumed. Only a small percentage of users disseminate information across transnational divides, and if any of these users cease to perform this role, they are difficult to immediately replace, thus limiting the assumed fluidity of such networks. The decentralized nature of online networks, with no central authority imposing order or even suggesting a common strategy, make the role of ‘brokers’ all the more vital to the survival of networks which cross national borders.

Secondly, the central role performed by brokers suggests that global networks of online protest lack the ‘horizontal’ structure that is often described in the literature. Talking about horizontal structures can be useful as shorthand to refer to decentralised organisations, but not to analyse the process by which these organisations materialise in communication networks. The distribution of users in those networks reveals a strong hierarchy in terms of connections and the ability to communicate effectively.

Future research into online networks, then, should keep in mind that the language of protest networks in the digital age, particularly terms like ‘horizontality’ and ‘fluidity’, does not necessarily stand up to empirical scrutiny. The study of contentious politics in the digital age should be evaluated, first and foremost, through the lens of what protesters actually reveal through their actions.


Read the paper: Sandra Gonzalez-Bailon and Ning Wang (2013) The Bridges and Brokers of Global Campaigns in the Context of Social Media.

How accessible are online legislative data archives to political scientists? https://ensr.oii.ox.ac.uk/how-accessible-are-online-legislative-data-archives-to-political-scientists/ https://ensr.oii.ox.ac.uk/how-accessible-are-online-legislative-data-archives-to-political-scientists/#comments Mon, 03 Jun 2013 12:07:40 +0000 http://blogs.oii.ox.ac.uk/policy/?p=654 House chamber of the Utah State Legislature
A view inside the House chamber of the Utah State Legislature. Image by deltaMike.

Public demands for transparency in the political process have long been a central feature of American democracy, and recent technological improvements have considerably facilitated the ability of state governments to respond to such public pressures. With online legislative archives, state legislatures can make available a large number of public documents. In addition to meeting the demands of interest groups, activists, and the public at large, these websites enable researchers to conduct single-state studies, cross-state comparisons, and longitudinal analysis.

While online legislative archives are, in theory, rich sources of information that save researchers valuable time as they gather data across the states, in practice, government agencies are rarely completely transparent, often do not provide clear instructions for accessing the information they store, seldom use standardized norms, and can overlook user needs. These obstacles to state politics research are longstanding: Malcolm Jewell noted almost three decades ago the need for “a much more comprehensive and systematic collection and analysis of comparative state political data.” While the growing availability of online legislative resources helps to address the first problem of collection, the limitations of search and retrieval functions remind us that the latter remains a challenge.

The fifty state legislative websites are quite different; few of them are intuitive or adequately transparent, and there is no standardized or systematic process to retrieve data. For many states, it is not possible to identify issue-specific bills that are introduced and/or passed during a specific period of time, let alone the sponsors or committees, without reading the full text of each bill. For researchers who are interested in certain time periods, policy areas, committees, or sponsors, the inability to set filters or immediately see relevant results limits their ability to efficiently collect data.

Frustrated by the obstacles we faced in undertaking a study of state-level immigration legislation before and after September 11, 2001, we decided to instead  evaluate each state legislative website — a “state of the states” analysis — to help scholars who need to understand the limitations of the online legislative resources they may want to use. We evaluated three main dimensions on an eleven-point scale: (1) the number of searchable years; (2) the keyword search filters; and (3) the information available on the immediate results pages. The number of searchable sessions is crucial for researchers interested in longitudinal studies, before/after comparisons, other time-related analyses, and the activity of specific legislators across multiple years. The “search interface” helps researchers to define, filter, and narrow the scope of the bills—a particularly important feature when keywords can generate hundreds of possibilities. The “results interface” allows researchers to determine if a given bill is relevant to a research project.
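To make the construction of the index concrete, the sketch below scores a site on the three dimensions and sums them into a composite out of eleven. The per-dimension maxima shown (4, 4 and 3 points) and the example state scores are assumptions chosen only for illustration; the paper’s actual rubric may weight the dimensions differently.

```python
from dataclasses import dataclass

@dataclass
class SiteScore:
    state: str
    searchable_years: int   # depth of the online archive (assumed 0-4 points)
    search_filters: int     # keyword and filter options (assumed 0-4 points)
    results_info: int       # detail shown on the immediate results page (assumed 0-3 points)

    def total(self) -> int:
        """Composite usability score out of a possible 11."""
        return self.searchable_years + self.search_filters + self.results_info

# Invented scores for illustration, not the paper's data.
sites = [SiteScore("Utah", 3, 2, 2), SiteScore("Texas", 4, 3, 3)]
average = sum(s.total() for s in sites) / len(sites)
print(average)  # 7.5 for this toy pair of states
```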

Our paper builds on the work of other scholars and organizations interested in state policy. To help begin a centralized space for data collection, Kevin Smith and Scott Granberg-Rademacker publicly invited “researchers to submit descriptions of data sources that were likely to be of interest to state politics and policy scholars,” calling for “centralized, comprehensive, and reliable datasets” that are easy to download and manipulate. In this spirit, Jason Sorens, Fait Muedini, and William Ruger introduced a free database that offered a comprehensive set of variables involving over 170 public policies at the state and local levels in order to “reduce reduplication of scholarly effort.” The National Conference of State Legislatures (NCSL) provides links to state legislatures, bill lists, constitutions, reports, and statutes for all fifty states. The State Legislative History Research Guides compiled by the University of Indiana Law School also include links to legislative and historical resources for the states, such as the Legislative Reference Library of Texas. However, to our knowledge, no existing resource assesses usability across all state websites.

So, what did we find during our assessment of the state websites? In general, we observed that the archival records as well as the search and results functions leave considerable room for improvement. The maximum possible score was 11 in each year, and the average was 3.87 in 2008 and 4.25 in 2010. For researchers interested in certain time periods, policy areas, committees, or sponsors, the inability to set filters, immediately see relevant results, and access past legislative sessions limits their ability to complete projects in a timely manner (or at all). We also found a great deal of variation in site features, content, and navigation. Greater standardization would improve access to information about state policymaking by researchers and the general public—although some legislators may well see benefits to opacity.

While we noted some progress over the study period, not all change was positive. By 2010, two states had scored 10 points (no state scored the full 11), fewer states had very low scores, and the average score rose slightly from 3.87 to 4.25 (out of 11). This suggests slow but steady improvement, and the provision of a baseline of support for researchers. However, a quarter of the states showed score drops over the study period, for the most part reflecting the adoption of “Powered by Google” search tools that used only keywords, and some in a very limited manner. If the latter becomes a trend, we could see websites becoming less, not more, user friendly in the future.

In addition, our index may serve as a proxy variable for state government transparency. While  the website scores were not statistically associated with Robert Erikson, Gerald Wright, and John McIver’s measure of state ideology, there may nevertheless be promise for future research along these lines; additional transparency determinants worth testing include legislative professionalism and social capital. Moving forward, the states might consider creating a working group to share ideas and best practices, perhaps through an organization like the National Conference of State Legislatures, rather than the national government, as some states might resist leadership from D.C. on federalist grounds.

Helen Margetts (2009) has noted that “The Internet has the capacity to provide both too much (which poses challenges to analysis) and too little data (which requires innovation to fill the gaps).” It is notable, and sometimes frustrating, that state legislative websites illustrate both dynamics. As datasets come online at an increasing rate, it is also easy to forget that websites can vary in terms of user friendliness, hierarchical structure, search terms and functions, terminology, and navigability — causing unanticipated methodological and data capture problems (i.e. headaches) to scholars working in this area.


Read the full paper: Taofang Huang, David Leal, B.J. Lee, and Jill Strube (2012) Assessing the Online Legislative Resources of the American States. Policy and Internet 4 (3-4).

Online crowd-sourcing of scientific data could document the worldwide loss of glaciers to climate change https://ensr.oii.ox.ac.uk/online-crowd-sourcing-of-scientific-data-could-document-the-worldwide-loss-of-glaciers-to-climate-change/ Tue, 14 May 2013 09:12:33 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1021 Ed: Project Pressure has created a platform for crowdsourcing glacier imagery, often photographs taken by climbers and trekkers. Why are scientists interested in these images? And what’s the scientific value of the data set that’s being gathered by the platform?

Klaus: Comparative photography using historical images allows year-on-year comparisons to document glacier change. The platform aims to create long-lasting scientific value with minimal technical entry barriers — it is valuable to have a global resource that combines photographs generated by Project Pressure in less documented areas with crowdsourced images taken, for example, by climbers and trekkers, and with archival pictures. The platform is future focused and will hopefully allow an up-to-date view on glaciers across the planet.

The other ways for scientists to monitor glaciers take a lot of time and effort; direct measurement of snowfall is a complicated, resource-intensive and time-consuming process. And while glacier outlines can be traced from satellite imagery, this still needs to be done manually. Also, you can’t measure the thickness, images can be obscured by debris and cloud cover, and some areas just don’t have very many satellite fly-bys.

Ed: There are estimates that the glaciers of Montana’s Glacier National Park will likely be gone by 2020 and the Ugandan glaciers by 2025, and the Alps are rapidly turning into a region of lakes. These are the famous and very visible examples of glacier loss — what’s the scale of the missing data globally?

Klaus: There’s a lot of great research being conducted in this area, however there are approximately 300,000 glaciers world wide, with huge data gaps in South America and the Himalayas for instance. Sharing of Himalayan data between Indian and Chinese scientists has been a sensitive issue, given glacier meltwater is an important strategic resource in the region. But this is a popular trekking route, and it is relatively easy to gather open-source data from the public. Furthermore, there are also numerous national and scientific archives with images lying around that don’t have a central home.

Ed: What metadata are being collected for the crowdsourced images?

Klaus: The public can upload their own photos embedded with GPS, compass direction, and date. This data is aggregated into a single managed platform. With GPS becoming standard in cameras, it’s very simple to contribute to the project — taking photos with embedded GPS data is almost foolproof. The public can also contribute by uploading archival images and adding GPS data to old photographs.
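Reading that embedded metadata back out of an uploaded photograph is straightforward with a standard imaging library. The sketch below uses the Pillow library’s EXIF support to pull out the capture date and the raw GPS fields; it is a general illustration of how a platform might ingest submissions, not Project Pressure’s actual code, and it leaves the degree/minute/second values unconverted.

```python
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def read_photo_metadata(path):
    """Return the capture date and raw GPS fields from a JPEG's EXIF block, where present."""
    exif = Image.open(path)._getexif() or {}          # EXIF tag id -> value (JPEG only)
    named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
    gps = {GPSTAGS.get(tag_id, tag_id): value
           for tag_id, value in (named.get("GPSInfo") or {}).items()}
    return {
        "taken": named.get("DateTimeOriginal"),
        "latitude": gps.get("GPSLatitude"),           # (degrees, minutes, seconds)
        "latitude_ref": gps.get("GPSLatitudeRef"),    # 'N' or 'S'
        "longitude": gps.get("GPSLongitude"),
        "longitude_ref": gps.get("GPSLongitudeRef"),  # 'E' or 'W'
        "direction": gps.get("GPSImgDirection"),      # compass bearing, if recorded
    }
```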

Ed: So you are crowd sourcing the gathering of this data; are there any plans to crowd-source the actual analysis?

Klaus: It’s important to note that accuracy is very important in a database, and the automated (or semiautomated) process of data generation should result in good data. And while the analytical side should be done be professionals, we are making the data open source so it can be used in education for instance. We need to take harness what crowds are good at, and know what the limitations are.

Ed: You mentioned in your talk that the sheer amount of climate data — and also the way it is communicated — means that the public has become disconnected from the reality and urgency of climate change: how is the project working to address this? What are the future plans?

Klaus: Recent studies have demonstrated a disconnect between scientific information regarding climate change and the public. The problem is not access to scientific information, but the fact that it can be overwhelming. Project Pressure is working to reconnect the public with the urgency of the problem by inspiring people to action and participation, and to engage with climate change. Project Pressure is very scalable in terms of the scientific knowledge required to use the platform: from kids to scientists. On the interface one can navigate the world, find locations and directions of photographs, and once funding permits we will also add the time-dimension.

Ed: Project Pressure has deliberately taken a non-political stance on climate change: can you explain why?

Klaus: Climate change has unfortunately become a political subject, but we want to preserve our integrity by not taking a political stance. It’s important that everyone can engage with Project Pressure regardless of their political views. We want to be an independent, objective partner.

Ed: Finally .. what’s your own background? How did you get involved?

Klaus: I’m the founder, and my background is in communication and photography. Input on how to strengthen the conceptualisation has come from a range of very smart people; in particular, Dr M. Zemph from the World Glacier Monitoring Service has been very valuable.


Klaus Thymann was talking at the OII on 18 March 2013; he talked later to blog editor David Sutcliffe.

Time for debate about the societal impact of the Internet of Things https://ensr.oii.ox.ac.uk/time-for-debate-about-the-societal-impact-of-the-internet-of-things/ Mon, 22 Apr 2013 14:32:22 +0000 http://blogs.oii.ox.ac.uk/policy/?p=931
European conference on the Internet of Things
The 2nd Annual Internet of Things Europe 2010: A Roadmap for Europe, 2010. Image by Pierre Metivier.
On 17 April 2013, the US Federal Trade Commission published a call for inputs on the ‘consumer privacy and security issues posed by the growing connectivity of consumer devices, such as cars, appliances, and medical devices’, in other words, about the impact of the Internet of Things (IoT) on the everyday lives of citizens. The call is in large part one for information to establish what the current state of technology development is and how it will develop, but it also looks for views on how privacy risks should be weighed against potential societal benefits.

There’s a lot that’s not very new about the IoT. Embedded computing, sensor networks and machine to machine communications have been around a long time. Mark Weiser was developing the concept of ubiquitous computing (and prototyping it) at Xerox PARC in 1990.  Many of the big ideas in the IoT — smart cars, smart homes, wearable computing — are already envisaged in works such as Nicholas Negroponte’s Being Digital, which was published in 1995 before the mass popularisation of the internet itself. The term ‘Internet of Things’ has been around since at least 1999. What is new is the speed with which technological change has made these ideas implementable on a societal scale. The FTC’s interest reflects a growing awareness of the potential significance of the IoT, and the need for public debate about its adoption.

As the cost and size of devices falls and network access becomes ubiquitous, it is evident that not only major industries but whole areas of consumption, public service and domestic life will be capable of being transformed. The number of connected devices is likely to grow fast in the next few years. The Organisation for Economic Co-operation and Development (OECD) estimates that while a family with two teenagers may have 10 devices connected to the internet, by 2022 this may well grow to 50 or more. Across the OECD area the number of connected devices in households may rise from an estimated 1.7 billion today to 14 billion by 2022. Programmes such as smart cities, smart transport and smart metering will begin to have their effect soon. In other countries, notably in China and Korea, whole new cities are being built around smart infrastructure, giving technology companies the opportunity to develop models that could be implemented subsequently in Western economies.

Businesses and governments alike see this as an opportunity for new investment both as a basis for new employment and growth and for the more efficient use of existing resources. The UK Government is funding a strand of work under the auspices of the Technology Strategy Board on the IoT, and the IoT is one of five themes that are the subject of the Department for Business, Innovation & Skills (BIS)’s consultation on the UK’s Digital Economy Strategy (alongside big data, cloud computing, smart cities, and eCommerce).

The enormous quantity of information that will be produced will provide further opportunities for collecting and analysing big data. There is consequently an emerging agenda about privacy, transparency and accountability. There are challenges too to the way we understand and can manage the complexity of interacting systems that will underpin critical social infrastructure.

The FTC is not alone in looking to open public debate about these issues. In February, the OII and BCS (the Chartered Institute for IT) ran a joint seminar to help the BCS’s consideration about how it should fulfil its public education and lobbying role in this area. A summary of the contributions is published on the BCS website.

The debate at the seminar was wide-ranging. There was no doubt that the train has left the station as far as this next phase of the Internet is concerned. The scale of major corporate investment, government encouragement and entrepreneurial enthusiasm is not to be deflected. In many sectors of the economy there are changes that are already being felt by consumers, or will be soon enough. Smart metering, smart grid, and transport automation (including cars) are all examples. A lot of the discussion focused on risk. In a society which places high value on audit and accountability, it is perhaps unsurprising that early implementations have often involved using sensors and tags to track processes and monitor activity. This is especially attractive in industrial structures that have high degrees of subcontracting.

Wider societal risks were also discussed. As for the FTC, the privacy agenda is salient. There is real concern that the assumptions which underlie the data protection regime (especially its reliance on data minimisation) will not be adequate to protect individuals in an era of ubiquitous data. Nor is it clear that the UK’s regulator, the Information Commissioner, will be equipped to deal with the volume of potential business. Alongside privacy, there is also concern for security and the protection of critical infrastructure. The growth of reliance on the IoT will make cybersecurity significant in many new ways. There are issues too about complexity and the unforeseen (and arguably unforeseeable) consequences of the interactions between complex, large, distributed systems acting in real time, and with consequences that go very directly to the wellbeing of individuals and communities.

There are great opportunities and a pressing need for social research into the IoT. The data about social impacts has been limited hitherto given the relatively few systems deployed. This will change rapidly. As Governments consult and bodies like the BCS seek to advise, it’s very desirable that public debate about privacy and security, access and governance, take place on the basis of real evidence and sound analysis.

Why do (some) political protest mobilisations succeed? https://ensr.oii.ox.ac.uk/why-do-some-political-protest-mobilisations-succeed/ Fri, 19 Apr 2013 13:40:55 +0000 http://blogs.oii.ox.ac.uk/policy/?p=909 The communication technologies once used by rebels and protesters to gain global visibility now look burdensome and dated: much separates the once-futuristic-looking image of Subcomandante Marcos posing in the Chiapas jungle draped in electronic gear (1994) from the uprisings of the 2011 Egyptian revolution. While the only practical platform for amplifying a message was once provided by organisations, the rise of the Internet means that cross-national networks are now reachable by individuals—who are able to bypass organisations, ditch membership dues, and embrace self-organization. As social media and mobile applications increasingly blur the distinction between public and private, ordinary citizens are becoming crucial nodes in the contemporary protest network.

The personal networks that are the main channels of information flow in sites such as Facebook, Twitter and LinkedIn mean that we don’t need to actively seek out particular information; it can be served to us with no more effort than that of maintaining a connection with our contacts. News, opinions, and calls for justice are now shared and forwarded by our friends—and their friends—in a constant churn of information, all attached to familiar names and faces. Given we are more likely to pass on information if the source belongs to our social circle, this has had an important impact on the information environment within which protest movements are initiated and develop.

Mobile connectivity is also important for understanding contemporary protest, given that the ubiquitous streams of synchronous information we access anywhere are shortening our reaction times. This is important, as the evolution of mass recruitments—whether they result in flash mobilisations, slow burns, or simply damp squibs—can only be properly understood if we have a handle on the distribution of reaction times within a population. The increasing integration of the mainstream media into our personal networks is also important, given that online networks (and independent platforms like Indymedia) are not the clear-cut alternative to corporate media they once were. We can now write on the walls or feeds of mainstream media outlets, creating two-way communication channels and public discussion.

Online petitions have also transformed political protest; lower information diffusion costs mean that support (and signatures) can be scaled up much faster. These petitions provide a mine of information for researchers interested in what makes protests succeed or fail. The study of cascading behaviour in online networks suggests that most chain reactions fail quickly, and most petitions don’t gather that much attention anyway. While large cascades tend to start at the core of networks, network centrality is not always a guarantor of success.

So what does a successful cascade look like? Work by Duncan Watts has shown that the vast majority of cascades are small and simple, terminating within one degree of an initial adopting ‘seed.’ Research has also shown that adoptions resulting from chains of referrals are extremely rare; even for the largest cascades observed, the bulk of adoptions often took place within one degree of a few dominant individuals. Conversely, research on the spreading dynamics of a petition organised in opposition to the 2002-2003 Iraq war showed a narrow but very deep tree-like distribution, progressing through many steps and complex paths. The deepness and narrowness of the observed diffusion tree meant that it was fragile—and easily broken at any of the levels required for further distribution. Chain reactions are only successful with the right alignment of factors, and this becomes more likely as more attempts are launched. The rise of social media means that there are now more attempts.
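The shape of a cascade (how many adopters it reaches and how many steps deep it runs) can be computed directly from referral records, which is roughly what such studies measure. The sketch below assumes a simple list of parent-child adoption pairs forming a tree; it is an illustration of the measurement, not the code behind the studies cited.

```python
def cascade_shape(adoptions, seed):
    """adoptions: list of (parent, child) referral pairs forming a tree. Returns (size, depth)."""
    children = {}
    for parent, child in adoptions:
        children.setdefault(parent, []).append(child)

    size, depth, frontier = 0, -1, [seed]
    while frontier:
        size += len(frontier)
        depth += 1
        frontier = [c for node in frontier for c in children.get(node, [])]
    return size, depth  # depth counted in steps from the seed

# A broad, shallow cascade versus a narrow, deep chain of referrals.
print(cascade_shape([("a", "b"), ("a", "c"), ("a", "d")], "a"))  # (4, 1)
print(cascade_shape([("a", "b"), ("b", "c"), ("c", "d")], "a"))  # (4, 3)
```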

One consequence of these—very recent—developments is the blurring of the public and the private. A significant portion of political information shared online travels through networks that are not necessarily political, but that can be activated for political purposes as circumstances arise. Online protest networks are decentralised structures that pull together local sources of information and create efficient channels for a potentially global diffusion, but they replicate the recruitment dynamics that operated in social networks prior to the emergence of the Internet.

The wave of protests seen in 2011—including the Arab Spring, the Spanish Indignados, and the Global Occupy Campaign—reflects this global interdependence of localised, personal networks, with protest movements emerging spontaneously from the individual actions of many thousands (or millions) of networked users. Political protest movements are seldom stable and fixed organisational structures, and online networks are inherently suited to channelling this fluid commitment and identity. However, systematic research to uncover the bridges and precise network mechanisms that facilitate cross-border diffusion is still lacking. Decentralised networks facilitate mobilisations of unprecedented reach and speed—but are not very good at maintaining momentum, or creating particularly stable structures. For this, traditional organisations are still relevant, even while they struggle to maintain a critical mass.

The general failure of traditional organisations to harness the power of these personal networks results from the complex structure of those networks, which complicates any attempts at prediction, planning, and engineering. Mobilisation paths are difficult to predict because they depend on the right alignment of conditions on different levels—from the local information contexts of individuals who initiate or sustain diffusion chains, to the global assembly of separate diffusion branches. The networked chain reactions that result as people jump onto bandwagons follow complex paths; furthermore, the cumulative effects of these individual actions within the network are not linear, due to feedback mechanisms that can cause sudden changes and flips in mobilisation dynamics, such as exponential growth.

Of course, protest movements are not created by social media technologies; they provide just one mechanism by which a movement can emerge, given the right social, economic, and historical circumstances. We therefore need to focus less on the specific technologies and more on how they are used if we are to explain why most mobilisations fail, but some succeed. Technology is just a part of the story—and today’s Twitter accounts will soon look as dated as the electronic gizmos used by the Zapatistas in the Chiapas jungle.

]]>
Uncovering the structure of online child exploitation networks https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/ https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/#comments Thu, 07 Feb 2013 10:11:17 +0000 http://blogs.oii.ox.ac.uk/policy/?p=661 The Internet has provided the social, individual, and technological circumstances needed for child pornography to flourish. Sex offenders have been able to use the Internet to disseminate child pornography, to network with other paedophiles through chatrooms and newsgroups, and to communicate sexually with children. A 2009 United Nations report estimated that there are more than four million websites containing child pornography, with 35 percent of them depicting serious sexual assault [1]. Even if this report or others exaggerate the true prevalence of such websites by a wide margin, the fact remains that they are pervasive on the web.

Despite large investments of law enforcement resources, online child exploitation is nowhere near under control, and while there are numerous technological products to aid in finding child pornography online, they still require substantial human intervention. Nevertheless, steps can be taken to automate more of these searches, reducing the amount of content police officers have to examine and increasing the time they can spend investigating individuals.

While law enforcement agencies will aim for maximum disruption of online child exploitation networks by targeting the most connected players, there is a general lack of research on the structural nature of these networks; something we aimed to address in our study, by developing a method to extract child exploitation networks, map their structure, and analyze their content. Our custom-written Child Exploitation Network Extractor (CENE) automatically crawls the Web from a user-specified seed page, collecting information about the pages it visits by recursively following the links out of the page; the result of the crawl is a network structure containing information about the content of the websites, and the linkages between them [2].

We chose ten websites as starting points for the crawls; four were selected from a list of known child pornography websites, while the other six were selected and verified through Google searches using child pornography search terms. To guide the network extraction process we defined a set of 63 keywords, which included words commonly used by the Royal Canadian Mounted Police to find illegal content; most of them code words used by paedophiles. Websites included in the analysis had to contain at least seven of the 63 keywords on a given web page; manual verification showed that this threshold distinguished well between child exploitation pages and regular web pages. Ten sports networks were analysed as a control.

The web crawler proved able to identify child exploitation websites reliably, with a clear difference found in the hardcore content hosted by child exploitation and non-child exploitation websites. Our results further suggest that a ‘network capital’ measure — which takes into account network connectivity, as well as severity of content — could aid in identifying the key players within online child exploitation networks. These websites are the main concern of law enforcement agencies, making the web crawler a time-saving tool in target prioritisation exercises. Interestingly, while one might assume that website owners would find ways to avoid detection by a web crawler of the type we have used, these websites — despite the fact that much of the content is illegal — turned out to be easy to find. This fits with previous research that has found that only 20-25 percent of online child pornography arrestees used sophisticated tools for hiding illegal content [3,4].
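As a rough illustration of the idea behind such a measure (a minimal sketch only, not the CENE implementation: the node-level 'severity' attribute and the weighting parameter are assumptions introduced here for illustration), one could blend a standard connectivity score with a content-severity score and rank nodes on the combined value:

```python
# Minimal sketch of a 'network capital'-style score on an already-extracted,
# anonymised link graph. The 'severity' node attribute is a hypothetical 0-1
# score produced upstream; the paper may operationalise the measure differently.
import networkx as nx

def network_capital(g: nx.Graph, alpha: float = 0.5) -> dict:
    """Blend degree centrality with node severity; alpha weights connectivity."""
    centrality = nx.degree_centrality(g)
    return {
        n: alpha * centrality[n] + (1 - alpha) * g.nodes[n].get("severity", 0.0)
        for n in g.nodes
    }

# Toy example on a made-up three-site link graph.
g = nx.Graph([("site_a", "site_b"), ("site_a", "site_c"), ("site_b", "site_c")])
nx.set_node_attributes(g, {"site_a": 0.9, "site_b": 0.2, "site_c": 0.5}, "severity")
ranked = sorted(network_capital(g).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # higher-scoring nodes would be the candidates for prioritisation
```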

As mentioned earlier, the huge amount of content found on the Internet means that the likelihood of eradicating the problem of online child exploitation is nil. As the decentralised nature of the Internet makes combating child exploitation difficult, it becomes more important to introduce new methods to address this. Social network analysis measures can be of great assistance to law enforcement agencies investigating all forms of online crime—including online child exploitation. By creating a web crawler that reduces the number of hours officers need to spend examining possible child pornography websites, and determining whom to target, we believe we have developed a method that makes better use of current law enforcement efforts. An automated process has the added benefit of helping to keep officers in the department for longer, as they would not be subjected to as much traumatic content.

There are still areas for further research, the first step being to refine the web crawler further. Although it is already a considerable improvement over a manual analysis of 300,000 web pages, it could be extended to allow efficient analysis of larger networks, bringing us closer to the true size of the full online child exploitation network, but also, we expect, to some of the more hidden (e.g., password/membership protected) websites. This does not negate the value of researching publicly accessible websites, given that they are likely to be the starting point for most individuals.

Much of the law enforcement effort to date has focused on investigating images, primarily because databases of hash values (used to authenticate content) exist for images but not for videos. Our web crawler did not analyse image content itself, but using known hash values would help improve the validity of our severity measurement. Although it would be naïve to suggest that online child exploitation can be completely eradicated, the sorts of social network analysis methods described in our study provide a means of understanding the structure (and therefore key vulnerabilities) of online networks, in turn greatly improving the effectiveness of law enforcement.

[1] Engeler, E. 2009. UN Expert: Child Porn on Internet Increases. The Associated Press, September 16.

[2] Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

[3] Carr, J. 2004. Child Abuse, Child Pornography and the Internet. London: NCH.

[4] Wolak, J., D. Finkelhor, and K.J. Mitchell. 2005. “Child Pornography Possessors Arrested in Internet-Related Crimes: Findings from the National Juvenile Online Victimization Study (NCMEC 06–05–023).” Alexandria, VA: National Center for Missing and Exploited Children.


Read the full paper: Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

]]>
https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/feed/ 2
The “IPP2012: Big Data, Big Challenges” conference explores the new research frontiers opened up by big data .. as well as its limitations https://ensr.oii.ox.ac.uk/the-ipp2012-big-data-big-challenges-conference-explores-the-new-research-frontiers-opened-up-by-big-data-as-well-as-its-limitations/ Mon, 24 Sep 2012 10:50:20 +0000 http://blogs.oii.ox.ac.uk/policy/?p=447 Recent years have seen an increasing buzz around how ‘Big Data’ can uncover patterns of human behaviour and help predict social trends. Most social activities today leave digital imprints that can be collected and stored in the form of large datasets of transactional data. Access to this data presents powerful and often unanticipated opportunities for researchers and policy makers to generate new, precise, and rapid insights into economic, social and political practices and processes, as well as to tackle longstanding problems that have hitherto been impossible to address, such as how political movements like the ‘Arab Spring’ and Occupy originate and spread.

Opening comments from convenor, Helen Margetts
While big data can allow the design of efficient and realistic policy and administrative change, it also brings ethical challenges (for example, when it is used for probabilistic policy-making), raising issues of justice, equity and privacy. It also presents clear methodological and technical challenges: big data generation and analysis require expertise and skills that can be in short supply in governmental organisations, given their dubious record on the guardianship of large-scale datasets, the management of large technology-based projects, and their capacity to innovate. It is these opportunities and challenges that were addressed by the recent conference “Internet, Politics, Policy 2012: Big Data, Big Challenges?” organised by the Oxford Internet Institute (University of Oxford) on behalf of the OII-edited academic journal Policy and Internet. Over the two days of paper and poster presentations and discussion it explored the new research frontiers opened up by big data as well as its limitations, serving as a forum to encourage discussion across disciplinary boundaries on how to exploit this data to inform policy debates and advance social science research.

Duncan Watts (Keynote Speaker)
The conference was organised along three tracks: “Policy”, “Politics”, and “Data + Methods” (see the programme), with panels focusing on the impact of big data on (for example) political campaigning, collective action and political dissent, sentiment analysis, prediction of large-scale social movements, government, public policy, social networks, data visualisation, and privacy. Webcasts are now available of the keynote talks given by Nigel Shadbolt (University of Southampton and Open Data Institute) and Duncan Watts (Microsoft Research). A webcast is also available of the opening plenary panel, which set the scene for the conference, discussing the potential and challenges of big data for public policy-making, with participation from Helen Margetts (OII), Lance Bennett (University of Washington, Seattle), Theo Bertram (UK Policy Manager, Google), and Patrick McSharry (Mathematical Institute, University of Oxford), chaired by Victoria Nash (OII).

Poster Prize Winner Shawn Walker (left) and Paper Prize Winner Jonathan Bright (right) with IPP2012 convenors Sandra Gonzalez-Bailon (left) and Helen Margetts (right).
The evening receptions were held in the Ashmolean Museum (allowing us to project exciting data visualisations onto its shiny white walls), and the University’s Natural History Museum, which provided a rather more fossil-focused ambience. We are very pleased to note that the “Best Paper” winners were Thomas Chadefaux (ETH Zurich) for his paper: Early Warning Signals for War in the News, and Jonathan Bright (EUI) for his paper: The Dynamics of Parliamentary Discourse in the UK: 1936-2011. The Google-sponsored “Best Poster” prize winners were Shawn Walker (University of Washington) for his poster (with Joe Eckert, Jeff Hemsley, Robert Mason, and Karine Nahon): SoMe Tools for Social Media Research, and Giovanni Grasso (University of Oxford) for his poster (with Tim Furche, Georg Gottlob, and Christian Schallhart): OXPath: Everyone can Automate the Web!

Many of the conference papers are available on the conference website; the conference special issue on big data will be published in the journal Policy and Internet in 2013.

]]>
Slicing digital data: methodological challenges in computational social science https://ensr.oii.ox.ac.uk/slicing-digital-data-methodological-challenges-in-computational-social-science/ Wed, 30 May 2012 10:45:26 +0000 http://blogs.oii.ox.ac.uk/policy/?p=337 One of the big social science questions is how our individual actions aggregate into collective patterns of behaviour (think crowds, riots, and revolutions). This question has so far been difficult to tackle due to a lack of appropriate data, and the complexity of the relationship between the individual and the collective. Digital trails are now allowing social scientists to understand this relationship better.

Small changes in individual actions can have large effects at the aggregate level; this opens up the potential for drawing incorrect conclusions about generative mechanisms when only aggregated patterns are analysed, as Schelling showed in his classic model of racial segregation.
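To make that micro-macro gap concrete, below is a minimal, self-contained sketch of a Schelling-style model (the grid size, vacancy rate, tolerance threshold, and number of sweeps are illustrative choices, not values from Schelling's work). Agents with only a mild preference for like-group neighbours end up producing heavily sorted neighbourhoods, so the aggregate pattern alone would suggest far stronger individual preferences than actually exist.

```python
# Minimal Schelling-style segregation sketch on a wrapping grid (illustrative parameters).
import random

def run_schelling(size=30, vacancy=0.1, tolerance=0.3, sweeps=30, seed=1):
    rng = random.Random(seed)
    cells, weights = ["A", "B", None], [(1 - vacancy) / 2, (1 - vacancy) / 2, vacancy]
    grid = {(x, y): rng.choices(cells, weights)[0] for x in range(size) for y in range(size)}

    def neighbours(x, y):
        return [grid[(x + dx) % size, (y + dy) % size]
                for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

    def unhappy(pos):
        occupied = [n for n in neighbours(*pos) if n is not None]
        return bool(occupied) and sum(n == grid[pos] for n in occupied) / len(occupied) < tolerance

    for _ in range(sweeps):
        movers = [p for p, g in grid.items() if g is not None and unhappy(p)]
        empties = [p for p, g in grid.items() if g is None]
        rng.shuffle(movers)
        for p in movers:                    # each unhappy agent moves to a random empty cell
            if not empties:
                break
            q = empties.pop(rng.randrange(len(empties)))
            grid[q], grid[p] = grid[p], None
            empties.append(p)

    # Average share of like-group neighbours: ~0.5 under random mixing,
    # but typically well above 0.7 once the dynamics settle.
    shares = []
    for p, g in grid.items():
        if g is None:
            continue
        occ = [n for n in neighbours(*p) if n is not None]
        if occ:
            shares.append(sum(n == g for n in occ) / len(occ))
    return sum(shares) / len(shares)

print(round(run_schelling(), 2))
```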

Part of the reason why it has been so difficult to explore this connection between the individual and the collective — and the unintended consequences that arise from that connection — is the lack of proper empirical data, particularly around the structure of interdependence that links individual actions. This relational information is what digital data is now providing; however, it presents some new challenges to social scientists, particularly those used to working with smaller, cross-sectional datasets. Suddenly, we can track and analyse the interactions of thousands (if not millions) of people with a time resolution that can go down to the second. The question is how best to aggregate that data and deal with the time dimension.

Interactions take place in continuous time; however, most digital interactions are recorded as events (e.g. sending or receiving messages), and different network structures emerge when those events are aggregated according to different windows (e.g. days, weeks, or months). We still don’t have systematic knowledge of how transforming continuous data into discrete observation windows affects the networks of interaction we analyse. Reconstructing interpersonal networks (particularly longitudinal network data) used to be extremely time-consuming and difficult; now it is relatively easy to obtain that sort of network data, but modelling and analysing it is still a challenge.
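As a simple illustration of the windowing issue (hypothetical data; the column names and the pandas/networkx pipeline are just one possible setup, not a prescribed method), the same stream of timestamped events produces quite different networks depending on whether it is sliced by day or by week:

```python
# Minimal sketch: one event stream aggregated into daily vs weekly networks.
import pandas as pd
import networkx as nx

# Hypothetical, made-up interaction events.
events = pd.DataFrame({
    "sender":   ["a", "b", "a", "c", "b", "a"],
    "receiver": ["b", "c", "c", "a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2012-05-01 09:00", "2012-05-01 18:30", "2012-05-03 11:15",
        "2012-05-08 20:00", "2012-05-09 08:45", "2012-05-20 14:00",
    ]),
})

def networks_per_window(df: pd.DataFrame, freq: str) -> dict:
    """Build one undirected network per observation window ('D' = day, 'W' = week)."""
    nets = {}
    for window, chunk in df.groupby(pd.Grouper(key="timestamp", freq=freq)):
        if chunk.empty:
            continue                        # skip windows with no recorded events
        nets[window] = nx.from_pandas_edgelist(chunk, "sender", "receiver")
    return nets

daily = networks_per_window(events, "D")    # many small, sparse snapshots
weekly = networks_per_window(events, "W")   # fewer, denser snapshots
print(f"{len(daily)} daily networks vs {len(weekly)} weekly networks")
for window, g in weekly.items():
    print(window.date(), g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```

Which of these slices is the "right" network to analyse is precisely the kind of methodological question raised here.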

Another problem faced by social scientists using digital data is that most social networks are multiplex in nature, that is, we belong to many different networks that interact and affect each other by means of feedback effects: How do all these different network structures co-evolve? If we only focus on one network, such as Twitter, we lose information about how activity in other networks (like Facebook, or email, or offline communication) is related to changes in the network we observe. In our study on the Spanish protests, we only track part of the relevant activity: we have a good idea of what was happening on Twitter, but there were obviously lots of other communication networks simultaneously having an influence on people’s behaviour. And while it is exciting as a social scientist to be able to access and analyse huge quantities of detailed data about social movements as they happen, the Twitter network only provides part of the picture.

Finally, when analysing the cascading effects of individual actions there is also the challenge of separating out the effects of social influence and self-selection. Digital data allows us to follow cascading behaviour with better time resolution, but observational data usually does not help us discriminate whether people behave similarly because they influence and follow each other or because they share similar attributes and motivations. Social scientists need to find ways of controlling for this self-selection in online networks; although digital data often lacks the demographic information that would allow such controls to be applied, digital technologies are also helping researchers conduct experiments that pin down the effects of social influence.

Digital data is allowing social scientists to pose questions that couldn’t be answered before. However, there are many methodological challenges that need solving. This talk considers a few, emphasising that strong theoretical motivations should still direct the questions we pose to digital data.

Further reading:

González-Bailón, S., Borge-Holthoefer, J. and Moreno, Y. (2013) Broadcasters and Hidden Influentials in Online Protest Diffusion. American Behavioral Scientist (forthcoming).

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., and Moreno, Y. (2012) Assessing the Bias in Communication Networks Sampled from Twitter. Working Paper.

González-Bailón, S., Borge-Holthoefer, J., Rivero, A. and Moreno, Y. (2011) The Dynamics of Protest Recruitment Through an Online Network. Scientific Reports 1, 197. DOI: 10.1038/srep00197

González-Bailón, S., Kaltenbrunner, A. and Banchs, R.E. (2010) The Structure of Political Discussion Networks: A Model for the Analysis of Online Deliberation. Journal of Information Technology 25 (2) 230-243.

]]>