big data – The Policy and Internet Blog https://ensr.oii.ox.ac.uk Understanding public policy online Mon, 07 Dec 2020 14:25:46 +0000 en-GB hourly 1 What are the barriers to big data analytics in local government? https://ensr.oii.ox.ac.uk/what-are-the-barriers-to-big-data-analytics-in-local-government/ Wed, 28 Jun 2017 08:11:58 +0000 http://blogs.oii.ox.ac.uk/policy/?p=4208 The concept of Big Data has become very popular over the last decade, with many large technology companies successfully building their business models around its exploitation. The UK’s public sector has tried to follow suit, with local governments in particular trying to introduce new models of service delivery based on the routine extraction of information from their own big data. These attempts have been hailed as the beginning of a new era for the public sector, with some commentators suggesting that it could help local governments transition toward a model of service delivery where the quantity and quality of commissioned services is underpinned by data intelligence on users and their current and future needs.

In their Policy & Internet article “Data Intelligence for Local Government? Assessing the Benefits and Barriers to Use of Big Data in the Public Sector“, Fola Malomo and Vania Sena examine the extent to which local governments in the UK are indeed using intelligence from big data, in light of the structural barriers they face when trying to exploit it. Their analysis suggests that the ambitions around the development of big data capabilities in local government are not reflected in actual use. Indeed, these methods have mostly been employed to develop new digital channels for service delivery, and even if the financial benefits of these initiatives are documented, very little is known about the benefits generated by them for the local communities.

While this is slowly changing as councils start to develop their big data capability, the overall impression gained from even a cursory overview is that the full potential of big data is yet to be exploited.

We caught up with the authors to discuss their findings:

Ed.: So what actually is “the full potential” that local government is supposed to be aiming for? What exactly is the promise of “big data” in this context?

Fola / Vania: Local governments seek to improve service delivery amongst other things. Big Data helps to increase the number of ways that local service providers can reach out to, and better the lives of, local inhabitants. In addition, the exploitation of Big Data allows local governments to better target the beneficiaries of their services and emphasise early prevention, which may result in a reduction of delivery costs. Commissioners in a Council need to understand the drivers of the demand for services across different departments and their connections: how the services are connected to each other and how changes in the provision of “upstream” services can affect the “downstream” provision. Many local governments have reams of data (both hard data and soft data) on local inhabitants and local businesses. Big Data can be used to improve services, increase quality of life and make doing business easier.

Ed.: I wonder: can the data available to a local authority even be considered to be “big data” — you mention that local government data tends to be complex, rather than “big and fast”, as in the industry understanding of “big data”. What sorts of data are we talking about?

Fola / Vania: Local governments hold data on individuals, companies, projects and other activities concerning the local community. Health data, including information on children and other at-risk individuals, forms a huge part of the data within local governments. We use the concept of the data-ecosystem to talk about Big Data within local governments. The data ecosystem consists of different types of data on different topics and units which may be used for different purposes.

Complexity within data is driven by the volume of data and the large number of data sources. One must consider the fact that public agencies address needs from communities that cross the administrative boundaries of a single administrative body. Also, the choice of data collection methodology and observation unit is driven by reporting requirements, which are influenced by central government. Lastly, data storage infrastructure may be designed to comply with reporting requirements rather than linking data across agencies; data is not necessarily produced to be merged. The data is not always “big and fast” but requires the use of advanced storage and analytic tools to get useful information that local areas benefit from.

Ed.: Do you think local governments will ever have the capacity (budget, skill) to truly exploit “big data”? What were the three structural barriers you particularly identified?

Fola / Vania: Without funding there is no chance that local governments can fully exploit big data. With funding, local government can benefit from Big Data in a number of ways. The improved usage of Big Data usually requires collaboration between agents. The three main structural barriers to the fruitful exploitation of big data by local governments are: data access; ethical issues; and organisational changes. In addition, skill gaps and investment in information technology have proved problematic.

Data access can be a problem if data exists in separate locations with little communication between the organisations housing it and no easy way to move the data from one place to another. The main advantage of big data technologies is their ability to merge different types of data, mine them for insights, and combine them for actionable insights. Nevertheless, while the use of big data approaches to data exploitation assumes that organisations can access all the data they need, this is not the case in the public sector. A uniform practice on what data can be shared locally has not yet emerged. Furthermore, there is no solution to the fact that data can span organisations that are not part of the public sector and that may therefore be unwilling to share data with public bodies.

De-identifying personal data is another key requirement to fulfil before personal data can be shared under the terms of the Data Protection Act. It is argued that this requirement is relevant when trying to merge small data sets as individuals can be easily re-identified once the data linkage is completed. As a result, the only option left to facilitate the linkage of data sets with personal information is to create a secure environment where data can be safely de-identified and then matched. Safe havens and trusted third parties have been developed exactly for this purpose. Data warehouses, where data from local governments and from other parts of the public sector can be matched and linked, have been developed as an intermediate solution to the lack of infrastructure for matching sensitive data.
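To make the idea of de-identifying and then matching concrete, here is a minimal sketch in Python. It is not taken from the article or from any particular safe haven: the key, the field names and the records are all made up for illustration, and keyed hashing (HMAC-SHA256) is just one common way of building pseudonyms that a trusted third party might use.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be held only by the trusted third party.
LINKAGE_KEY = b"held-only-inside-the-safe-haven"


def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    # Normalise first so that "ab 123" and "AB123" map to the same pseudonym.
    normalised = identifier.strip().upper().replace(" ", "")
    return hmac.new(LINKAGE_KEY, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records, id_field):
    """Return {pseudonym: record without the direct identifier}."""
    result = {}
    for record in records:
        remaining = {k: v for k, v in record.items() if k != id_field}
        result[pseudonymise(record[id_field])] = remaining
    return result


def link(left, right):
    """Join two de-identified data sets on the pseudonyms they share."""
    return {p: {**left[p], **right[p]} for p in left.keys() & right.keys()}


# Made-up records standing in for two agencies' data sets.
housing = [
    {"person_id": "ab 123", "tenancy": "council"},
    {"person_id": "CD456", "tenancy": "private"},
]
social_care = [{"person_id": "AB123", "care_package": True}]

linked = link(deidentify(housing, "person_id"), deidentify(social_care, "person_id"))
print(list(linked.values()))
```

Only the person who appears in both data sets survives the join, and the linked record no longer carries the original identifier. As the authors note, this does not remove the risk of re-identification in small data sets, because the remaining fields may still single someone out.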

Due to the personal nature of the data, ethical issues arise concerning how to use information about individuals and whether persons should be identifiable. There is a huge debate on ethical challenges posed by the routine extraction of information from Big Data. The extraction and manipulation of personal information cannot be easily reconciled with what is perceived to be ethically acceptable in this area. Additional ethical issues relate to the re-use of output from specific predictive models for other purposes within the public sector. This issue is particularly relevant given the fact that most predictive analytics algorithms only provide an estimate of the risk of an event.

Data usage is related to culture, and organisational changes can be a medium to longer term process. As long as key stakeholders in the organisation accept that insights from data will inform service delivery, big data technologies can be used as levers to introduce changes in the way services are provided. Unfortunately, it is commonly believed that the deployment of big data technologies simply implies a change in the way data are interrogated and interpreted and therefore should not have any bearing on the way internal processes are organised.

In addition, data usage can involve investment in information technology and training. It is well known that investment in IT has been very uneven between the private and public sector, and within the private sector as well. Information and communications technology (ICT) budgets have grown across the private sector, with the banking sector and the financial services industry spending 8 percent of their total operating expenditure on ICT; among local authorities, by contrast, ICT spending makes up only 3-6% of the total budget. Furthermore, successful deployment of Big Data technologies needs to be accompanied by the development of internal skills that allow for the analysis and modelling of complex phenomena, which is essential to the development of a data-driven approach to decision making within local governments. However, local governments tend to lack these skills and this skills gap may be exacerbated by the high turnover in the sector. All this, in addition to the sector’s fragmentation in terms of IT provision, reinforces the structural silos that prevent local authorities from sharing and exploiting their data.

Ed.: And do you think these big data techniques will just sort-of seep in to local government, or that there will need to be a proper step-change in terms of skills and attitudes?

Fola / Vania: The benefits of data-driven analysis are being increasingly accepted. Whilst the techniques used might seem to be steadily accepted by local governments, in order to make a real and lasting improvement public bodies should ideally have a big data strategy in place to determine how they will use the data they have available to them. Attitudes can take time to change and the provision of information can help people become more willing to use Big Data in their work.

Ed.: I suppose one solution might be for local councils to buy in the services of third-party specialist “big data for local government” providers, rather than trying to develop in-house capacity: do these providers exist? I imagine local government might have data that would be attractive to commercial companies, maybe as a profit-sharing data partnership?

Fola / Vania: The truth is that providers do exist and they always charge local governments. What is underestimated is the role that data centres can play in this arena. The authors are members of the Economic and Social Research Council-funded Business and Local Government Data Research Centre for smart analytics. This centre helps local councils use their big data better by collating data and performing analysis that is of use to local councils. The centre also provides training to public officials, giving them tools to understand and use data better. The centre is a collaboration between the Universities of Essex, Kent, East Anglia and the London School of Economics. Academics work closely with public officials to come up with solutions to problems facing local areas. In addition, commercial companies are interested in working with local government data. Working with third-party organisations is a good method to ease into the process of using Big Data solutions without having to make huge changes to one’s organisation.

Ed.: Finally — is there anything that central Government can do (assuming it isn’t already 100% occupied with Brexit) to help local governments develop their data analytic capacity?

Fola / Vania: Central governments influence the environment in which local governments operate. Although local councils make decisions over things such as how data is stored, central government can assist by removing some of the previously-mentioned barriers to data usage. For example, government cuts are excessive and are making the sector very volatile so financial help will be useful in this area. Moreover, data access and transfer is made easier with uniformity of data storage protocols. In addition, the public will have more confidence in providing data if there is transparency in the collection, usage and provision of data. Guidelines for the use of sensitive data should be agreed upon and made known in order to improve the quality of the work. Central governments can also help change the general culture of local governments and attitudes towards Big Data. In order for Big Data to work well for all, individuals, companies, local governments and central governments should be well informed about the issues and able to effect change concerning Big Data issues.

Read the full article: Malomo, F. and Sena, V. (2017) Data Intelligence for Local Government? Assessing the Benefits and Barriers to Use of Big Data in the Public Sector. Policy & Internet 9 (1) DOI: 10.1002/poi3.141.


Fola Malomo and Vania Sena were talking to blog editor David Sutcliffe.

Alan Turing Institute and OII: Summit on Data Science for Government and Policy Making https://ensr.oii.ox.ac.uk/alan-turing-institute-and-oii-summit-on-data-science-for-government-and-policy-making/ Tue, 31 May 2016 06:45:39 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3804 The benefits of big data and data science for the private sector are well recognised. So far, considerably less attention has been paid to the power and potential of the growing field of data science for policy-making and public services. On Monday 14th March 2016 the Oxford Internet Institute (OII) and the Alan Turing Institute (ATI) hosted a Summit on Data Science for Government and Policy Making, funded by the EPSRC. Leading policy makers, data scientists and academics came together to discuss how the ATI and government could work together to develop data science for the public good. The convenors of the Summit, Professors Helen Margetts (OII) and Tom Melham (Computer Science), report on the day’s proceedings.

The Alan Turing Institute will build on the UK’s existing academic strengths in the analysis and application of big data and algorithm research to place the UK at the forefront of world-wide research in data science. The University of Oxford is one of five university partners, and the OII is the only partnering department in the social sciences. The aim of the summit on Data Science for Government and Policy-Making was to understand how government can make better use of big data and the ATI – with the academic partners in listening mode.

We hoped that the participants would bring forward their own stories, hopes and fears regarding data science for the public good. Crucially, we wanted to work out a roadmap for how different stakeholders can work together on the distinct challenges facing government, as opposed to commercial organisations. At the same time, data science research and development has much to gain from the policy-making community. Some of the things that government does – collect tax from the whole population, or give money away at scale, or possess the legitimate use of force – it does by virtue of being government. So the sources of data and some of the data science challenges that public agencies face are unique and tackling them could put government working with researchers at the forefront of data science innovation.

During the Summit a range of stakeholders provided insight from their distinctive perspectives: the Government Chief Scientific Advisor, Sir Mark Walport; Deputy Director of the ATI, Patrick Wolfe; the National Statistician and Director of ONS, John Pullinger; and Director of Data at the Government Digital Service, Paul Maltby. Representatives of frontline departments recounted how algorithmic decision-making is already bringing predictive capacity into operational business, improving efficiency and effectiveness.

Discussion revolved around the challenges of how to build core capability in data science across government, rather than outsourcing it (as happened in an earlier era with information technology) or confining it to a data science profession. Some delegates talked of being in the ‘foothills’ of data science. The scale, heterogeneity and complexity of some government departments currently work against data science innovation, particularly when larger departments can operate thousands of databases, creating legacy barriers to interoperability. Out-dated policies can work against data science methodologies. Attendees repeatedly voiced concerns about sharing data across government departments, in some cases because of limitations of legal protections; in others because people were unsure what they can and cannot do.

The potential power of data science creates an urgent need for discussion of ethics. Delegates and speakers repeatedly affirmed the importance of an ethical framework and of thought leadership in this area, so that ethics is ‘part of the science’. The clear emergent option was a national Council for Data Ethics (along the lines of the Nuffield Council for Bioethics) convened by the ATI, as recommended in the recent Science and Technology parliamentary committee report The big data dilemma and the government response. Luciano Floridi (OII’s professor of the philosophy and ethics of information) warned that we cannot reduce ethics to mere compliance. Ethical problems do not normally have a single straightforward ‘right’ answer, but require dialogue and thought and extend far beyond individual privacy. There was consensus that the UK has the potential to provide global thought leadership and to set the standard for the rest of Europe. It was announced during the Summit that an ATI Working Group on the Ethics of Data Science has been confirmed, to take these issues forward.

So what happens now?

Throughout the Summit there were calls from policy makers for more data science leadership. We hope that the ATI will be instrumental in providing this, and an interface both between government, business and academia, and between separate Government departments. This Summit showed just how much real demand – and enthusiasm – there is from policy makers to develop data science methods and harness the power of big data. No-one wants to repeat with data science the history of government information technology – where in the 1950s and 60s, government led the way as an innovator, but has struggled to maintain this position ever since. We hope that the ATI can act to prevent the same fate for data science and provide both thought leadership and the ‘time and space’ (as one delegate put it) for policy-makers to work with the Institute to develop data science for the public good.

So since the Summit, in response to the clear need that emerged from the discussion and other conversations with stakeholders, the ATI has been designing a Policy Innovation Unit, with the aim of working with government departments on ‘data science for public good’ issues. Activities could include:

  • Secondments at the ATI for data scientists from government
  • Short term projects in government departments for ATI doctoral students and postdoctoral researchers
  • Developing ATI as an accredited data facility for public data, as suggested in the current Cabinet Office consultation on better use of data in government
  • ATI pilot policy projects, using government data
  • Policy symposia focused on specific issues and challenges
  • ATI representation in regular meetings at the senior level (for example, between Chief Scientific Advisors, the Cabinet Office, the Office for National Statistics, GO-Science)
  • ATI acting as an interface between public and private sectors, for example through knowledge exchange and the exploitation of non-government sources as well as government data
  • ATI offering a trusted space, time and a forum for formulating questions and developing solutions that tackle public policy problems and push forward the frontiers of data science
  • ATI as a source of cross-fertilization of expertise between departments
  • Reviewing the data science landscape in a department or agency, identifying feedback loops – or lack thereof – between policy-makers, analysts, front-line staff and identifying possibilities for an ‘intelligent centre’ model through strategic development of expertise.

The Summit, and a series of Whitehall Roundtables convened by GO-Science which led up to it, have initiated a nascent network of stakeholders across government, which we aim to build on and develop over the coming months. If you are interested in being part of this, please do be in touch with us:

Helen Margetts, Oxford Internet Institute, University of Oxford (director@oii.ox.ac.uk)

Tom Melham, Department of Computer Science, University of Oxford

Exploring the Ethics of Monitoring Online Extremism https://ensr.oii.ox.ac.uk/exploring-the-ethics-of-monitoring-online-extremism/ Wed, 23 Mar 2016 09:59:02 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3616 (Part 2 of 2) The Internet serves not only as a breeding ground for extremism, but also offers myriad data streams which potentially hold great value to law enforcement. The report by the OII’s Ian Brown and Josh Cowls for the VOX-Pol project: Check the Web: Assessing the Ethics and Politics of Policing the Internet for Extremist Material explores the complexities of policing the web for extremist material, and its implications for security, privacy and human rights. In the second of a two-part post, Josh Cowls and Ian Brown discuss the report with blog editor Bertie Vidgen. Read the first post.

Surveillance in NYC's financial district. Photo by Jonathan McIntosh (flickr).

Ed: Josh, political science has long posed a distinction between public spaces and private ones. Yet it seems like many platforms on the Internet, such as Facebook, cannot really be categorized in such terms. If this is correct, what does it mean for how we should police and govern the Internet?

Josh: I think that is right – many online spaces are neither public nor private. This is also an issue for some privacy legal frameworks (especially in the US). A lot of the covenants and agreements were written forty or fifty years ago, long before anyone had really thought about the Internet. That has now forced governments, societies and parliaments to adapt these existing rights and protocols for the online sphere. I think that we have some fairly clear laws about the use of human intelligence sources, and police law in the offline sphere. The interesting question is how we can take that online. How can the pre-existing standards, like the requirement that procedures are necessary and proportionate, or the ‘right to appeal’, be incorporated into online spaces? In some cases there are direct analogies. In other cases there needs to be some re-writing of the rule book to try to figure out what we mean. And, of course, it is difficult because the internet itself is always changing!

Ed: So do you think that concepts like proportionality and justification need to be updated for online spaces?

Josh: I think that at a very basic level they are still useful. People know what we mean when we talk about something being necessary and proportionate, and about the importance of having oversight. I think we also have a good idea about what it means to be non-discriminatory when applying the law, though this is one of those areas that can quickly get quite tricky. Consider the use of online data sources to identify people. On the one hand, the Internet is ‘blind’ in that it does not automatically codify social demographics. In this sense it is not possible to profile people in the same way that we can offline. On the other hand, it is in some ways the complete opposite. It is very easy to directly, and often invisibly, create really firm systems of discrimination – and, most problematically, to do so opaquely.

This is particularly challenging when we are dealing with extremism because, as we pointed out in the report, extremists are generally pretty unremarkable in terms of demographics. It perhaps used to be true that extremists were more likely to be poor or to have had challenging upbringings, but many of the people going to fight for the Islamic State are middle class. So we have fewer demographic pointers to latch onto when trying to find these people. Of course, insofar as there are identifiers they won’t be released by the government. The real problem for society is that there isn’t very much openness and transparency about these processes.

Ed: Governments are increasingly working with the private sector to gain access to different types of information about the public. For example, in Australia a Telecommunications bill was recently passed which requires all telecommunication companies to keep the metadata – though not the content data – of communications for two years. A lot of people opposed the Bill because metadata is still very informative, and as such there are some clear concerns about privacy. Similar concerns have been expressed in the UK about an Investigatory Powers Bill that would require new Internet Connection Records about customers’ online activities. How much do you think private corporations should protect people’s data? And how much should concepts like proportionality apply to them?

Ian: To me the distinction between metadata and content data is fairly meaningless. For example, often just knowing when and who someone called and for how long can tell you everything you need to know! You don’t have to see the content of the call. There are a lot of examples like this which highlight the slightly ludicrous nature of distinguishing between metadata and content data. It is all data. As has been said by former US CIA and NSA Director Gen. Michael Hayden, “we kill people based on metadata.”

One issue that we identified in the report is the increased onus on companies to monitor online spaces, and all of the legal entanglements that come from this given that companies might not be based in the same country as the users. One of our interviewees called this new international situation a ‘very different ballgame’. Working out how to deal with problematic online content is incredibly difficult, and some huge issues of freedom of speech are bound up in this. On the one hand, there is a government-led approach where we use the law to take down content. On the other hand is a broader approach, whereby social networks voluntarily take down objectionable content even if it is permissible under the law. This causes much more serious problems for human rights and the rule of law.

Read the full report: Brown, I., and Cowls, J., (2015) Check the Web: Assessing the Ethics and Politics of Policing the Internet for Extremist Material. VOX-Pol Publications.


Ian Brown is Professor of Information Security and Privacy at the OII. His research is focused on surveillance, privacy-enhancing technologies, and Internet regulation.

Josh Cowls is a student and researcher based at MIT, working to understand the impact of technology on politics, communication and the media.

Josh and Ian were talking to Blog Editor Bertie Vidgen.

P-values are widely used in the social sciences, but often misunderstood: and that’s a problem. https://ensr.oii.ox.ac.uk/many-of-us-scientists-dont-understand-p-values-and-thats-a-problem/ https://ensr.oii.ox.ac.uk/many-of-us-scientists-dont-understand-p-values-and-thats-a-problem/#comments Mon, 07 Mar 2016 18:53:29 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3604 P-values are widely used in the social sciences, especially ‘big data’ studies, to calculate statistical significance. Yet they are widely criticized for being easily hacked, and for not telling us what we want to know. Many have argued that, as a result, research is wrong far more often than we realize. In their recent article P-values: Misunderstood and Misused OII Research Fellow Taha Yasseri and doctoral student Bertie Vidgen argue that we need to make standards for interpreting p-values more stringent, and also improve transparency in the academic reporting process, if we are to maximise the value of statistical analysis.

“Significant”: an illustration of selective reporting and statistical significance from XKCD. Available online at http://xkcd.com/882/

In an unprecedented move, the American Statistical Association recently released a statement (March 7 2016) warning against how p-values are currently used. This reflects a growing concern in academic circles that whilst a lot of attention is paid to the huge impact of big data and algorithmic decision-making, there is considerably less focus on the crucial role played by statistics in enabling effective analysis of big data sets, and making sense of the complex relationships contained within them. Much as datafication has created huge social opportunities, it has also brought to the fore many problems and limitations with current statistical practices. In particular, the deluge of data has made it crucial that we can work out whether studies are ‘significant’. In our paper, published three days before the ASA’s statement, we argued that the most commonly used tool in the social sciences for calculating significance – the p-value – is misused, misunderstood and, most importantly, doesn’t tell us what we want to know.

The basic problem of ‘significance’ is simple: it is impractical to repeat an experiment an infinite number of times to make sure that what we observe is “universal”. The same applies to our sample size: we are often unable to analyse a “whole population” sample and so have to generalize from our observations on a limited size sample to the whole population. The obvious problem here is that what we observe is based on a limited number of experiments (sometimes only one experiment) and from a limited size sample, and as such could have been generated by chance rather than by an underlying universal mechanism! We might find it impossible to make the same observation if we were to replicate the same experiment multiple times or analyse a larger sample. If this is the case then we will mischaracterise what is happening – which is a really big problem given the growing importance of ‘evidence-based’ public policy. If our evidence is faulty or unreliable then we will create policies, or intervene in social settings, in an equally faulty way.

The way that social scientists have got round this problem (that samples might not be representative of the population) is through the ‘p-value’. The p-value tells you the probability of making a similar observation in a sample with the same size and in the same number of experiments, by pure chance. In other words, it tells you how likely it is that you would see the same relationship between X and Y even if no relationship exists between them. On the face of it this is pretty useful, and in the social sciences we normally say that a p-value below 1 in 20 (0.05) means the results are significant. Yet as the American Statistical Association has just noted, even though p-values are incredibly widespread, many researchers misinterpret what they really mean.
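One way to see what “by pure chance” means here is a permutation test. The sketch below is our own illustration rather than anything from the paper, and the two groups are made-up numbers. It shuffles the group labels many times, so that any relationship between group and outcome is destroyed, and counts how often the shuffled data show a difference at least as large as the one actually observed.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real link between group membership and outcome.
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(shuffled_a) / n_a - sum(shuffled_b) / len(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up data: two clearly separated groups, and two identical groups.
p_separated = permutation_p_value([10, 11, 12, 13, 14, 15, 16, 17], [1, 2, 3, 4, 5, 6, 7, 8])
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p_separated, p_identical)
```

For the separated groups almost no shuffle reproduces the observed gap, so the estimated p-value is very small. For the identical groups every shuffle does at least as well, so it is 1. In neither case does the number say how likely it is that a real relationship exists.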

In our paper we argued that p-values are misunderstood and misused because people think the p-value tells you much more than it really does. In particular, people think the p-value tells you (i) how likely it is that a relationship between X and Y really exists and (ii) the percentage of all findings that are false (which is actually something different called the False Discovery Rate). As a result, we are far too confident that academic studies are correct. Some commentators have argued that at least 30% of studies are wrong because of problems related to p-values: a huge figure. One of the main problems is that p-values can be ‘hacked’ and as such easily manipulated to show significance when none exists.
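The gap between a significance threshold and the False Discovery Rate can be shown with some simple arithmetic. The figures below are assumptions chosen for illustration, not results from the paper: we suppose that 10% of tested hypotheses are really true and that studies detect a true effect 80% of the time. The second function shows why running many tests makes ‘hacking’ easy, as in the XKCD cartoon with its 20 jelly bean colours.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Share of 'significant' findings that are actually false."""
    true_positives = prior_true * power          # real effects that are detected
    false_positives = (1 - prior_true) * alpha   # null effects that pass the threshold
    return false_positives / (true_positives + false_positives)


def chance_of_spurious_hit(n_tests, alpha):
    """Probability of at least one 'significant' result when no effect exists at all."""
    return 1 - (1 - alpha) ** n_tests


# Assumed inputs: 10% of hypotheses true, 80% power.
fdr_at_005 = false_discovery_rate(prior_true=0.10, power=0.80, alpha=0.05)
fdr_at_001 = false_discovery_rate(prior_true=0.10, power=0.80, alpha=0.01)
spurious_in_20 = chance_of_spurious_hit(n_tests=20, alpha=0.05)

print(f"False discovery rate at 0.05: {fdr_at_005:.0%}")
print(f"False discovery rate at 0.01: {fdr_at_001:.0%}")
print(f"Chance of a spurious hit in 20 tests: {spurious_in_20:.0%}")
```

Under these assumptions, roughly a third of findings that pass the 0.05 threshold would be false, even though each individual test has only a 1 in 20 chance of a false alarm. Tightening the threshold to 0.01 brings that share down to about a tenth, although in practice a stricter threshold would also lower power unless sample sizes grow.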

If we are going to base public policy (and as such public funding) on ‘evidence’ then we need to make sure that the evidence used is reliable. P-values need to be used far more rigorously, with significance levels of 0.01 or 0.001 seen as standard. We also need to start being more open and transparent about how results are recorded. It is a fine line between data exploration (a legitimate academic exercise) and ‘data dredging’ (where results are manipulated in order to find something noteworthy). Only if researchers are honest about what they are doing will we be able to maximise the potential benefits offered by Big Data. Luckily there are some great initiatives – like the Open Science Framework – which improve transparency around the research process, and we fully endorse researchers making use of these platforms.
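A short calculation shows why data dredging is so dangerous and why stricter thresholds help. The Python sketch below assumes, for simplicity, that the tests are independent and that no real relationship exists in any of them; the function name and the choice of twenty tests are illustrative assumptions.

```python
def chance_of_false_alarm(n_tests, alpha):
    """Probability that at least one of n independent tests comes out
    'significant' when no real relationship exists in any of them."""
    return 1 - (1 - alpha) ** n_tests


# Trying 20 unrelated variables against the same outcome:
print(chance_of_false_alarm(20, 0.05))   # roughly 0.64
print(chance_of_false_alarm(20, 0.001))  # roughly 0.02
```

With the conventional 0.05 cut-off, a researcher who tries twenty unrelated variables is more likely than not to find something ‘noteworthy’ by chance alone. This is why reporting how many analyses were run matters as much as the p-value itself.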

Scientific knowledge advances through corroboration and incremental progress, and it is crucial that we use and interpret statistics appropriately to ensure this progress continues. As our knowledge and use of big data methods increase, we need to ensure that our statistical tools keep pace.

Read the full paper: Vidgen, B. and Yasseri, T., (2016) P-values: Misunderstood and Misused, Frontiers in Physics, 4:6. http://dx.doi.org/10.3389/fphy.2016.00006


Bertie Vidgen is a doctoral student at the Oxford Internet Institute researching far-right extremism in online contexts. He is supervised by Dr Taha Yasseri, a research fellow at the Oxford Internet Institute interested in how Big Data can be used to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

]]>
New Voluntary Code: Guidance for Sharing Data Between Organisations https://ensr.oii.ox.ac.uk/new-voluntary-code-guidance-for-sharing-data-between-organisations/ Fri, 08 Jan 2016 10:40:37 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3540 Many organisations are coming up with their own internal policy and guidelines for data sharing. However, for data sharing between organisations to be straightforward, there needs to be a common understanding of basic policy and practice. During her time as an OII Visiting Associate, Alison Holt developed a pragmatic solution in the form of a Voluntary Code, anchored in the developing ISO standards for the Governance of Data. She discusses the voluntary code, and the need to provide urgent advice to organisations struggling with policy for sharing data.

Collecting, storing and distributing digital data is significantly easier and cheaper now than ever before, in line with predictions from Moore, Kryder and Gilder. Organisations are incentivised to collect large volumes of data with the hope of unleashing new business opportunities or maybe even new businesses. Consider the likes of Uber, Netflix, and Airbnb and the other data mongers who have built services based solely on digital assets.

The use of this new abundant data will continue to disrupt traditional business models for years to come, and there is no doubt that these large data volumes can provide value. However, they also bring associated risks (such as unplanned disclosure and hacks) and they come with constraints (for example in the form of privacy or data protection legislation). Hardly a week goes by without a data breach hitting the headlines. Even if your telecommunications provider didn’t inadvertently share your bank account and sort code with hackers, and your child wasn’t one of the hundreds of thousands of children whose birthdays, names, and photos were exposed by a smart toy company, you might still be wondering exactly how your data is being looked after by the banks, schools, clinics, utility companies, local authorities and government departments that are so quick to collect your digital details.

Then there are the companies who have invited you to sign away the rights to your data and possibly your privacy too – the ones that ask you to sign the Terms and Conditions for access to a particular service (such as a music or online shopping service) or have asked you for access to your photos. And possibly you are one of the “worried well” who wear or carry a device that collects your health data and sends it back to storage in a faraway country, for analysis.

So unless you live in a lead-lined concrete bunker without any access to internet connected devices, and you don’t have the need to pass by webcams or sensors, or use public transport or public services; then your data is being collected and shared. And for the majority of the time, you benefit from this enormously. The bus stop tells you exactly when the next bus is coming, you have easy access to services and entertainment fitted very well to your needs, and you can do most of your bank and utility transactions online in the peace and quiet of your own home. Beyond you as an individual, there are organisations “out there” sharing your data to provide you better healthcare, education, smarter city services and secure and efficient financial services, and generally matching the demand for services with the people needing them.

So we most likely all have data that is being shared and it is generally in our interest to share it, but how can we trust the organisations responsible for sharing our data? As an organisation, how can I know that my partner and supplier organisations are taking care of my client and product information?

Organisations taking these issues seriously are coming up with their own internal policy and guidelines. However, for data sharing between organisations to be straightforward, there needs to be a common understanding of basic policy and practice. During my time as a visiting associate at the Oxford Internet Institute, University of Oxford, I have developed a pragmatic solution in the form of a Voluntary Code. The Code has been produced using the guidelines for voluntary code development produced by the Office of Community Affairs, Industry Canada. More importantly, the Code is anchored in the developing ISO standards for the Governance of Data (the 38505 series). These standards apply the governance principles and model from the 38500 standard and introduce the concept of a data accountability map, highlighting six focus areas for a governing body to apply governance. The early stage standard suggests considering the aspects of Value, Risk and Constraint for each area, to determine what practice and policy should be applied to maximise the value from organisational data, whilst applying constraints as set by legislation and local policy, and minimising risk.

I am Head of the New Zealand delegation to the ISO group developing IT Service Management and IT Governance standards, SC40, and am leading the development of the 38505 series of Governance of Data standards, working with a talented editorial team of industry and standards experts from Australia, China and the Netherlands. I am confident that the robust ISO consensus-led process involving subject matter experts from around the world, will result in the publication of best practice guidance for the governance of data, presented in a format that will have relevance and acceptance internationally.

In the meantime, however, I see a need to provide urgent advice to organisations struggling with policy for sharing data. I have used my time at Oxford to interview policy, ethics, smart city, open data, health informatics, education, cyber security and social science experts and users, owners and curators of large data sets, and have come up with a “Voluntary Code for Data Sharing”. The Code takes three areas from the data accountability map in the developing ISO standard 38505-1; namely Collect, Store, Distribute, and applies the aspects of Value, Risk and Constraint to provide seven maxims for sharing data. To assist with adoption and compliance, the Code provides references to best practice and examples. As the ISO standards for the Governance of Data develop, the Code will be updated. New examples of good practice will be added as they come to light.

[A permanent home for the voluntary code is currently being organised; please email me in the meantime if you are interested in it: Alison.holt@longitude174.com]

The Code is deliberately short and succinct, but it does provide links for those who need to read more to understand the underpinning practices and standards, and those tasked with implementing organisational data policy and practice. It cannot guarantee good outcomes. With new security threats arising daily, nobody can fully guarantee the safety of your information. However, if you deal with an organisation that is compliant with the Voluntary Code, then you can have assurance that the organisation has at least considered how it is using your data now and how it might want to reuse your data in the future, how and where your data will be stored, and then finally how your data will be distributed or discarded. And that’s a good start!


Alison Holt was an OII Academic Visitor in late 2015. She is an internationally acclaimed expert in the Governance of Information Technology and Data, heading up the New Zealand delegations to the international standards committees for IT Governance and Service Management (SC40) and Software and Systems Engineering (SC7). The British Computer Society published Alison’s first book on the Governance of IT in 2013.

]]>
Government “only” retaining online metadata still presents a privacy risk https://ensr.oii.ox.ac.uk/government-only-retaining-online-metadata-still-presents-a-privacy-risk/ Mon, 30 Nov 2015 08:14:56 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3514 Issues around data capture, retention and control are gaining significant attention in many Western countries — including in the UK. In this piece originally posted on the Ethics Centre Blog, the OII’s Brent Mittelstadt considers the implications of metadata retention for privacy. He argues that when considered in relation to individuals’ privacy, metadata should not be viewed as fundamentally different to data about the content of a communication.

Since 13 October 2015 telecommunications providers in Australia have been required to retain metadata on communications for two years. Image by r2hox (Flickr).

Australia’s new data retention law for telecommunications providers, comparable to extant UK and US legislation, came into effect 13 October 2015. Telecoms and ISPs are now required to retain metadata about communications for two years to assist law enforcement agencies in crime and terrorism investigation. Despite now being in effect, the extent and types of data to be collected remain unclear. The law has been widely criticised for violating Australians’ right to privacy by introducing overly broad surveillance of civilians. The Government has argued against this portrayal. They argue the content of communications will not be retained but rather the “data about the data” – location, time, date and duration of a call.

Metadata retention raises complex ethical issues often framed in terms of privacy which are relevant globally. A popular argument is that metadata offers a lower risk of violating privacy compared to primary data – the content of communication. The distinction between the “content” and “nature” of a communication implies that if the content of a message is protected, so is the privacy of the sender and receiver.

The assumption that metadata retention is more acceptable because of its lower privacy risks is unfortunately misguided. Sufficient volumes of metadata offer comparable opportunities to generate invasive information about civilians. Consider a hypothetical. I am given access to a mobile carrier’s dataset that specifies time, date, caller and receiver identity in addition to a continuous record of location constructed with telecommunication tower triangulation records. I see from this that when John’s wife Jane leaves the house, John often calls Jill and visits her for a short period afterwards. From this I conclude that John may be having an affair with Jill. Now consider the alternative. Instead of metadata I have access to recordings of the calls between John and Jill with which I reach the same conclusion.

From a privacy perspective the method I used to infer something about John’s marriage is trivial. In both cases I am making an intrusive inference about John based on data that describes his behaviours. I cannot be certain but in both cases I am sufficiently confident that my inference is correct based on the data available. My inferences are actionable – I treat them as if they are reliable, accurate knowledge when interacting with John. It is this willingness to act on uncertainty (which is central to ‘Big Data’) that makes metadata ethically similar to primary data. While it is comparatively difficult to learn something from metadata, the potential is undeniable. Both types allow for invasive inferences to be made about the lives and behaviours of people.

Going further, some would argue that metadata can actually be more invasive than primary data. Variables such as location, time and duration are easier to assemble into a historical record of behaviour than content. These concerns are deepened by the difficulty of “opting out” of metadata surveillance. While a person can hypothetically forego all modern communication technologies, privacy suddenly has a much higher cost in terms of quality of life.

Technologies such as encrypted communication platforms, virtual private networks (VPN) and anonymity networks have all been advocated as ways to subvert metadata collection by hiding aspects of your communications. It is worth remembering that these techniques remain feasible only so long as they remain legal, one has the technical knowledge and (in some cases) ability to pay. These technologies raise a question of whether a right to anonymity exists. Perhaps privacy enhancing technologies are immoral? Headlines about digital piracy and the “dark web” show how quickly technologically hiding one’s identity and behaviours can take on a criminal and immoral tone. The status quo of privacy subtly shifts when techniques to hide aspects of one’s personal life are portrayed as necessarily subversive. The technologies to combat metadata retention are not criminal or immoral – they are privacy enhancing technologies.

Privacy is historically a fundamental human value. Individuals have a right to privacy. Violations must be justified by a competing interest. In discussing the ethics of metadata retention and anonymity technologies it is easy to forget this status quo. Privacy is not something that individuals have to justify or argue for – it should be assumed.


Brent Mittelstadt is a Postdoctoral Research Fellow at the Oxford Internet Institute working on the ‘Ethics of Biomedical Big Data‘ project with Prof. Luciano Floridi. His research interests include the ethics of information handled by medical ICT, theoretical developments in discourse and virtue ethics, and epistemology of information.

]]>
How big data is breathing new life into the smart cities concept https://ensr.oii.ox.ac.uk/how-big-data-is-breathing-new-life-into-the-smart-cities-concept/ Thu, 23 Jul 2015 09:57:10 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3297 “Big data” is a growing area of interest for public policy makers: for example, it was highlighted in UK Chancellor George Osborne’s recent budget speech as a major means of improving efficiency in public service delivery. While big data can apply to government at every level, the majority of innovation is currently being driven by local government, especially cities, who perhaps have greater flexibility and room to experiment and who are constantly on a drive to improve service delivery without increasing budgets.

Work on big data for cities is increasingly incorporated under the rubric of “smart cities”. The smart city is an old(ish) idea: give urban policymakers real time information on a whole variety of indicators about their city (from traffic and pollution to park usage and waste bin collection) and they will be able to improve decision making and optimise service delivery. But the initial vision, which mostly centred around adding sensors and RFID tags to objects around the city so that they would be able to communicate, has thus far remained unrealised (big up front investment needs and the requirements of IPv6 are perhaps the most obvious reasons for this).

The rise of big data – large, heterogeneous datasets generated by the increasing digitisation of social life – has however breathed new life into the smart cities concept. If all the cars have GPS devices, all the people have mobile phones, and all opinions are expressed on social media, then do we really need the city to be smart at all? Instead, policymakers can simply extract what they need from a sea of data which is already around them. And indeed, data from mobile phone operators has already been used for traffic optimisation, Oyster card data has been used to plan London Underground service interruptions, sewage data has been used to estimate population levels … the examples go on.

However, at the moment these examples remain largely anecdotal, driven forward by a few cities rather than adopted worldwide. The big data driven smart city faces considerable challenges if it is to become a default means of policymaking rather than a conversation piece. Getting access to the right data; correcting for biases and inaccuracies (not everyone has a GPS, phone, or expresses themselves on social media); and communicating it all to executives remain key concerns. Furthermore, especially in a context of tight budgets, most local governments cannot afford to experiment with new techniques which may not pay off instantly.

This is the context of two current OII projects in the smart cities field: UrbanData2Decide (2014-2016) and NEXUS (2015-2017). UrbanData2Decide joins together a consortium of European universities, each working with a local city partner, to explore how local government problems can be resolved with urban generated data. In Oxford, we are looking at how open mapping data can be used to estimate alcohol availability; how website analytics can be used to estimate service disruption; and how internal administrative data and social media data can be used to estimate population levels. The best concepts will be built into an application which allows decision makers to access these concepts in real time.

NEXUS builds on this work. A collaborative partnership with BT, it will look at how social media data and some internal BT data can be used to estimate people movement and traffic patterns around the city, joining these data into network visualisations which are then displayed to policymakers in a data visualisation application. Both projects fill an important gap by allowing city officials to experiment with data driven solutions, providing proofs of concept and showing what works and what doesn’t. Increasing academic-government partnerships in this way has real potential to drive forward the field and turn the smart city vision into a reality.


OII Research Fellow Jonathan Bright is a political scientist specialising in computational and ‘big data’ approaches to the social sciences. His major interest concerns studying how people get information about the political process, and how this is changing in the internet era.

]]>
Digital Disconnect: Parties, Pollsters and Political Analysis in #GE2015 https://ensr.oii.ox.ac.uk/digital-disconnect-parties-pollsters-and-political-analysis-in-ge2015/ Mon, 11 May 2015 15:16:16 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3268
The Oxford Internet Institute undertook some live analysis of social media data over the night of the 2015 UK General Election. See more photos from the OII’s election night party, or read about the data hack.

Counts of public Facebook posts mentioning any of the party leaders’ surnames. Data generated by social media can be used to understand political behaviour and institutions on an ongoing basis.

‘Congratulations to my friend @Messina2012 on his role in the resounding Conservative victory in Britain’ tweeted David Axelrod, campaign advisor to Miliband, to his former colleague Jim Messina, Cameron’s strategy adviser, on May 8th. The former was Obama’s communications director and the latter campaign manager of Obama’s 2012 campaign. Along with other consultants and advisors and large-scale data management platforms from Obama’s hugely successful digital campaigns, Conservative and Labour used an arsenal of social media and digital tools to interact with voters throughout, as did all the parties competing for seats in the 2015 election.

The parties ran very different kinds of digital campaigns. The Conservatives used advanced data science techniques borrowed from the US campaigns to understand how their policy announcements were being received and to target groups of individuals. They spent ten times as much as Labour on Facebook, using ads targeted at Facebook users according to their activities on the platform, geo-location and demographics. This was a top down strategy that involved working out what was happening on social media and responding with targeted advertising, particularly for marginal seats. It was supplemented by the mainstream media, such as the Telegraph for example, which contacted its database of readers and subscribers to services such as Telegraph Money, urging them to vote Conservative. As Andrew Cooper tweeted after the election, ‘Big data, micro-targeting and social media campaigns just thrashed “5 million conversations” and “community organizing”’.

He has a point. Labour took a different approach to social media. Widely acknowledged to have the most boots on the real ground, knocking on doors, they took a similar ‘ground war’ approach to social media in local campaigns. Our own analysis at the Oxford Internet Institute shows that of the 450K tweets sent by candidates of the six largest parties in the month leading up to the general election, Labour party candidates sent over 120,000 while the Conservatives sent only 80,000, no more than the Greens and not much more than UKIP. But the greater number of Labour tweets were no more productive in terms of impact (measured in terms of mentions generated: and indeed the final result).

Both parties’ campaigns were tightly controlled. Ostensibly, Labour generated far more bottom-up activity from supporters using social media, through memes like #votecameronout, #milibrand (responding to Miliband’s interview with Russell Brand), and what Miliband himself termed the most unlikely cult of the 21st century in his resignation speech, #milifandom, none of which came directly from Central Office. These produced peaks of activity on Twitter that at some points exceeded even discussion of the election itself on the semi-official #GE2015 used by the parties, as the figure below shows. But the party remained aloof from these conversations, fearful of mainstream media mockery.

The Brand interview was agreed to out of desperation and can have made little difference to the vote (partly because Brand endorsed Miliband only after the deadline for voter registration: young voters suddenly overcome by an enthusiasm for participatory democracy after Brand’s public volte face on the utility of voting will have remained disenfranchised). But engaging with the swathes of young people who spend increasing amounts of their time on social media is a strategy for engagement that all parties ought to consider. YouTubers like PewDiePie have tens of millions of subscribers and billions of video views – their videos may seem unbelievably silly to many, but it is here that a good chunk of the next generation of voters are to be found.

Use of emergent hashtags on Twitter during the 2015 General Election. Volumes are estimates based on a 10% sample with the exception of #ge2015, which reflects the exact value. All data from Datasift.

Only one of the leaders had a presence on social media that managed anything like the personal touch and universal reach that Obama achieved in 2008 and 2012 based on sustained engagement with social media – Nicola Sturgeon. The SNP’s use of social media, developed in last September’s referendum on Scottish independence had spawned a whole army of digital activists. All SNP candidates started the campaign with a Twitter account. When we look at the 650 local campaigns waged across the country, by far the most productive in the sense of generating mentions was the SNP; 100 tweets from SNP local candidates generating 10 times more mentions (1,000) than 100 tweets from (for example) the Liberal Democrats.

Scottish Labour’s failure to engage with Scottish peoples in this kind of way illustrates how difficult it is to suddenly develop relationships on social media – followers on all platforms are built up over years, not in the short space of a campaign. In strong contrast, advertising on these platforms as the Conservatives did is instantaneous, and based on the data science understanding (through advertising algorithms) of the platform itself. It doesn’t require huge databases of supporters – it doesn’t build up relationships between the party and supporters – indeed, they may remain anonymous to the party. It’s quick, dirty and effective.

The pollsters’ terrible night

So neither of the two largest parties really did anything with social media, or the huge databases of interactions that their platforms will have generated, to generate long-running engagement with the electorate. The campaigns were disconnected from their supporters, from their grass roots.

But the differing use of social media by the parties could lend a clue to why the opinion polls throughout the campaign got it so wrong, underestimating the Conservative lead by an average of five per cent. The social media data that may be gathered from this or any campaign is a valuable source of information about what the parties are doing, how they are being received, and what people are thinking or talking about in this important space – where so many people spend so much of their time. Of course, it is difficult to read from the outside; Andrew Cooper labeled the Conservatives’ campaign of big data to identify undecided voters, and micro-targeting on social media, as ‘silent and invisible’ and it seems to have been so to the polls.

Many voters were undecided until the last minute, or decided not to vote, which is impossible to predict with polls (bar the exit poll) – but possibly observable on social media, such as the spikes in attention to UKIP on Wikipedia towards the end of the campaign, which may have signaled their impressive share of the vote. As Jim Messina put it to msnbc news following up on his May 8th tweet that UK (and US) polling was ‘completely broken’ – ‘people communicate in different ways now’, arguing that the Miliband campaign had tried to go back to the 1970s.

Surveys – such as polls – give a (hopefully) representative picture of what people think they might do. Social media data provide an (unrepresentative) picture of what people really said or did. Long-running opinion surveys (such as the Ipsos MORI Issues Index) can monitor the hopes and fears of the electorate in between elections, but attention tends to focus on the huge barrage of opinion polls at election time – which are geared entirely at predicting the election result, and which do not contribute to more general understanding of voters. In contrast, social media are a good way to track rapid bursts in mobilization or support, which reflect immediately on social media platforms – and could also be developed to illustrate more long running trends, such as unpopular policies or failing services.

As opinion surveys face more and more challenges, there is surely good reason to supplement them with social media data, which reflect what people are really thinking on an ongoing basis – like a video, rather than the irregular snapshots taken by polls. As a leading pollster, João Francisco Meira, director of Vox Populi in Brazil (which is doing innovative work in using social media data to understand public opinion), put it in conversation with one of the authors in April – ‘we have spent so long trying to hear what people are saying – now they are crying out to be heard, every day’. It is a question of pollsters working out how to listen.

Political big data

Analysts of political behaviour – academics as well as pollsters – need to pay attention to this data. At the OII we gathered large quantities of data from Facebook, Twitter, Wikipedia and YouTube in the lead-up to the election campaign, including mentions of all candidates (as did Demos’s Centre for the Analysis of Social Media). Using this data we will be able, for example, to work out the relationship between local social media campaigns and the parties’ share of the vote, as well as modeling the relationship between social media presence and turnout.

We can already see that the story of the local campaigns varied enormously – while at the start of the campaign some candidates were probably requesting new passwords for their rusty Twitter accounts, some already had an ongoing relationship with their constituents (or potential constituents), which they could build on during the campaign. One of the candidates to take over the Labour party leadership, Chuka Umunna, joined Twitter in April 2009 and now has 100K followers, which will be useful in the forthcoming leadership contest.

Election results inject data into a research field that lacks ‘big data’. Data hungry political scientists will analyse these data in every way imaginable for the next five years. But data in between elections, for example relating to democratic or civic engagement or political mobilization, has traditionally been woefully short in our discipline. Analysis of the social media campaigns in #GE2015 will start to provide a foundation to understand patterns and trends in voting behaviour, particularly when linked to other sources of data, such as the actual constituency-level voting results and even discredited polls – which may yet yield insight, even having failed to achieve their predictive aims. As the OII’s Jonathan Bright and Taha Yasseri have argued, we need ‘a theory-informed model to drive social media predictions, that is based on an understanding of how the data is generated and hence enables us to correct for certain biases’.

A political data science

Parties, pollsters and political analysts should all be thinking about these digital disconnects in #GE2015, rather than burying them with their hopes for this election. As I argued in a previous post, let’s use data generated by social media to understand political behaviour and institutions on an ongoing basis. Let’s find a way of incorporating social media analysis into polling models, for example by linking survey datasets to big data of this kind. The more such activity moves beyond the election campaign itself, the more useful social media data will be in tracking the underlying trends and patterns in political behavior.

And for the parties, these kinds of ways of understanding and interacting with voters need to be institutionalized in party structures, from top to bottom. On 8th May, the VP of a policy think-tank tweeted to both Axelrod and Messina ‘Gentlemen, welcome back to America. Let’s win the next one on this side of the pond’. The UK parties are on their own now. We must hope they use the time to build an ongoing dialogue with citizens and voters, learning from the success of the new online interest group barons, such as 38 Degrees and Avaaz, by treating all internet contacts as ‘members’ and interacting with them on a regular basis. Don’t wait until 2020!


Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics, investigating political behaviour, digital government and government-citizen interactions in the age of the internet, social media and big data. She has published over a hundred books, articles and major research reports in this area, including Political Turbulence: How Social Media Shape Collective Action (with Peter John, Scott Hale and Taha Yasseri, 2015).

Scott A. Hale is a Data Scientist at the OII. He develops and applies techniques from computer science to research questions in the social sciences. He is particularly interested in the area of human-computer interaction and the spread of information between speakers of different languages online and the roles of bilingual Internet users. He is also interested in collective action and politics more generally.

]]>
Tracing our every move: Big data and multi-method research https://ensr.oii.ox.ac.uk/tracing-our-every-move-big-data-and-multi-method-research/ https://ensr.oii.ox.ac.uk/tracing-our-every-move-big-data-and-multi-method-research/#comments Thu, 30 Apr 2015 09:32:55 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3210
There is a lot of excitement about ‘big data’, but the potential for innovative work on social and cultural topics far outstrips current data collection and analysis techniques. Image by IBM Deutschland.
Using anything digital always creates a trace. The more digital ‘things’ we interact with, from our smart phones to our programmable coffee pots, the more traces we create. When collected together these traces become big data. These collections of traces can become so large that they are difficult to store, access and analyze with today’s hardware and software. But as a social scientist I’m interested in how this kind of information might be able to illuminate something new about societies, communities, and how we interact with one another, rather than engineering challenges.

Social scientists are just beginning to grapple with the technical, ethical, and methodological challenges that stand in the way of this promised enlightenment. Most of us are not trained to write database queries or regular expressions, or even to communicate effectively with those who are trained. Ethical questions arise with informed consent when new analytics are created. Even a data scientist could not know the full implications of consenting to data collection that may be analyzed with currently unknown techniques. Furthermore, social scientists tend to specialize in a particular type of data and analysis: surveys or experiments and inferential statistics, interviews and discourse analysis, participant observation and ethnomethodology, and so on. Collaborating across these lines is often difficult, particularly between quantitative and qualitative approaches. Researchers in these areas tend to ask different questions and accept different kinds of answers as valid.

Yet trace data does not fit into the quantitative / qualitative binary. The trace of a tweet includes textual information, often with links or images and metadata about who sent it, when and sometimes where they were. The traces of web browsing are also largely textual with some audio/visual elements. The quantity of these textual traces often necessitates some kind of initial quantitative filtering, but it doesn’t determine the questions or approach.

The challenges are important to understand and address because the promise of new insight into social life is real. Large-scale patterns become possible to detect. For example, according to one study of mobile phone location data, one’s future location is 93% predictable (Song, Qu, Blumm & Barabási, 2010), despite great variation in the individual patterns. This new finding opens up further possibilities for comparison and understanding the context of these patterns. Are locations more or less predictable among people with different socio-economic circumstances? What are the key differences between the most and least predictable?

Computational social science is often associated with large-scale studies of anonymized users such as the phone location study mentioned above, or participation traces of those who contribute to online discussions. Studies that focus on limited information about a large number of people are only one type, which I call horizontal trace data. Other studies that work in collaboration with informed participants can add context and depth by asking for multiple forms of trace data and involving participants in interpreting them — what I call the vertical trace data approach.

In my doctoral dissertation I took the vertical approach to examining political information gathering during an election, gathering participants’ web browsing data with their informed consent and interviewing them in person about the context (Menchen-Trevino 2012). I found that access to websites with political information was associated with self-reported political interest, but access to election-specific pages was not. The most active election-specific browsing came from those who were undecided on election day, while many of those with high political interest had already decided whom to vote for before the election began. This is just one example of how digging further into such data can reveal that what is true for larger categories (political information in general) may not be true, and in fact can be misleading, for smaller domains (election-specific browsing). Vertical trace data collection is difficult, but it should be an important component of the project of computational social science.

Read the full article: Menchen-Trevino, E. (2013) Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy and Internet 5 (3) 328-339.

References

Menchen-Trevino, E. (2013) Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy and Internet 5 (3) 328-339.

Menchen-Trevino, E. (2012) Partisans and Dropouts?: News Filtering in the Contemporary Media Environment. Northwestern University, Evanston, Illinois.

Song, C., Qu, Z., Blumm, N., & Barabási, A.-L. (2010) Limits of Predictability in Human Mobility. Science 327 (5968) 1018–1021.


Erica Menchen-Trevino is an Assistant Professor at Erasmus University Rotterdam in the Media & Communication department. She researches and teaches on topics of political communication and new media, as well as research methods (quantitative, qualitative and mixed).

]]>
How can big data be used to advance dementia research? https://ensr.oii.ox.ac.uk/how-can-big-data-be-used-to-advance-dementia-research/ Mon, 16 Mar 2015 08:00:11 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3186
Image by K. Kendall of “Sights and Scents at the Cloisters: for people with dementia and their care partners”; a program developed in consultation with the Taub Institute for Research on Alzheimer’s Disease and the Aging Brain, Alzheimer’s Disease Research Center at Columbia University, and the Alzheimer’s Association.

Dementia affects about 44 million individuals, a number that is expected to nearly double by 2030 and triple by 2050. With an estimated annual cost of USD 604 billion, dementia represents a major economic burden for both industrial and developing countries, as well as a significant physical and emotional burden on individuals, family members and caregivers. There is currently no cure for dementia or a reliable way to slow its progress, and the G8 health ministers have set the goal of finding a cure or disease-modifying therapy by 2025. However, the underlying mechanisms are complex, and shaped by a range of genetic and environmental factors that may have no immediately apparent connection to brain health.

Of course medical research relies on access to large amounts of data, including clinical, genetic and imaging datasets. Making these widely available across research groups helps reduce data collection efforts, increases the statistical power of studies and makes data accessible to more researchers. This is particularly important from a global perspective: Swedish researchers say, for example, that they are sitting on a goldmine of excellent longitudinal and linked data on a variety of medical conditions including dementia, but that they have too few researchers to exploit its potential. Other countries will have many researchers, and less data.

‘Big data’ adds new sources of data and ways of analysing them to the repertoire of traditional medical research data. This can include (non-medical) data from online patient platforms, shop loyalty cards, and mobile phones (made available, for example, through Apple’s ResearchKit, just announced last week). Dementia is believed to be influenced by a wide range of social, environmental and lifestyle-related factors (such as diet, smoking, fitness training, and people’s social networks), so this behavioural data has the potential to improve early diagnosis, as well as allow retrospective insights into events in the years leading up to a diagnosis. For example, data on changes in shopping habits (accessible through loyalty cards) may provide an early indication of dementia.

However, there are many challenges to using and sharing big data for dementia research. The technology hurdles can largely be overcome, but there are also deep-seated issues around the management of data collection, analysis and sharing, as well as underlying people-related challenges in relation to skills, incentives, and mindsets. Change will only happen if we tackle these challenges at all levels jointly.

As data are combined from different research teams, institutions and nations — or even from non-medical sources — new access models will need to be developed that make data widely available to researchers while protecting the privacy and other interests of the data originator. Establishing robust and flexible core data standards that make data more sharable by design can lower barriers for data sharing, and help avoid researchers expending time and effort trying to establish the conditions of their use.

At the same time, we need policies that protect citizens against undue exploitation of their data. Consent needs to be understood by individuals, including the complex and far-reaching implications of providing genetic information, and should provide effective enforcement mechanisms to protect them against data misuse. Privacy concerns about digital, highly sensitive data are important and should not be de-emphasised as a subordinate goal to advancing dementia research. Beyond releasing data in protected environments, allowing people to voluntarily “donate data”, and making consent understandable and enforceable, we also need governance mechanisms that safeguard appropriate data use for a wide range of purposes. This is particularly important as the significance of data changes with its context of use, and data will never be fully anonymisable.

We also need a favourable ecosystem with stable and beneficial legal frameworks, and links between academic researchers and private organisations for exchange of data and expertise. Legislation needs to take account of the growing importance of global research communities in terms of funding and making best use of human and data resources. Also important is sustainable funding for data infrastructures, as well as an understanding that funders can have considerable influence on how research data, in particular, are made available. One of the most fundamental challenges in terms of data sharing is that there are relatively few incentives or career rewards that accrue to data creators and curators, so ways to recognise the value of shared data must be built into the research system.

In terms of skills, we need more health-/bioinformatics talent, as well as collaboration with those disciplines researching factors “below the neck”, such as cardiovascular or metabolic diseases, as scientists increasingly find that these may be associated with dementia to a larger extent than previously thought. Linking in engineers, physicists or innovative private sector organisations may prove fruitful for tapping into new skill sets to separate the signal from the noise in big data approaches.

In summary, everyone involved needs to adopt a mindset of responsible data sharing, collaborative effort, and a long-term commitment to building two-way connections between basic science, clinical care and healthcare in everyday life. Fully capturing the health-related potential of big data requires “out of the box” thinking in terms of how to profit from the huge amounts of data being generated routinely across all facets of our everyday lives. This sort of data offers ways for individuals to become involved, by actively donating their data to research efforts, participating in consumer-led research, or engaging as citizen scientists. Empowering people to be active contributors to science may help alleviate the common feeling of helplessness faced by those whose lives are affected by dementia.

Of course, to do this we need to develop a culture that promotes trust between the people providing the data and those capturing and using it, as well as an ongoing dialogue about new ethical questions raised by collection and use of big data. Technical, legal and consent-related mechanisms to protect individuals’ sensitive biomedical and lifestyle-related data against misuse may not always be sufficient, as the recent Nuffield Council on Bioethics report has argued. For example, we need a discussion around the direct and indirect benefits to participants of engaging in research, when it is appropriate for data collected for one purpose to be put to others, and to what extent individuals can make decisions particularly on genetic data, which may have more far-reaching consequences for their own and their family members’ professional and personal lives if health conditions, for example, can be predicted by others (such as employers and insurance companies).

Policymakers and the international community have an integral leadership role to play in informing and driving the public debate on responsible use and sharing of medical data, as well as in supporting the process through funding, incentivising collaboration between public and private stakeholders, creating data sharing incentives (for example, via taxation), and ensuring stability of research and legal frameworks.

Dementia is a disease that concerns all nations in the developed and developing world, and just as diseases have no respect for national boundaries, neither should research into dementia (and the data infrastructures that support it) be seen as a purely national or regional priority. The high personal, societal and economic importance of improving the prevention, diagnosis, treatment and cure of dementia worldwide should provide a strong incentive for establishing robust and safe mechanisms for data sharing.


Read the full report: Deetjen, U., E. T. Meyer and R. Schroeder (2015) Big Data for Advancing Dementia Research. Paris, France: OECD Publishing.

]]>
Don’t knock clicktivism: it represents the political participation aspirations of the modern citizen https://ensr.oii.ox.ac.uk/dont-knock-clickivism-it-represents-the-political-participation-aspirations-of-the-modern-citizen/ Sun, 01 Mar 2015 10:44:49 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3140
Following a furious public backlash in 2011, the UK government abandoned plans to sell off 258,000 hectares of state-owned woodland. The public forest campaign by 38 Degrees gathered over half a million signatures.
How do we define political participation? What does it mean to say an action is ‘political’? Is an action only ‘political’ if it takes place in the mainstream political arena, involving government, politicians or voting? Or is political participation something that we find in the most unassuming of places, in sports, home and work? This question, ‘what is politics’, is one that political scientists seem to have a lot of trouble dealing with, and with good reason. If we use an arena definition of politics, then we marginalise the politics of the everyday; the forms of participation and expression that develop between the cracks, through need and ingenuity. However, if we broaden our approach so as to adopt what is usually termed a process definition, then everything can become political. The problem here is that saying that everything is political is akin to saying nothing is political, and that doesn’t help anyone.

Over the years, this debate has plodded steadily along, with scholars on both ends of the spectrum fighting furiously to establish a working understanding. Then, the Internet came along and drew up new battle lines. The Internet is at its best when it provides a home for the disenfranchised, an environment where like-minded individuals can wipe free the dust of societal disassociation and connect and share content. However, the Internet brought with it a shift in power, particularly in how individuals conceptualised society and their role within it. The Internet, in addition to this role, provided a plethora of new and customisable modes of political participation. From the outset, a lot of these new forms of engagement were extensions of existing forms, broadening the everyday citizen’s participatory repertoire. There was a move from voting to e-voting, petitions to e-petitions, face-to-face communities to online communities; the Internet took what was already there and streamlined it, removing those pesky elements of time, space and identity.

Yet, as the Internet continues to develop, and we move into the ultra-heightened communicative landscape of the social web, new and unique forms of political participation take root, drawing upon those customisable environments and organic cyber migrations. The most prominent of these is clicktivism, sometimes also, unfairly, referred to as slacktivism. Clicktivism takes the fundamental features of browsing culture and turns them into a means of political expression. Quite simply, clicktivism refers to the simplification of online participatory processes: one-click online petitions, content sharing, social buttons (e.g. Facebook’s ‘Like’ button) etc.

For the most part, clicktivism is seen in derogatory terms, with the idea that the streamlining of online processes has created a societal disposition towards feel-good, ‘easy’ activism. From this perspective, clicktivism is a lazy or overly-convenient alternative to the effort and legitimacy of traditional engagement. Here, individuals engaging in clicktivism may derive some sense of moral gratification from their actions, but clicktivism’s capacity to incite genuine political change is severely limited. Some would go so far as to say that clicktivism has a negative impact on democratic systems, as it undermines an individual’s desire and need to participate in traditional forms of engagement; those established modes which mainstream political scholars understand as the backbone of a healthy, functioning democracy.

This idea that clicktivism isn’t ‘legitimate’ activism is fuelled by a general lack of understanding about what clicktivism actually involves. As a recent development in observed political action, clicktivism has received its fair share of attention in the political participation literature. However, for the most part, this literature has done a poor job of actually defining clicktivism. As such, clicktivism is not so much a contested notion, as an ill-defined one. The extant work continues to describe clicktivism in broad terms, failing to effectively establish what it does, and does not, involve. Indeed, as highlighted, the mainstream political participation literature saw clicktivism not as a specific form of online action, but rather as a limited and unimportant mode of online engagement.

However, to disregard emerging forms of engagement such as clicktivism because they are at odds with long-held notions of what constitutes meaningful ‘political’ engagement is a misguided and dangerous road to travel. Here, it is important that we acknowledge that a political act, even if it requires limited effort, has relevance for the individual, and, as such, carries worth. And this is where we see clicktivism challenging these traditional notions of political participation. To date, we have looked at clicktivism through an outdated lens; an approach rooted in traditional notions of democracy. However, the Internet has fundamentally changed how people understand politics, and, consequently, it is forcing us to broaden our understanding of the ‘political’, and of what constitutes political participation.

The Internet, in no small part, has created a more reflexive political citizen, one who has been given the tools to express dissatisfaction throughout all facets of their life, not just those tied to the political arena. Collective action underpinned by a developed ideology has been replaced by project-oriented identities and connective action. Here, an individual’s desire to engage does not derive from the collective action frames of political parties, but rather from the individual’s self-evaluation of a project’s worth and their personal action frames.

Simply put, people now pick and choose what projects they participate in and feel little generalized commitment to continued involvement. And it is clicktivism which is leading the vanguard here. Clicktivism, as an impulsive, non-committed online political gesture, which can be easily replicated and that does not require any specialized knowledge, is shaped by, and reinforces, this change. It affords the project-oriented individual an efficient means of political participation, without the hassles involved with traditional engagement.

This is not to say, however, that clicktivism serves the same functions as traditional forms. Indeed, much more work is needed to understand the impact and effect that clicktivist techniques can have on social movements and political issues. However, and this is the most important point, clicktivism is forcing us to reconsider what we define as political participation. It does not overtly engage with the political arena, but provides avenues through which to do so. It does not incite genuine political change, but it makes people feel as if they are contributing. It does not politicize issues, but it fuels discursive practices. It may not function in the same way as traditional forms of engagement, but it represents the political participation aspirations of the modern citizen. Clicktivism has been bridging the dualism between the traditional and contemporary forms of political participation, and in its place establishing a participatory duality.

Clicktivism, and similar contemporary forms of engagement, are challenging how we understand political participation, and to ignore them because of what they don’t embody, rather than what they do, is to move forward with eyes closed.

Read the full article: Halupka, M. (2014) Clicktivism: A Systematic Heuristic. Policy and Internet 6 (2) 115-132.


Max Halupka is a PhD candidate at the ANZSOG Institute for Governance, University of Canberra. His research interests include youth political participation, e-activism, online engagement, hacktivism, and fluid participatory structures.

]]>
Gender gaps in virtual economies: are there virtual ‘pink’ and ‘blue’ collar occupations? https://ensr.oii.ox.ac.uk/gender-gaps-in-virtual-economies-are-there-virtual-pink-and-blue-collar-occupations/ Thu, 15 Jan 2015 18:32:51 +0000 http://blogs.oii.ox.ac.uk/policy/?p=3057 She could end up earning 11 percent less than her male colleagues .. Image from EVE Online by zcar.300.

Ed: Firstly, what is a ‘virtual’ economy? And what exactly are people earning or exchanging in these online environments?

Vili: A virtual economy is an economy that revolves around artificially scarce virtual markers, such as Facebook likes or, in this case, virtual items and currencies in an online game. A lot of what we do online today is rewarded with such virtual wealth instead of, say, money.

Ed: In terms of ‘virtual earning power’ what was the relationship between character gender and user gender?

Vili: We know that in national economies, men and women tend to be rewarded differently for the same amount of work; men tend to earn more than women. Since online economies are such a big part of many people’s lives today, we wanted to know if this holds true in those economies as well. Looking at the virtual economies of two massively-multiplayer online games (MMOG), we found that there are indeed some gender differences in how much virtual wealth players accumulate within the same number of hours played. In one game, EVE Online, male players were on average 11 percent wealthier than female players of the same age, character skill level, and time spent playing. We believe that this finding is explained at least in part by the fact that male and female players tend to favour different activities within the game worlds, what we call “virtual pink and blue collar occupations”. In national economies, this is called occupational segregation: jobs perceived as suitable for men are rewarded differently from jobs perceived as suitable for women, resulting in a gender earnings gap.

However, in another game, EverQuest II, we found that male and female players were approximately equally wealthy. This reflects the fact that games differ in what kind of activities they reward. Some provide a better economic return on fighting and exploring, while others make it more profitable to engage in trading and building social networks. In this respect games differ from national economies, which all tend to be biased towards rewarding male-type activities. Going beyond this particular study, fantasy economies could also help illuminate the processes through which particular occupations come to be regarded as suitable for men or for women, because game developers can dream up new occupations with no prior gender expectations attached.

Ed: You also discussed the distinction between user gender and character gender…

Vili: Besides occupational segregation, there are also other mechanisms that could explain economic gender gaps, like differences in performance or outright discrimination in pay negotiations. What’s interesting about game economies is that people can appear in the guise of a gender that differs from their everyday identity: men can play female characters and vice versa. By looking at player gender and character gender separately, we can distinguish between how “being” female and “appearing to be” female are related to economic outcomes.

We found that in EVE Online, using a female character was associated with slightly less virtual wealth, while in EverQuest II, using a female character was associated with being richer on average. Since in our study the players chose the characters themselves instead of being assigned characters at random, we don’t know what the causal relationship between character gender and wealth in these games was, if any. But it’s interesting to note that again the results differed completely between games, suggesting that while gender does matter, its effect has more to do with the mutable “software” of the players and/or the coded environments rather than our immutable “hardware”.

Ed: The dataset you worked with could be considered an example of ‘big data’ (i.e. you had full transactional trace data of people interacting in two games). What can you discover with this sort of data (as opposed to e.g. user surveys, participant observation, or ethnographies), and how useful or powerful is it?

Vili: Social researchers are used to working with small samples of data, and then looking at measures of statistical significance to assess whether the findings are generalizable to the overall population or whether they’re just a fluke. This focus on statistical significance is sometimes so extreme that people forget to consider the practical significance of the findings: even if the effect is real, is it big enough to make any difference in practice? In contrast, when you are working with big data, almost any relationship is statistically significant, so that becomes irrelevant. As a result, people learn to focus more on practical significance — researchers, peer reviewers, journal editors, funders, as well as the general public. This is a good thing, because it can increase the impact that social research has in society.

In this study, we spent a lot of time thinking about the practical significance of the findings. In any national economy, an 11 percent gap between men and women would be huge. But in virtual economies, overall wealth inequality tends to be orders of magnitude greater than in national economies, so that an 11 percent gap is in fact relatively minuscule. Other factors, like whether one is a casual participant in the economy or a semi-professional, have a much bigger effect, so much so that I’m not sure if participants notice a gender gap themselves. Thus one of the key conclusions of the study was that we also need to look beyond traditional sociodemographic categories like gender to see what new social divisions may be appearing in virtual economies.
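
The contrast between statistical and practical significance can be made concrete with a rough numerical sketch. It is not taken from the study: the effect size (0.02 standard deviations) and the sample sizes are invented for illustration, and a simple two-group z-test stands in for whatever models the authors actually used. The effect is held fixed while only the sample size changes.

```python
import math


def two_sided_p_value(effect_in_sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test comparing the means of two equal-sized groups.

    effect_in_sd is the difference in means expressed in standard deviations.
    It is treated as if it were observed exactly, which keeps the sketch
    deterministic (an assumption made purely for illustration).
    """
    # Standard error of a difference in means with unit variance: sqrt(2 / n)
    z = effect_in_sd * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


EFFECT = 0.02  # hypothetical effect, far too small to matter in practice

for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(EFFECT, n)
    print(f"n per group = {n:>9,}  effect = {EFFECT} SD  p = {p:.3g}")
```

The effect is identical in every row; only the p-value changes, falling far below conventional thresholds once the sample reaches big-data scale. That is why the size of the effect, rather than its significance level, becomes the thing worth discussing.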

Ed: What do you think are the hot topics and future directions in research (and policy) on virtual economies, gaming, microwork, crowd-sourcing etc.?

Vili: Previously, ICT adoption resulted in some people’s jobs being eliminated and others being enhanced. This shift had uneven impacts on men’s and women’s jobs. Today, we are seeing an Internet-fuelled “volunterization” of some types of work — moving the work from paid employees and contractors to crowds and fans compensated with points, likes, and badges rather than money. Social researchers should keep track of how this shift impacts different social categories like men and women: whose work ends up being compensated in play money, and who gets to keep the conventional rewards.

Read the full article: Lehdonvirta, V., Ratan, R. A., Kennedy, T. L., and Williams, D. (2014) Pink and Blue Pixel$: Gender and Economic Disparity in Two Massive Online Games. The Information Society 30 (4) 243-255.


Vili Lehdonvirta is a Research Fellow and DPhil Programme Director at the Oxford Internet Institute, and an editor of the Policy & Internet journal. He is an economic sociologist who studies the social and economic dimensions of new information technologies around the world, with particular expertise in digital markets and crowdsourcing.

Vili Lehdonvirta was talking to blog editor David Sutcliffe.

]]>
Two years after the NYT’s ‘Year of the MOOC’: how much do we actually know about them? https://ensr.oii.ox.ac.uk/two-years-after-the-nyts-year-of-the-mooc-how-much-do-we-actually-know-about-them/ https://ensr.oii.ox.ac.uk/two-years-after-the-nyts-year-of-the-mooc-how-much-do-we-actually-know-about-them/#comments Thu, 13 Nov 2014 08:15:32 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2955
Timeline of the development of MOOCs and open education, from: Yuan, Li, and Stephen Powell. MOOCs and Open Education: Implications for Higher Education White Paper. University of Bolton: CETIS, 2013.

Ed: Does research on MOOCs differ in any way from existing research on online learning?

Rebecca: Despite the hype around MOOCs to date, there are many similarities between MOOC research and the breadth of previous investigations into (online) learning. Many of the trends we’ve observed (the prevalence of forum lurking; community formation; etc.) have been studied previously and are supported by earlier findings. That said, the combination of scale, global reach, duration, and “semi-synchronicity” of MOOCs has made them different enough to inspire this work. In particular, the optional nature of participation among a global body of lifelong learners for a short burst of time (e.g. a few weeks) is a relatively new learning environment that, despite theoretical ties to existing educational research, poses a new set of challenges and opportunities.

Ed: The MOOC forum networks you modelled seemed to be less efficient at spreading information than randomly generated networks. Do you think this inefficiency is due to structural constraints of the system (or just because inefficiency is not selected against); or is there something deeper happening here, maybe saying something about the nature of learning, and networked interaction?

Rebecca: First off, it’s important to not confuse the structural “inefficiency” of communication with some inherent learning “inefficiency”. The inefficiency in the sub-forums is a matter of information diffusion—i.e., because there are communities that form in the discussion spaces, these communities tend to “trap” knowledge and information instead of promoting the spread of these ideas to a vast array of learners. This information diffusion inefficiency is not necessarily a bad thing, however. It’s a natural human tendency to form communities, and there is much education research that says learning in small groups can be much more beneficial / effective than large-scale learning. The important point that our work hopes to make is that the existence and nature of these communities seems to be influenced by the types of topics that are being discussed (and vice versa)—and that educators may be able to cultivate more isolated or inclusive network dynamics in these course settings by carefully selecting and presenting these different discussion topics to learners.

Ed: Drawing on surveys and learning outcomes you could categorise four ‘learner types’, who tend to behave differently in the network. Could the network be made more efficient by streaming groups by learning objective, or by type of interaction (eg learning / feedback / social)?

Rebecca: Given our network vulnerability analysis, it appears that discussions that focus on problems or issues based in real-life examples (e.g., those that relate to case studies of real companies, and learners’ posted analyses of these companies) tend to promote more inclusive engagement and efficient information diffusion. Given that certain types of learners participate in these discussions, one could argue that forming groups around learning preferences and objectives could promote more efficient communications. Still, it’s important to be aware of the potential drawback: encouraging like-minded people to interact mainly with each other could further prevent the “learning through diverse exposures” that these massive-scale settings can be well-suited to promote.

Ed: In the classroom, the teacher can encourage participation and discussion if it flags: are there mechanisms to trigger or seed interaction if the levels of network activity fall below a certain threshold? How much real-time monitoring tends to occur in these systems?

Rebecca: Yes, it appears that educators may be able to influence or achieve certain types of network patterns. While each MOOC is different (some course staff members tend to be much more engaged than others, learners may have different motivations, etc.), on the whole, there isn’t much real-time monitoring in MOOCs, and MOOC platforms are still in early days where there is little to no automated monitoring or feedback (beyond static analytics dashboards for instructors).

Ed: Does learner participation in these forums improve outcomes? Do the most central users in the interaction network perform better? And do they tend to interact with other very central people?

Rebecca: While we can’t infer causation, we found that when compared to the entire course, a significantly higher percentage of high achievers were also forum participants. The more likely explanation for this is that those who are committed to completing the course and performing well also tend to use the forums—but the plurality of forum participants (44% in one of the courses we analyzed) are actually those that “fail” by traditional marks (receive below 50% in the course). Indeed, many central users tend to be those that are simply auditing the course or who are interested in communicating with others without any intention of completing course assignments. These central users tend to communicate with other central users, but also, with those whose participation is much sparser / “on the fringes”.

Ed: Slightly facetiously: you can identify ‘central’ individuals in the network who spark and sustain interaction. Can you also find people who basically cause interaction to die? Who will cause the network to fall apart? And could you start to predict the strength of a network based on the profiles and proportions of the individuals who make it up?

Rebecca: It is certainly possible to further explore how different people seem to affect the network. One way this can be achieved is by exploring the temporal dynamics at play: for example, by visualizing the communication network at any point in time and creating network “snapshots” at every hour or day (or perhaps with every new participant) to observe how the trends and structures evolve. While this method still doesn’t allow us to identify the exact influence of any given individual’s participation (since there are so many other confounding factors, for example, how far into the course it is, people’s schedules / lives outside of the MOOC, etc.), it may provide some insight into their roles. We could of course define some quantitative measure(s) of “network strength” based on learner profiles, but it would be essential to be cautious about making broad claims from them, given these confounding forces.

Ed: The majority of my own interactions are mediated by a keyboard: which is actually a pretty inefficient way of communicating, and certainly a terrible way of arguing through a complex point. Is there any sense from MOOCs that text-based communication might be a barrier to some forms of interaction, or learning?

Rebecca: This is an excellent observation. Given the global student body, varying levels of comfort in English (and written language more broadly), differing preferences for communication, etc., there is much reason to believe that a lack of participation could result from a lack of comfort with the keyboard (or written communication more generally). Indeed, in the MOOCs we’ve studied, many learners have attempted to meet up on Google Hangouts or other non-text based media to form and sustain study groups, suggesting that many learners seek to use alternative technologies to interact with others and achieve their learning objectives.

Ed: Based on this data and analysis, are there any obvious design points that might improve interaction efficiency and learning outcomes in these platforms?

Rebecca: As I have mentioned already, open-ended questions that focus on real-life case studies tend to promote the least vulnerable and most “efficient” discussions, which may be of interest to practitioners looking to cultivate these sorts of environments. More broadly, the lack of sustained participation in the forums suggests that there are a number of “forces of disengagement” at play, one of them being that the sheer amount of content being generated in the discussion spaces (one course had over 2,700 threads and 15,600 posts) could be contributing to a sense of “content overload” and helplessness for learners. Designing platforms that help mitigate this problem will be fundamental to the vitality and effectiveness of these learning spaces in the future.

Ed: I suppose there is an inherent tension between making the online environment very smooth and seductive, and the process of learning; which is often difficult and frustrating: the very opposite experience aimed for (eg) by games designers. How do MOOCs deal with this tension? (And how much gamification is common to these systems, if any?)

Rebecca: To date, gamification seems to have been sparse in most MOOCs, although there are some interesting experiments in the works. Indeed, one study (Anderson et al., 2014) used a randomized control trial to add badges (that indicate student engagement levels) next to the names of learners in MOOC discussion spaces in order to determine if and how this affects further engagement. Coursera has also started to publicly display badges next to the names of learners that have signed up for the paid Signature Track of a specific course (presumably, to signal which learners are “more serious” about completing the course than others). As these platforms become more social (and perhaps career advancement-oriented), it’s quite possible that gamification will become more popular. This gamification may not ease the process of learning or make it more comfortable, but may rather offer additional ways to mitigate the challenges of massive-scale anonymity and lack of information about peers, and so facilitate more social learning.

Ed: How much of this work is applicable to other online environments that involve thousands of people exploring and interacting together: for example deliberation, crowd production and interactive gaming, which certainly involve quantifiable interactions and a degree of negotiation and learning?

Rebecca: Since MOOCs are so loosely structured and could largely be considered “informal” learning spaces, we believe the engagement dynamics we’ve found could apply to a number of other large-scale informal learning/interactive spaces online. Similar crowd-like structures can be found in a variety of policy and practice settings.

Ed: This project has adopted a mixed methods approach: what have you gained by this, and how common is it in the field?

Rebecca: Combining computational network analysis and machine learning with qualitative content analysis and in-depth interviews has been one of the greatest strengths of this work, and a great learning opportunity for the research team. Often in empirical research, it is important to validate findings across a variety of methods to ensure that they’re robust. Given the complexity of human subjects, we knew computational methods could only go so far in revealing underlying trends; and given the scale of the dataset, we knew there were patterns that qualitative analysis alone would not enable us to detect. A mixed-methods approach enabled us to simultaneously and robustly address these dimensions. MOOC research to date has been quite interdisciplinary, bringing together computer scientists, educationists, psychologists, statisticians, and a number of other areas of expertise into a single domain. The interdisciplinarity of research in this field is arguably one of the most exciting indicators of what the future might hold.

Ed: As well as the network analysis, you also carried out interviews with MOOC participants. What did you learn from them that wasn’t obvious from the digital trace data?

Rebecca: The interviews were essential to this investigation. In addition to confirming the trends revealed by our computational explorations (which revealed the what of the underlying dynamics at play), the interviews revealed much of the why. In particular, we learned people’s motivations for participating in (or disengaging from) the discussion forums, which provided an important backdrop for subsequent quantitative (and qualitative) investigations. We have also learned a lot more about people’s experiences of learning, the strategies they employ to support their learning, and issues around power and inequality in MOOCs.

Ed: You handcoded more than 6000 forum posts in one of the MOOCs you investigated. What findings did this yield? How would you characterise the learning and interaction you observed through this content analysis?

Rebecca: The qualitative content analysis of over 6,500 posts revealed several key insights. For one, we confirmed (as the network analysis suggested) that most discussion is insignificant “noise”: people looking to introduce themselves or have short-lived discussions about topics that are beyond the scope of the course. In a few instances, however, we discovered the different patterns (and sometimes, cycles) of knowledge construction that can occur within a specific discussion thread. In some cases, we found that discussion threads grew to be so long (with hundreds of posts) that topics were repeated or earlier posts disregarded because new participants didn’t read and/or consider them before adding their own replies.

Ed: How are you planning to extend this work?

Rebecca: As mentioned already, feelings of helplessness resulting from sheer “content overload” in the discussion forums appear to be a key force of disengagement. To that end, as we now have a preliminary understanding of communication dynamics and learner tendencies within these sorts of learning environments, we now hope to leverage this background knowledge to develop new methods for promoting engagement and the fulfilment of individual learning objectives in these settings—in particular, by trying to mitigate the “content overload” issues in some way. Stay tuned for updates 🙂

References

Anderson, A., Huttenlocher, D., Kleinberg, J. & Leskovec, J., Engaging with Massive Open Online Courses.  In: WWW ’14 Proceedings of the 23rd International World Wide Web Conference, Seoul, Korea. New York: ACM (2014).

Read the full paper: Gillani, N., Yasseri, T., Eynon, R., and Hjorth, I. (2014) Structural limitations of learning in a crowd – communication vulnerability and information diffusion in MOOCs. Scientific Reports 4.


Rebecca Eynon was talking to blog editor David Sutcliffe.

Rebecca Eynon holds a joint academic post between the Oxford Internet Institute (OII) and the Department of Education at the University of Oxford. Her research focuses on education, learning and inequalities, and she has carried out projects in a range of settings (higher education, schools and the home) and life stages (childhood, adolescence and late adulthood).

]]>
What are the limitations of learning at scale? Investigating information diffusion and network vulnerability in MOOCs https://ensr.oii.ox.ac.uk/what-are-the-limitations-of-learning-at-scale-investigating-information-diffusion-and-network-vulnerability-in-moocs/ Tue, 21 Oct 2014 11:48:51 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2796 Millions of people worldwide are currently enrolled in courses provided on large-scale learning platforms (aka ‘MOOCs’), typically collaborating in online discussion forums with thousands of peers. Current learning theory emphasizes the importance of this group interaction for cognition. However, while a lot is known about the mechanics of group learning in smaller and traditionally organized online classrooms, fewer studies have examined participant interactions when learning “at scale”. Some studies have used clickstream data to trace participant behaviour, and even to predict dropouts from engagement patterns. However, many questions remain about the characteristics of group interactions in these courses, highlighting the need to understand whether (and how) MOOCs allow for deep and meaningful learning by facilitating significant interactions.

But what constitutes a “significant” learning interaction? In large-scale MOOC forums, with socio-culturally diverse learners with different motivations for participating, this is a non-trivial problem. MOOCs are best defined as “non-formal” learning spaces, where learners pick and choose how (and if) they interact. This kind of group membership, together with the short-term nature of these courses, means that relatively weak inter-personal relationships are likely. Many of the tens of thousands of interactions in the forum may have little relevance to the learning process. So can we actually define the underlying network of significant interactions? Only once we have done this can we explore firstly how information flows through the forums, and secondly the robustness of those interaction networks: in short, the effectiveness of the platform design for supporting group learning at scale.

To explore these questions, we analysed data from 167,000 students registered on two business MOOCs offered on the Coursera platform. Almost 8000 students contributed around 30,000 discussion posts over the six weeks of the courses; almost 30,000 students viewed at least one discussion thread, totalling 321,769 discussion thread views. We first modelled these communications as a social network, with nodes representing students who posted in the discussion forums, and edges (ie links) indicating co-participation in at least one discussion thread. Of course, not all links will be equally important: many exchanges will be trivial (‘hello’, ‘thanks’ etc.). Our task, then, was to derive a “true” network of meaningful student interactions (ie iterative, consistent dialogue) by filtering out those links generated by random encounters (Figure 1; see also full paper for methodology).
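The network construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the paper: the input format (a list of `(student_id, thread_id)` pairs), the function names, and in particular the `min_shared` threshold are assumptions made here. The paper filters links statistically against random encounters; the simple "shared at least N threads" rule below is a crude stand-in for that step.

```python
from collections import defaultdict
from itertools import combinations


def build_coparticipation_graph(posts):
    """Build an undirected, weighted co-participation network.

    posts: iterable of (student_id, thread_id) pairs (hypothetical format).
    Returns {student: {neighbour: number_of_shared_threads}}.
    """
    threads = defaultdict(set)
    students = set()
    for student, thread in posts:
        threads[thread].add(student)
        students.add(student)

    # Every poster is a node, even if they never share a thread with anyone.
    graph = {student: defaultdict(int) for student in students}
    for members in threads.values():
        # An edge links every pair of students who posted in the same thread.
        for a, b in combinations(sorted(members), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {student: dict(nbrs) for student, nbrs in graph.items()}


def filter_edges(graph, min_shared=2):
    """Keep only links backed by at least `min_shared` shared threads.

    Assumption: a plain threshold standing in for the paper's
    statistical filter of links generated by random encounters.
    """
    return {
        student: {nbr: w for nbr, w in nbrs.items() if w >= min_shared}
        for student, nbrs in graph.items()
    }


if __name__ == "__main__":
    sample_posts = [
        ("ann", "t1"), ("bob", "t1"),
        ("ann", "t2"), ("bob", "t2"), ("cat", "t2"),
        ("dan", "t3"),
    ]
    observed = build_coparticipation_graph(sample_posts)
    significant = filter_edges(observed, min_shared=2)
    print(observed)
    print(significant)
```

In the toy data, only the ann and bob link survives filtering, because they are the only pair who meet in more than one thread.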

Figure 1. Comparison of observed (a; ‘all interactions’) and filtered (b; ‘significant interactions’) communication networks for a MOOC forum. Filtering affects network properties such as modularity score (ie degree of clustering). Colours correspond to the automatically detected interest communities.
One feature of networks that has been studied in many disciplines is their vulnerability to fragmentation when nodes are removed (the Internet, for example, emerged from US Department of Defense research aiming to develop a disruption-resistant network for critical communications). While we aren’t interested in the effect of a missile strike on MOOC exchanges, from an educational perspective it is still useful to ask which “critical set” of learners is mostly responsible for information flow in a communication network, and what would happen to online discussions if these learners were removed. To our knowledge, this is the first time vulnerability of communication networks has been explored in an educational setting.

Network vulnerability is interesting because it indicates how integrated and inclusive the communication flow is. Discussion forums with fleeting participation will have only a very few vocal participants: removing these people from the network will markedly reduce the information flow between the other participants — as the network falls apart, it simply becomes more difficult for information to travel across it via linked nodes. Conversely, forums that encourage repeated engagement and in-depth discussion among participants will have a larger ‘critical set’, with discussion distributed across a wide range of learners.

To understand the structure of group communication in the two courses, we looked at how quickly our modelled communication network fell apart when: (a) the most central nodes were iteratively disconnected (Figure 2; blue), compared with when (b) nodes were removed at random (ie the ‘neutral’ case; green). In the random case, the network degrades evenly, as expected. When we selectively remove the most central nodes, however, we see rapid disintegration: indicating the presence of individuals who are acting as important ‘bridges’ across the network. In other words, the network of student interactions is not random: it has structure.
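The comparison described above (targeted versus random node removal) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure from the paper: "most central" is approximated here by the highest remaining degree, the size of the largest connected component is used as the measure of how intact the network is, and all function names are hypothetical.

```python
import random


def largest_component(adj, removed):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, size)
    return best


def most_connected(adj, removed):
    """Targeted strategy: pick the surviving node with most surviving links.

    Assumption: degree is used as a simple proxy for centrality.
    """
    alive = [v for v in adj if v not in removed]
    return max(alive, key=lambda v: sum(1 for n in adj[v] if n not in removed))


def random_node(rng):
    """Neutral strategy: pick a surviving node uniformly at random."""
    def pick(adj, removed):
        return rng.choice(sorted(v for v in adj if v not in removed))
    return pick


def degradation_curve(adj, pick_next, steps):
    """Fraction of all nodes in the largest component after each removal."""
    removed = set()
    total = len(adj)
    curve = [largest_component(adj, removed) / total]
    for _ in range(steps):
        removed.add(pick_next(adj, removed))
        curve.append(largest_component(adj, removed) / total)
    return curve


if __name__ == "__main__":
    # Two tight triangles joined only through a single 'bridge' learner, h.
    adj = {
        "h": {"a", "b", "c", "d", "e", "f"},
        "a": {"b", "c", "h"}, "b": {"a", "c", "h"}, "c": {"a", "b", "h"},
        "d": {"e", "f", "h"}, "e": {"d", "f", "h"}, "f": {"d", "e", "h"},
    }
    print(degradation_curve(adj, most_connected, 2))
    print(degradation_curve(adj, random_node(random.Random(0)), 2))
```

Removing the bridge node first splits the toy network into two separate triangles in one step, which is the kind of rapid disintegration the targeted case is meant to expose.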

Figure 2. Rapid network degradation results from removal of central nodes (blue). This indicates the presence of individuals acting as ‘bridges’ between sub-groups. Removing these bridges results in rapid degradation of the overall network. Removal of random nodes (green) results in a more gradual degradation.

Of course, the structure of participant interactions will reflect the purpose and design of the particular forum. We can see from Figure 3 that different forums in the courses have different vulnerability thresholds. Forums with high levels of iterative dialogue and knowledge construction — with learners sharing ideas and insights about weekly questions, strategic analyses, or course outcomes — are the least vulnerable to degradation. A relatively high proportion of nodes have to be removed before the network falls apart (rightmost-blue line). Forums where most individuals post once to introduce themselves and then move their discussions to other platforms (such as Facebook) or cease engagement altogether tend to be more vulnerable to degradation (left-most blue line). The different vulnerability thresholds suggest that different topics (and forum functions) promote different levels of forum engagement. Certainly, asking students open-ended questions tended to encourage significant discussions, leading to greater engagement and knowledge construction as they read analyses posted by their peers and commented with additional insights or critiques.

Figure 3 – Network vulnerabilities of different course forums.

Understanding something about the vulnerability of a communication or interaction network is important, because it will tend to affect how information spreads across it. To investigate this, we simulated an information diffusion model similar to that used to model social contagion. Although simplistic, the SI model (‘susceptible-infected’) is very useful in analyzing topological and temporal effects on networked communication systems. While the model doesn’t account for things like decaying interest over time or peer influence, it allows us to compare the efficiency of different network topologies.

We compared our (real-data) network model with a randomized network in order to see how well information would flow if the community structures we observed in Figure 2 did not exist. Figure 4 shows the number of ‘infected’ (or ‘reached’) nodes over time for both the real (solid lines) and randomized networks (dashed lines). In all the forums, we can see that information actually spreads faster in the randomised networks. This is explained by the existence of local community structures in the real-world networks: networks with dense clusters of nodes (i.e. a clumpy network) will result in slower diffusion than a network with a more even distribution of communication, where participants do not tend to favor discussions with a limited cohort of their peers.
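A susceptible-infected simulation of this kind can be sketched as below. This is a simplified illustration, not the simulation code behind Figure 4: the per-contact transmission probability `beta`, the single seed node, and the way the randomized comparison network is generated (same nodes and same number of links, placed uniformly at random) are all assumptions made here, and the function names are hypothetical.

```python
import random


def simulate_si(adj, seed, beta, steps, rng):
    """Run a discrete-time SI process and return the reach at each step.

    At every step, each 'infected' node passes the information to each
    susceptible neighbour with probability beta. Nodes never recover.
    """
    infected = {seed}
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for node in infected:
            for nbr in adj[node]:
                if nbr not in infected and rng.random() < beta:
                    newly.add(nbr)
        infected |= newly
        history.append(len(infected))
    return history


def time_to_half(history, n_nodes):
    """First step at which at least half of all nodes have been reached."""
    for step, count in enumerate(history):
        if 2 * count >= n_nodes:
            return step
    return None  # half the network was never reached in this run


def random_graph_like(adj, rng):
    """Random network with the same nodes and the same number of links.

    Assumption: links are placed uniformly at random, which removes
    community structure but does not preserve each node's degree.
    """
    nodes = sorted(adj)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    new_adj = {v: set() for v in nodes}
    for a, b in rng.sample(pairs, n_edges):
        new_adj[a].add(b)
        new_adj[b].add(a)
    return new_adj


if __name__ == "__main__":
    rng = random.Random(42)
    # Two dense groups of four learners joined by a single link (3-4).
    real = {v: set() for v in range(8)}
    for group in (range(0, 4), range(4, 8)):
        for a in group:
            for b in group:
                if a != b:
                    real[a].add(b)
    real[3].add(4)
    real[4].add(3)
    shuffled = random_graph_like(real, rng)
    for name, graph in (("real", real), ("randomized", shuffled)):
        history = simulate_si(graph, seed=0, beta=0.3, steps=20, rng=rng)
        print(name, history, time_to_half(history, len(graph)))
```

Because the process is stochastic, any real comparison would average over many runs and seed nodes rather than rely on a single trajectory as this toy example does.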

Figure 4 (a) shows the percentage of infected nodes vs. simulation time for different networks. The solid lines show the results for the original network and the dashed lines for the random networks. (b) shows the time it took for a simulated “information packet” to come into contact with half the network’s nodes.

Overall, these results reveal an important characteristic of student discussion in MOOCs: when it comes to significant communication between learners, there are simply too many discussion topics and too much heterogeneity (ie clumpiness) to result in truly global-scale discussion. Instead, most information exchange, and by extension, any knowledge construction in the discussion forums occurs in small, short-lived groups: with information “trapped” in small learner groups. This finding is important as it highlights structural limitations that may impact the ability of MOOCs to facilitate communication amongst learners that look to learn “in the crowd”.

These insights into the communication dynamics motivate a number of important questions about how social learning can be better supported, and facilitated, in MOOCs. They certainly suggest the need to leverage intelligent machine learning algorithms to support the needs of crowd-based learners; for example, in detecting different types of discussion and patterns of engagement during the runtime of a course to help students identify and engage in conversations that promote individualized learning. Without such interventions the current structural limitations of social learning in MOOCs may prevent the realization of a truly global classroom.

The next post addresses qualitative content analysis and how machine-learning community detection schemes can be used to infer latent learner communities from the content of forum posts.

Read the full paper: Gillani, N., Yasseri, T., Eynon, R., and Hjorth, I. (2014) Structural limitations of learning in a crowd – communication vulnerability and information diffusion in MOOCs. Scientific Reports 4.


Rebecca Eynon holds a joint academic post between the Oxford Internet Institute (OII) and the Department of Education at the University of Oxford. Her research focuses on education, learning and inequalities, and she has carried out projects in a range of settings (higher education, schools and the home) and life stages (childhood, adolescence and late adulthood).

]]>
Facebook and the Brave New World of Social Research using Big Data https://ensr.oii.ox.ac.uk/facebook-and-the-brave-new-world-of-social-research-using-big-data/ Mon, 30 Jun 2014 14:01:02 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2752 Reports about the Facebook study ‘Experimental evidence of massive-scale emotional contagion through social networks’ have resulted in something of a media storm. Yet it can be predicted that ultimately this debate will result in the question: so what’s new about companies and academic researchers doing this kind of research to manipulate peoples’ behaviour? Isn’t that what a lot of advertising and marketing research does already – changing peoples’ minds about things? And don’t researchers sometimes deceive subjects in experiments about their behaviour? What’s new?

This way of thinking about the study has a serious defect, because there are three distinct issues raised by this research. The first is the legality of the study, which, as the authors correctly point out, falls within the informed consent that Facebook users give when they sign up to the service. Laws or regulation may be required here to prevent this kind of manipulation, but may also be difficult to frame, since it will be hard to draw a line between this experiment and other forms of manipulating peoples’ responses to media. However, Facebook may not want to lose users, for whom this way of manipulating them via their service may ‘cause anxiety’ (as the first author of the study, Adam Kramer, acknowledged in a blog post response to the outcry). In short, it may be bad for business, and hence Facebook may abandon this kind of research (but we’ll come back to this later). But this – companies using techniques that users don’t like, so they are forced to change course – is not new.

The second issue is academic research ethics. This study was carried out by two academic researchers (the other two authors of the study). In retrospect, it is hard to see how this study would have received approval from an institutional review board (IRB), the board through which an academic institution checks the ethics of studies. Perhaps stricter guidelines are needed here since a) big data research is becoming much more prominent in the social sciences and is often based on social media like Facebook, Twitter, and mobile phone data, and b) much – though not all (consider Wikipedia) – of this research therefore entails close relations with the social media companies who provide access to these data, and the ability to experiment with the platforms, as in this case. Here, again, the ethics of academic research may need to be tightened to provide new guidelines for academic collaboration with commercial platforms. But this is not new either.

The third issue, which is the new and important one, is the increasing power that social research using big data has over our lives. This is of course even more difficult to pin down than the first two points. Where does this power come from? It comes from having access to data of a scale and scope that is a leap or step change from what was available before, and being able to perform computational analysis on these data. This is my definition of ‘big data’ (see note 1), and clearly applies in this case, as in other cases we have documented: almost 700,000 users’ Facebook newsfeeds were changed in order to perform this experiment, and more than 3 million posts containing more than 122 million words were analysed. The result: it was found that more positive words in Facebook Newsfeeds led to more positive posts by users, and the reverse for negative words.

What is important here are the implications of this powerful new knowledge. To be sure, as the authors point out, this was a study that is valuable for social science in showing that emotions may be transmitted online via words, not just in face-to-face situations. But secondly, it also provides Facebook with knowledge that it can use to further manipulate users’ moods; for example, making their moods more positive so that users will come to its – rather than a competitor’s – website. In other words, social science knowledge, produced partly by academic social scientists, enables companies to manipulate peoples’ hearts and minds.

This is not the Orwellian world of the Snowden revelations about phone tapping that have been in the news recently. It’s the Huxleyan Brave New World where companies and governments are able to play with peoples’ minds, and do so in a way whereby users may buy into it: after all, who wouldn’t like to have their experience on Facebook improved in a positive way? And of course that’s Facebook’s reply to criticisms of the study: the motivation of the research is that we’re just trying to improve your experience, as Kramer says in his blogpost response cited above. Similarly, according to The Guardian newspaper, ‘A Facebook spokeswoman said the research…was carried out “to improve our services and to make the content people see on Facebook as relevant and engaging as possible”’. But improving experience and services could also just mean selling more stuff.

This is scary, and academic social scientists should think twice before producing knowledge that supports this kind of impact. But again, we can’t pinpoint this impact without understanding what’s new: big data is a leap in how data can be used to manipulate people in more powerful ways. This point has been lost by those who criticize big data mainly on the grounds of the epistemological conundrums involved (as with boyd and Crawford’s widely cited paper, see note 2). No, it’s precisely because knowledge is more scientific that it enables more manipulation. Hence, we need to identify the point or points at which we should put a stop to sliding down a slippery slope of increasing manipulation of our behaviours. Further, we need to specify when access to big data on a new scale enables research that affects many people without their knowledge, and regulate this type of research.

Which brings us back to the first point: true, Facebook may stop this kind of research, but how would we know? And have academics therefore colluded in research that encourages this kind of insidious use of data? We can only hope for a revolt against this kind of Huxleyan conditioning, but as in Brave New World, perhaps the outlook is rather gloomy in this regard: we may come to like more positive reinforcement of our behaviours online…

Notes

1. Schroeder, R. 2014. ‘Big Data: Towards a More Scientific Social Science and Humanities?’, in Graham, M., and Dutton, W. H. (eds.), Society and the Internet. Oxford: Oxford University Press, pp.164-76.

2. boyd, D. and Crawford, K. (2012). ‘Critical Questions for big data: Provocations for a cultural, technological and scholarly phenomenon’, Information, Communication and Society, 15(5), 662-79.


Professor Ralph Schroeder has interests in virtual environments, social aspects of e-Science, sociology of science and technology, and has written extensively about virtual reality technology. He is a researcher on the OII project Accessing and Using Big Data to Advance Social Science Knowledge, which follows ‘big data’ from its public and private origins through open and closed pathways into the social sciences, and documents and shapes the ways they are being accessed and used to create new knowledge about the social world.

]]>
Past and Emerging Themes in Policy and Internet Studies https://ensr.oii.ox.ac.uk/past-and-emerging-themes-in-policy-and-internet-studies/ Mon, 12 May 2014 09:24:59 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2673
We can’t understand, analyze or make public policy without understanding the technological, social and economic shifts associated with the Internet. Image from the (post-PRISM) “Stop Watching Us” Berlin Demonstration (2013) by mw238.

In the journal’s inaugural issue, founding Editor-in-Chief Helen Margetts outlined what are essentially two central premises behind Policy & Internet’s launch. The first is that “we cannot understand, analyze or make public policy without understanding the technological, social and economic shifts associated with the Internet” (Margetts 2009, 1). It is simply not possible to consider public policy today without some regard for the intertwining of information technologies with everyday life and society. The second premise is that the rise of the Internet is associated with shifts in how policy itself is made. In particular, she proposed that impacts of Internet adoption would be felt in the tools through which policies are effected, and the values that policy processes embody.

The purpose of the Policy and Internet journal was to take up these two challenges: the public policy implications of Internet-related social change, and Internet-related changes in policy processes themselves. In recognition of the inherently multi-disciplinary nature of policy research, the journal is designed to act as a meeting place for all kinds of disciplinary and methodological approaches. Helen predicted that methodological approaches based on large-scale transactional data, network analysis, and experimentation would turn out to be particularly important for policy and Internet studies. Driving the advancement of these methods was therefore the journal’s third purpose. Today, the journal has reached a significant milestone: over one hundred high-quality peer-reviewed articles published. This seems an opportune moment to take stock of what kind of research we have published in practice, and see how it stacks up against the original vision.

At the most general level, the journal’s articles fall into three broad categories: the Internet and public policy (48 articles), the Internet and policy processes (51 articles), and discussion of novel methodologies (10 articles). The first of these categories, “the Internet and public policy,” can be further broken down into a number of subcategories. One of the most prominent of these streams is fundamental rights in a mediated society (11 articles), which focuses particularly on privacy and freedom of expression. Related streams are children and child protection (six articles), copyright and piracy (five articles), and general e-commerce regulation (six articles), including taxation. A recently emerged stream in the journal is hate speech and cybersecurity (four articles). Of course, an enduring research stream is Internet governance, or the regulation of technical infrastructures and economic institutions that constitute the material basis of the Internet (seven articles). In recent years, the research agenda in this stream has been influenced by national policy debates around broadband market competition and network neutrality (Hahn and Singer 2013). Another enduring stream deals with the Internet and public health (eight articles).

Looking specifically at “the Internet and policy processes” category, the largest stream is e-participation, or the role of the Internet in engaging citizens in national and local government policy processes, through methods such as online deliberation, petition platforms, and voting advice applications (18 articles). Two other streams are e-government, or the use of Internet technologies for government service provision (seven articles), and e-politics, or the use of the Internet in mainstream politics, such as election campaigning and communications of the political elite (nine articles). Another stream that has gained pace during recent years is online collective action, or the role of the Internet in activism, ‘clicktivism,’ and protest campaigns (16 articles). Last year the journal published a special issue on online collective action (Calderaro and Kavada 2013), and the next forthcoming issue includes an invited article on digital civics by Ethan Zuckerman, director of MIT’s Center for Civic Media, with commentary from prominent scholars of Internet activism. A trajectory discernible in this stream over the years is a movement from discussing mere potentials towards analyzing real impacts, including critical analyses of the sometimes inflated expectations and “democracy bubbles” created by digital media (Shulman 2009; Karpf 2012; Bryer 2011).

The final category, discussion of novel methodologies, consists of articles that develop, analyze, and reflect critically on methodological innovations in policy and Internet studies. Empirical articles published in the journal have made use of a wide range of conventional and novel research methods, from interviews and surveys to automated content analysis and advanced network analysis methods. But of those articles where methodology is the topic rather than merely the tool, the majority deal with so-called “big data,” or the use of large-scale transactional data sources in research, commerce, and evidence-based public policy (nine articles). The journal recently devoted a special issue to the potentials and pitfalls of big data for public policy (Margetts and Sutcliffe 2013), based on selected contributions to the journal’s 2012 big data conference: Big Data, Big Challenges? In general, the notion of data science and public policy is a growing research theme.

This brief analysis suggests that research published in the journal over the last five years has indeed followed the broad contours of the original vision. The two challenges, namely policy implications of Internet-related social change and Internet-related changes in policy processes, have both been addressed. In particular, research has addressed the implications of the Internet’s increasing role in social and political life. The journal has also furthered the development of new methodologies, especially the use of online network analysis techniques and large-scale transactional data sources (aka ‘big data’).

As expected, authors from a wide range of disciplines have contributed their perspectives to the journal, and engaged with other disciplines, while retaining the rigor of their own specialisms. The geographic scope of the contributions has been truly global, with authors and research contexts from six continents. I am also pleased to note that a characteristic common to all the published articles is polish; this is no doubt in part due to the high level of editorial support that the journal is able to afford to authors, including copyediting. The justifications for the journal’s establishment five years ago have clearly been borne out, so that the journal now performs an important function in fostering and bringing together research on the public policy implications of an increasingly Internet-mediated society.

And what of my own research interests as an editor? In the inaugural editorial, Helen Margetts highlighted work, finance, exchange, and economic themes in general as being among the prominent areas of Internet-related social change that are likely to have significant future policy implications. I think for the most part, these implications remain to be addressed, and this is an area that the journal can encourage authors to tackle better. As an editor, I will work to direct attention to this opportunity, and welcome manuscript submissions on all aspects of Internet-enabled economic change and its policy implications. This work will be kickstarted by the journal’s 2014 conference (26-27 September), which this year focuses on crowdsourcing and online labor.

Our published articles will continue to be highlighted here in the journal’s blog. Launched last year, we believe this blog will help to expand the reach and impact of research published in Policy and Internet to the wider academic and practitioner communities, promote discussion, and increase authors’ citations. After all, publication is only the start of an article’s public life: we want people reading, debating, citing, and offering responses to the research that we, and our excellent reviewers, feel is important, and worth publishing.

Read the full editorial:  Lehdonvirta, V. (2014) Past and Emerging Themes in Policy and Internet Studies. Policy & Internet 6(2): 109-114.

References

Bryer, T.A. (2011) Online Public Engagement in the Obama Administration: Building a Democracy Bubble? Policy & Internet 3 (4).

Calderaro, A. and Kavada, A. (2013) Challenges and Opportunities of Online Collective Action for Policy Change. Policy & Internet 5 (1).

Hahn, R. and Singer, H. (2013) Is the U.S. Government’s Internet Policy Broken? Policy & Internet 5 (3) 340-363.

Karpf, D. (2012) Online Political Mobilization from the Advocacy Group’s Perspective: Looking Beyond Clicktivism. Policy & Internet 2 (4) 7-41.

Margetts, H. (2009) The Internet and Public Policy. Policy & Internet 1 (1).

Margetts, H. and Sutcliffe, D. (2013) Addressing the Policy Challenges and Opportunities of ‘Big Data.’ Policy & Internet 5 (2) 139-146.

Shulman, S.W. (2009) The Case Against Mass E-mails: Perverse Incentives and Low Quality Public Participation in U.S. Federal Rulemaking. Policy & Internet 1 (1) 23-53.

]]>
Mapping collective public opinion in the Russian blogosphere https://ensr.oii.ox.ac.uk/mapping-collective-public-opinion-in-the-russian-blogosphere/ Mon, 10 Feb 2014 11:30:05 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2372
Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and formation of collective public opinion on a wide range of issues. In countries like Russia where the Internet is not technically filtered, but where the traditional media is tightly controlled by the state, they may be particularly important. The Russian language blogosphere counts about 85 million blogs – an amount far beyond the capacities of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia’s educated public in its search of authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of “public opinion” and also to exercise influence.

One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.

Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 “friends” each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as pure public opinion. This “top list” effect may be particularly important in societies (like Russia’s) where popularity lists exert a visible influence on bloggers’ competitive behavior and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.

Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.

There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie by traditional content analysis), and the other involves rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal’s 2.8 million blog accounts, for example, generate 90,000 posts daily.

For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.
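To make the idea of dividing texts into lexically similar groups concrete, here is a minimal sketch in Python. It is not the method used in the article: it swaps in a much simpler greedy grouping by cosine similarity of word counts. The stopword list, the `threshold` value, and the four example posts are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy stopword list; a real study would use a full one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "for", "on", "is", "were"}

def bag_of_words(text):
    """Lowercase, tokenize, and count the non-stopword terms of one post."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_posts(posts, threshold=0.3):
    """Greedy single-pass grouping: attach each post to the most similar
    existing group, or start a new group if none reaches the threshold."""
    groups = []  # each group: {"centroid": Counter, "members": [post indices]}
    for i, post in enumerate(posts):
        vec = bag_of_words(post)
        best, best_sim = None, threshold
        for g in groups:
            sim = cosine(vec, g["centroid"])
            if sim >= best_sim:
                best, best_sim = g, sim
        if best is None:
            groups.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # summed counts stand in for a centroid
            best["members"].append(i)
    return groups

def top_terms(group, n=3):
    """Most frequent terms in a group, left for a researcher to interpret."""
    return [term for term, _ in group["centroid"].most_common(n)]

posts = [
    "Protest in Moscow over election fraud",
    "Election fraud sparks protest across Russia",
    "New football stadium opens for the championship",
    "Championship football final draws record crowd",
]
groups = group_posts(posts)
for g in groups:
    print(g["members"], top_terms(g))
```

The groups come out unlabeled: the code only reports which posts share vocabulary and which terms dominate each group, and the interpretation and naming of each "topic" is still left to the researcher, as described above.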

The computer science literature is flooded with methodological papers on automatic analysis of big textual data. While these methods can’t entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and are the only means to monitor agendas and attitudes across multiple sources, over long periods and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their recentness, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists, and to our knowledge, topic modeling has not previously been applied for the extraction of agendas from blogs in any social science research.

The natural extension of automated topic or issue extraction involves sentiment mining and analysis; as Gonzalez-Bailon, Kaltenbrunner, and Banchs (2012) have pointed out, public opinion doesn’t just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by dividing sentences into positive and negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.
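The simplest of these techniques, a human-coded dictionary of emotive words, can be sketched in a few lines of Python. The two word lists below are hypothetical toy lexicons, not taken from any published dictionary, and the scoring rule (positive count minus negative count per sentence) is an assumption made for illustration.

```python
import re

# Hypothetical toy lexicons; real studies use far larger human-coded
# or algorithmically constructed dictionaries.
POSITIVE = {"fair", "hope", "support", "good", "honest"}
NEGATIVE = {"fraud", "corrupt", "angry", "bad", "unfair"}

def sentence_polarity(sentence):
    """Label a sentence by comparing its counts of positive and negative words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def classify_text(text):
    """Split a text into sentences and label each one."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [(s, sentence_polarity(s)) for s in sentences]

labels = classify_text(
    "The vote was a fraud. We hope for fair elections! Polls close at eight."
)
for sentence, label in labels:
    print(label, "|", sentence)
```

A sketch like this ignores negation ("not fair"), irony, and context, which is one reason dictionary approaches are often combined with, or replaced by, machine learning techniques.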

Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.

Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.

Read the full paper: Olessia Koltsova and Sergei Koltcov (2013) Mapping the public agenda with topic modeling: The case of the Russian LiveJournal. Policy and Internet 5 (2) 207–227.

Also read on this blog: Can text mining help handle the data deluge in public policy analysis? by Aude Bicquelet.

References

González-Bailón, S., A. Kaltenbrunner, and R.E. Banchs. 2012. “Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions,” Human Communication Research 38 (2): 121–43.

]]>
Edit wars! Measuring and mapping society’s most controversial topics https://ensr.oii.ox.ac.uk/edit-wars-measuring-mapping-societys-most-controversial-topics/ Tue, 03 Dec 2013 08:21:43 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2339 Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?

Taha: Yes we did … actually, we have shown that controversy measures based on “controversial” flags are not inclusive at all and although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other’s contributions. We focused specifically on mutual reverts between pairs of editors and we assigned a maturity score to each editor, based on the total volume of their previous contributions. While counting the mutual reverts, we used more weight for those ones committed by/on editors with higher maturity scores; as a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
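The revert-based measure Taha describes could be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: maturity is taken to be an editor's raw count of previous edits, and each mutually reverting pair is weighted by the smaller of the two maturity scores. The editor names and numbers are hypothetical.

```python
from collections import Counter

def maturity(edit_counts, editor):
    """Maturity score: total volume of the editor's previous contributions."""
    return edit_counts.get(editor, 0)

def controversy_score(reverts, edit_counts):
    """Sum a weight over every mutually reverting pair of editors.

    reverts: list of (reverter, reverted) tuples for one article.
    Assumption: the pair weight is the smaller of the two maturity scores,
    so a clash between two experienced editors counts the most.
    """
    directed = Counter(reverts)
    seen = set()
    score = 0
    for a, b in directed:
        pair = frozenset((a, b))
        if a == b or pair in seen:
            continue
        # A revert is "mutual" only if each editor has reverted the other.
        if directed[(b, a)] > 0:
            seen.add(pair)
            score += min(maturity(edit_counts, a), maturity(edit_counts, b))
    return score

# Hypothetical editors and edit counts, for illustration only.
edit_counts = {"alice": 500, "bob": 300, "carol": 5, "dave": 40}
reverts = [
    ("alice", "bob"), ("bob", "alice"),   # mutual, two experienced editors
    ("carol", "dave"), ("dave", "carol"), # mutual, one newcomer
    ("alice", "bob"),                     # repeat of an already counted pair
    ("alice", "dave"),                    # one-way revert, not counted
]
print(controversy_score(reverts, edit_counts))
```

In this toy data the clash between the two experienced editors dominates the score, while the one-way revert contributes nothing, which matches the intuition that a dispute between established editors signals a more serious problem.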

Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?

Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. And when you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia. Even if the controversy is detected and flagged in English Wikipedia, it might not be in the smaller language editions. Our model is of course independent of the size and editorial conventions of different language editions.

Ed: Were there any differences in the way conflicts arose / were resolved in the different language versions?

Taha: We found the main differences to be the topics of controversial articles. Although some topics are globally debated, like religion and politics, there are many topics which are controversial only in a single language edition. This reflects the local preferences and importance assigned to topics by different editorial communities. And then the way editorial wars start and, more importantly, fade to consensus is also different in different language editions. In some languages moderators interfere very soon, while in others the war might go on for a long time without any moderation.

Ed: In general, what were the most controversial topics in each language? And overall?

Taha: Generally, religion, politics, and geographical places like countries and cities (sometimes even villages) are the topics of debates. But each language edition also has its own focus: for example, football in Spanish and Portuguese, animations and TV series in Chinese and Japanese, sex and gender-related topics in Czech, and science and technology topics in French Wikipedia are very often behind editing wars.

Ed: What other quantitative studies of this sort of conflict (ie over knowledge and points of view) are there?

Taha: My favourite work is one by researchers from Barcelona Media Lab. In their paper Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia they provide quantitative evidence that editors interested in political topics identify themselves more significantly as Wikipedians than as political activists, even though they try hard to reflect their opinions and political orientations in the articles they contribute to. And I think that’s the key issue here. While there are lots of debates and editorial wars between editors, at the end what really counts for most of them is Wikipedia as a whole project, and the concept of shared knowledge. It might explain how Wikipedia really works despite all the diversity among its editors.

Ed: How would you like to extend this work?

Taha: Of course some of the controversial topics change over time. While Jesus might stay a controversial figure for a long time, I’m sure the article on President (W) Bush will soon reach a consensus and most likely disappear from the list of the most controversial articles. In the current study we examined the aggregated data from the inception of each Wikipedia-edition up to March 2010. One possible extension that we are working on now is to study the dynamics of these controversy-lists and the positions of topics in them.

Read the full paper: Yasseri, T., Spoerri, A., Graham, M. and Kertész, J. (2014) The most controversial topics in Wikipedia: A multilingual and geographical analysis. In: P.Fichman and N.Hara (eds) Global Wikipedia: International and cross-cultural issues in online collaboration. Scarecrow Press.


Taha was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

]]>
The physics of social science: using big data for real-time predictive modelling https://ensr.oii.ox.ac.uk/physics-of-social-science-using-big-data-for-real-time-predictive-modelling/ Thu, 21 Nov 2013 09:49:27 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2320 Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call “big data” have been available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced in decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, the use of them in different sectors including academia and business has been significant. However, in many cases, the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have been rarely used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box office-prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie. The cost of participation in Wikipedia’s editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia, the editorial activity and the page views relating to our set of 312 movies. The former indicates the popularity of the movie among the Wikipedia editors and the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.). In the next step we combined these data into a linear model that assumes the more popular the movie is, the larger the size of these parameters. However this model needs both training and calibration. We calibrated the model based on the IMDb data on the financial success of a set of ‘training’ movies. After calibration, we applied the model to a set of “test” movies and (luckily) saw that the model worked very well in predicting the financial success of the test movies.
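The calibrate-then-test procedure can be illustrated with a minimal Python sketch. It simplifies the model described above to a single predictor (page views) fitted by ordinary least squares, whereas the study combined several activity measures. All numbers are made up for illustration and are not from the study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new observation."""
    slope, intercept = model
    return slope * x + intercept

# Made-up data: (page views in thousands, box office revenue in $ millions).
training = [(100, 22), (200, 39), (300, 62), (400, 80)]
test = [(250, 52), (500, 98)]

# Calibrate on the training movies only.
model = fit_line([x for x, _ in training], [y for _, y in training])

# Evaluate on movies the model has never seen.
errors = [abs(predict(model, x) - y) for x, y in test]
print("slope, intercept:", model)
print("absolute errors on test movies:", errors)
```

Keeping the test movies out of the calibration step is what makes the check meaningful: a model that only reproduces its own training data says little about whether it can predict anything.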

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.
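Finding the measure that correlates best with financial success amounts to ranking candidate predictors by their correlation with the outcome. The sketch below uses the Pearson coefficient, which is an assumption here since the interview does not name the statistic used. The measure names and all values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up activity measures for five movies, for illustration only.
measures = {
    "page_views": [100, 210, 290, 405, 500],
    "edits": [5, 9, 30, 12, 40],
    "unique_editors": [3, 2, 8, 4, 9],
}
box_office = [10, 20, 30, 40, 50]

# Rank the measures from most to least correlated with box office takings.
ranking = sorted(
    measures, key=lambda name: pearson(measures[name], box_office), reverse=True
)
for name in ranking:
    print(name, round(pearson(measures[name], box_office), 3))
```

Note that nothing here inspects the content of pages or edits: only volumes of activity are compared, which is the point made above about not needing content or sentiment analysis.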

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions on the popularity of movies or movement in financial markets; however, to the best of my knowledge, this is the first time that Wikipedia data have been used to feed the models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. Now I’m running two other projects using a similar approach; one to predict election outcomes and the other one to do opinion mining about the new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes over time as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.
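One family of adaptive models of the kind mentioned here is recursive least squares with a forgetting factor, which re-estimates its parameters with every new observation and gradually discounts old data. The sketch below is one possible illustration, not a method named in the interview. The `forgetting` value, the initial uncertainty, and the synthetic stream with an abrupt change are all assumptions.

```python
def make_adaptive_slope(forgetting=0.8):
    """Recursive least squares for y = w * x with exponential forgetting.

    A forgetting factor below 1 down-weights old observations, so the
    estimate of w can follow a system whose behaviour changes over time.
    """
    # p is the uncertainty about w; a large value means a vague starting guess.
    state = {"w": 0.0, "p": 1000.0}

    def update(x, y):
        gain = state["p"] * x / (forgetting + x * state["p"] * x)
        state["w"] += gain * (y - state["w"] * x)  # correct by prediction error
        state["p"] = (state["p"] - gain * x * state["p"]) / forgetting
        return state["w"]

    return update

update = make_adaptive_slope()

# Made-up stream: the true slope jumps abruptly from 2 to 5 halfway through.
stream = [(x, 2 * x) for x in range(1, 21)] + [(x, 5 * x) for x in range(1, 21)]
estimates = [update(x, y) for x, y in stream]

print("estimate before the change:", round(estimates[19], 3))
print("estimate after the change:", round(estimates[-1], 3))
```

A smaller forgetting factor adapts faster after a change but makes the estimate noisier when the system is stable, so the choice is a trade-off that would need tuning for any real data stream.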

Ed: As a physicist, what are you learning in a social science department? And what does a physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to “make things as simple as possible, but not simpler”. And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way to cosmology. However, studying social systems with the tools of natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I’m learning a lot about the importance of the individual attributes of (and variations between) the elements of the systems under study, outliers, self-awareness, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I have daydreams about the database of search queries from all the Internet users worldwide at the individual level. These data are being collected continuously by search engines and technically could be accessed, but due to privacy policy issues it’s impossible to get hold of them, even if only for research purposes. This is another difference between social systems and natural systems. An atom never gets upset being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.


Taha Yasseri was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

]]>
Five recommendations for maximising the relevance of social science research for public policy-making in the big data era https://ensr.oii.ox.ac.uk/five-recommendations-for-maximising-the-relevance-of-social-science-research-for-public-policy-making-in-the-big-data-era/ https://ensr.oii.ox.ac.uk/five-recommendations-for-maximising-the-relevance-of-social-science-research-for-public-policy-making-in-the-big-data-era/#comments Mon, 04 Nov 2013 10:30:30 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2196 As I discussed in a previous post on the promises and threats of big data for public policy-making, public policy making has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means citizens and governments leave digital traces that can be harvested to generate big data. This increasingly rich data environment poses both promises and threats to policy-makers.

So how can social scientists help policy-makers in this changed environment, ensuring that social science research remains relevant? Social scientists have a good record on having policy influence, indeed in the UK better than other academic fields, including medicine, as recent research from the LSE Public Policy group has shown. Big data hold major promise for social science, which should enable us to further extend our record in policy research. We have access to a cornucopia of data of a kind which is more like that traditionally associated with so-called ‘hard’ science. Rather than being dependent on surveys, the traditional data staple of empirical social science, social media such as Wikipedia, Twitter, Facebook, and Google Search present us with the opportunity to scrape, generate, analyse and archive comparative data of unprecedented quantity. For example, at the OII over the last four years we have been generating a dataset of all petition signing in the UK and US, which contains the joining rate (updated every hour) for the 30,000 petitions created in the last three years. As a political scientist, I am very excited by this kind of data (up to now, we have had big data like this only for voting, and that only at election time), which will allow us to create a complete ecology of petition signing, one of the more popular acts of political participation in the UK. Likewise, we can look at the entire transaction history of online organizations like Wikipedia, or map the link structure of government’s online presence.

But big data holds threats for social scientists too. The technological challenge is ever present. To generate their own big data, researchers and students must learn to code, and for some that is an alien skill. At the OII we run a course on Digital Social Research that all our postgraduate students can take; but not all social science departments could either provide such a course, or persuade their postgraduate students that they needed it. Ours, who study the social science of the Internet, are obviously predisposed to do so. And big data analysis requires multi-disciplinary expertise. Our research team working on petitions data includes a computer scientist (Scott Hale), a physicist (Taha Yasseri) and a political scientist (myself). I can’t imagine doing this sort of research without such technical expertise, and as a multi-disciplinary department we are (reasonably) free to recruit this type of research faculty. But not all social science departments can promise a research career for computer scientists, or physicists, or any of the other disciplinary specialists that might be needed to tackle big data problems.

Five Recommendations for Social Scientists

So, how can social scientists overcome these challenges, and thereby be in a good position to help policy-makers tackle their own barriers to making the most of the possibilities afforded by big data? Here are five recommendations:

Accept that multi-disciplinary research teams are going to become the norm for social science research, extending beyond social science disciplines into the life sciences, mathematics, physics, and engineering. At Policy and Internet’s 2012 Big Data conference, the keynote speaker Duncan Watts (physicist turned sociologist) called for a ‘dating agency’ for engineers and social scientists – with the former providing the technological expertise, and the latter identifying the important research questions. We need to make sure that forums exist where social scientists and technologists meet and discuss big data research at the earliest stages, so that research projects and programmes incorporate the core competencies of both.

We need to provide the normative and ethical basis for policy decisions in the big data era. That means bringing in normative political theorists and philosophers of information into our research teams. The government has committed £65 million to big data research funding, but it seems likely that any successful research proposals will have a strong ethics component embedded in the research programme, rather than an ethics add-on or afterthought.

Training in data science. Many leading US universities are now admitting undergraduates to data science courses, but these courses lack social science input. Of the 20 US masters courses in big data analytics compiled by Information Week, nearly all came from computer science or informatics departments. Social science research training needs to incorporate coding and analysis skills of the kind these courses provide, but with a social science focus. If we as social scientists leave the training to computer scientists, we will find that the new cadre of data scientists tends to leave out social science concerns or questions.

Bringing policy makers and academic researchers together to tackle the challenges that big data present. Last month the OII and Policy and Internet convened a workshop at Harvard on Responsible Research Agendas for Public Policy in the Big Data Era, which included various leading academic researchers in the government and big data field, and government officials from the Census Bureau, the Federal Reserve Board, the Bureau of Labor Statistics, and the Office of Management and Budget (OMB). The discussions revealed that there is a continual procession of major events on big data in Washington DC (usually with a corporate or scientific research focus) to which US federal officials are invited, but also how few were really dedicated to tackling the distinctive issues that face government agencies such as those represented around the table.

Taking forward theoretical development in social science, incorporating big data insights. I recently spoke at the Oxford Analytica Global Horizons conference, at a session on Big Data. One of the few policy-makers (in proportion to corporate representatives) in the audience asked the panel “where is the theory”? As social scientists, we need to respond to that question, and fast.


This post is based on discussions at the workshop on Responsible Research Agendas for Public Policy in the Era of Big Data and at the Political Studies Association’s Why Universities Matter: How Academic Social Science Contributes to Public Policy Impact, held at the LSE on 26 September 2013.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in e-government and digital era governance and politics, investigating the nature and implications of relationships between governments, citizens and the Internet and related digital technologies in the UK and internationally.

]]>
The promises and threats of big data for public policy-making https://ensr.oii.ox.ac.uk/promises-threats-big-data-for-public-policy-making/ https://ensr.oii.ox.ac.uk/promises-threats-big-data-for-public-policy-making/#comments Mon, 28 Oct 2013 15:07:29 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2299 The environment in which public policy is made has entered a period of dramatic change. Widespread use of digital technologies, the Internet and social media means both citizens and governments leave digital traces that can be harvested to generate big data. Policy-making takes place in an increasingly rich data environment, which poses both promises and threats to policy-makers.

On the promise side, such data offers a chance for policy-making and implementation to be more citizen-focused, taking account of citizens’ needs, preferences and actual experience of public services, as recorded on social media platforms. As citizens express policy opinions on social networking sites such as Twitter and Facebook; rate or rank services or agencies on government applications such as NHS Choices; or enter discussions on the burgeoning range of social enterprise and NGO sites, such as Mumsnet, 38 degrees and patientopinion.org, they generate a whole range of data that government agencies might harvest to good use. Policy-makers also have access to a huge range of data on citizens’ actual behaviour, as recorded digitally whenever citizens interact with government administration or undertake some act of civic engagement, such as signing a petition.

Data mined from social media or administrative operations in this way also provide a range of new data which can enable government agencies to monitor – and improve – their own performance, for example through log usage data of their own electronic presence or transactions recorded on internal information systems, which are increasingly interlinked. And they can use data from social media for self-improvement, by understanding what people are saying about government, and which policies, services or providers are attracting negative opinions and complaints, enabling identification of a failing school, hospital or contractor, for example. They can solicit such data via their own sites, or those of social enterprises. And they can find out what people are concerned about or looking for, from the Google Search API or Google trends, which record the search patterns of a huge proportion of internet users.

As for threats, big data is technologically challenging for government, particularly those governments which have always struggled with large-scale information systems and technology projects. The UK government has long been a world leader in this regard and recent events have only consolidated its reputation. Governments have long suffered from information technology skill shortages and the complex skill sets required for big data analytics pose a particularly acute challenge. Even in the corporate sector, over a third of respondents to a recent survey of business technology professionals cited ‘Big data expertise is scarce and expensive’ as their primary concern about using big data software.

And there are particular cultural barriers to government in using social media, with the informal style and blurring of organizational and public-private boundaries which they engender. And gathering data from social media presents legal challenges, as companies like Facebook place barriers to the crawling and scraping of their sites.

More importantly, big data presents new moral and ethical dilemmas to policy makers. For example, it is possible to carry out probabilistic policy-making, where policy is made on the basis of what a small segment of individuals will probably do, rather than what they have done. Predictive policing has had some success, particularly in California, where robberies declined by a quarter after use of the ‘PredPol’ policing software, but can lead to a “feedback loop of injustice” as one privacy advocacy group put it, as policing resources are targeted at increasingly small socio-economic groups. What responsibility does the state have to devote disproportionately more – or less – resources to the education of those school pupils who are, probabilistically, almost certain to drop out of secondary education? Such challenges are greater for governments than corporations. We (reasonably) happily trade privacy to allow Tesco and Facebook to use our data on the basis it will improve their products, but if government tries to use social media to understand citizens and improve its own performance, will it be accused of spying on its citizenry in order to quash potential resistance?

And of course there is an image problem for government in this field – discussion of big data and government puts the word ‘big’ dangerously close to the word ‘government’ and that is an unpopular combination. Policy-makers’ responses to Snowden’s revelations of the US Prism and UK Tempora programmes have done nothing to improve this image, with their focus on the use of big data to track down individuals and groups involved in acts of terrorism and criminality – rather than on anything to make policy-making better, or to use the wealth of information that these programmes collect for the public good.

However, policy-makers have no choice but to tackle some of these challenges. Big data has been the hottest trend in the corporate world for some years now, and commentators from IBM to the New Yorker are starting to talk about the big data ‘backlash’. Government has been far slower to recognize the advantages for policy-making and services. But in some policy sectors, big data poses very fundamental questions which call for an answer: how should governments conduct a census, or produce labour statistics, for example, in the age of big data? Policy-makers will need to move fast to beat the backlash.


This post is based on discussions at the workshop on Responsible Research Agendas for Public Policy in the Era of Big Data.

Helen Margetts is the Director of the OII, and Professor of Society and the Internet. She is a political scientist specialising in digital era governance and politics.

]]>
Can Twitter provide an early warning function for the next pandemic? https://ensr.oii.ox.ac.uk/can-twitter-provide-an-early-warning-function-for-the-next-flu-pandemic/ Mon, 14 Oct 2013 08:00:41 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1241
Communication of risk in any public health emergency is a complex task for healthcare agencies; a task made more challenging when citizens are bombarded with online information. Mexico City, 2009. Image by Eneas.

 

Ed: Could you briefly outline your study?

Patty: We investigated the role of Twitter during the 2009 swine flu pandemic from two perspectives. Firstly, we demonstrated the role of the social network in detecting an upcoming spike in an epidemic before the official surveillance systems – up to a week in the UK and up to 2-3 weeks in the US – by investigating users who “self-diagnosed” by posting tweets such as “I have flu / swine flu”. Secondly, we illustrated how online resources reporting the WHO declaration of a “pandemic” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3].
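
To make the “self-diagnosis” idea concrete, here is a small Python sketch of how such tweets might be picked out and counted per week. It is only an illustration under stated assumptions: the regular expression, the function name and the sample tweets are invented for this example and are not the filters or data used in the study.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical pattern for first-person flu reports such as "I have swine flu".
SELF_REPORT = re.compile(
    r"\bi(?: have|'ve got| got)(?: the)? (?:swine )?flu\b", re.IGNORECASE
)


def weekly_self_reports(tweets):
    """Count self-reporting tweets per ISO (year, week).

    `tweets` is an iterable of (date, text) pairs.
    """
    counts = Counter()
    for day, text in tweets:
        if SELF_REPORT.search(text):
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    (date(2009, 7, 6), "I have swine flu, staying in bed"),
    (date(2009, 7, 7), "Think I've got the flu :("),
    (date(2009, 7, 8), "BBC: WHO declares swine flu pandemic http://example.org"),
    (date(2009, 7, 14), "i have flu again"),
]
counts = weekly_self_reports(sample)
```

The news-sharing tweet is ignored because it mentions swine flu without a first-person report; a real filter would need far more care with negation, jokes and retweets.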

Ed: Disease control agencies already routinely follow media sources; are public health agencies aware of social media as another valuable source of information?

Patty: Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies.

Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services.

Ed: How difficult is this data to process? Eg: is this a full sample, processed in real-time?

Patty: No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we were accessing data from Twitter using a search API interface querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (massive real-time stream of all public tweets) is made available using the streaming API. The searches have to be performed in real-time as historical Twitter data are normally available only through paid services. Twitter analytics methods are diverse; in our study, we used frequency calculations, developed algorithms for geo-location, automatic spam and duplication detection, and applied time series and cross-correlation with surveillance data [1,2,3].
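
The cross-correlation step mentioned above can be sketched in a few lines of Python. This is a generic illustration, not the study’s code: the weekly counts are invented, and the helper names `pearson` and `lagged_correlations` are assumptions made for this example. The idea is to correlate tweet counts in one week with official case counts some weeks later, and to see which lag gives the strongest relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_correlations(tweets, cases, max_lag):
    """Correlate tweets[t] with cases[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        n = len(tweets) - lag
        result[lag] = pearson(tweets[:n], cases[lag:lag + n])
    return result


# Invented weekly counts: official cases echo the tweet signal two weeks later.
tweets = [5, 8, 20, 45, 30, 12, 6, 4, 3, 3]
cases = [2, 2, 5, 8, 20, 45, 30, 12, 6, 4]

correlations = lagged_correlations(tweets, cases, max_lag=3)
best_lag = max(correlations, key=correlations.get)
```

In this toy series the correlation peaks at a lag of two weeks, which is the kind of lead time an early-warning signal would need to show against real surveillance data.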

Ed: What’s the relationship between traditional and social media in terms of diffusion of health information? Do you have a sense that one may be driving the other?

Patty: This is a fundamental question. “Does media coverage of a certain topic cause buzz on social media, or does social media discussion cause a media frenzy?” This was particularly important to investigate for the 2009 swine flu pandemic, which experienced unprecedented media interest. While it could be assumed that disease cases preceded media coverage, or that media discussion sparked public interest causing Twitter debate, neither proved to be the case in our experiment. On some days, media coverage for flu was higher, and on others Twitter discussion was higher; but peaks seemed synchronized – happening on the same days.

Ed: In terms of communicating accurate information, does the Internet make the job easier or more difficult for health authorities?

Patty: The communication of risk in any public health emergency is a complex task for government and healthcare agencies; this task is made more challenging when citizens are bombarded with online information from a variety of sources that vary in accuracy. This has become even more challenging with the increase in users accessing health-related information on their mobile phones (17% in 2010 and 31% in 2012, according to the US Pew Internet study).

Our findings from analyzing Twitter reaction to online media coverage of the WHO declaration of swine flu as a “pandemic” (stage 6) on 11 June 2009, which unquestionably was the most media-covered event during the 2009 epidemic, indicated that Twitter does favour reputable sources (such as the BBC, which was by far the most popular) but also that bogus information can still leak into the network.

Ed: What differences do you see between traditional and social media, in terms of eg bias / error rate of public health-related information?

Patty: Fully understanding the quality of media coverage of health topics such as the 2009 swine flu pandemic in terms of bias and medical accuracy would require a qualitative study (for example, one conducted by Duncan in the EU [4]). However, the main role of social media, in particular Twitter due to the 140-character limit, is to disseminate media coverage by propagating links rather than creating primary health information about a particular event. In our study around 65% of tweets analysed contained a link.

Ed: Google flu trends (which monitors user search terms to estimate worldwide flu activity) has been around a couple of years: where is that going? And how useful is it?

Patty: Search companies such as Google have demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick (Google Flu Trends); however, in 2013 the system “drastically overestimated peak flu levels”, as reported by Nature. Most importantly, unlike Twitter, Google search queries remain proprietary and are therefore not useful for research or the construction of non-commercial applications.

Ed: What are the implications of social media monitoring for countries that may want to suppress information about potential pandemics?

Patty: Event-based surveillance and social media monitoring for epidemic intelligence are of particular importance in countries with sub-optimal surveillance systems and those lacking the capacity for outbreak preparedness and response. User-generated information on social media is also of particular importance in countries with limited freedom of the press or those that actively try to suppress information about potential outbreaks.

Ed: Would it be possible with this data to follow spread geographically, ie from point sources, or is population movement too complex to allow this sort of modelling?

Patty: Spatio-temporal modelling is technically possible, as tweets are time-stamped and there is support for geo-tagging. The location of all tweets can’t be precisely identified, but early warning systems will improve in accuracy as geo-tagging of user-generated content becomes widespread. Mathematical modelling of the spread of diseases and population movements are very topical research challenges (undertaken, for example, by Colizza et al. [5]), but modelling social media user behaviour during health emergencies to provide a robust baseline for early disease detection remains a challenge.

Ed: A strength of monitoring social media is that it follows what people do already (eg search / Tweet / update statuses). Are there any mobile / SNS apps to support collection of epidemic health data? eg a sort of ‘how are you feeling now’ app?

Patty: The strength of early warning systems using social media is exactly in the ability to piggy-back on existing users’ behaviour rather than having to recruit participants. However, there are a growing number of participatory surveillance systems that ask users to provide their symptoms (web-based such as Flusurvey in the UK, and “Flu Near You” in the US that also exists as a mobile app). While interest in self-reporting systems is growing, challenges include their reliability, user recruitment and long-term retention, and integration with public health services; these remain open research questions for the future. There is also a potential for public health services to use social media in two ways – by providing information over the networks rather than only collecting user-generated content. Social media could be used for providing evidence-based advice and personalized health information directly to affected citizens where they need it and when they need it, thus effectively engaging them in active management of their health.

References

[1.] M Szomszor, P Kostkova, C St Louis: Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemics, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, WI-IAT, Vol. 1, pp.320-323.

[2.] Szomszor, M., Kostkova, P., de Quincey, E. (2010). #swineflu: Twitter Predicts Swine Flu Outbreak in 2009. M Szomszor, P Kostkova (Eds.): ehealth 2010, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 69, pages 18-26, 2011.

[3.] Ed de Quincey, Patty Kostkova: Early Warning and Outbreak Detection Using Social Networking Websites: the Potential of Twitter, P Kostkova (Ed.): ehealth 2009, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 27, pages 21-24, 2010.

[4.] B Duncan. How the Media reported the first day of the pandemic (H1N1) 2009: Results of EU-wide Media Analysis. Eurosurveillance, Vol 14, Issue 30, July 2009.

[5.] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Med 4(1): e13. doi:10.1371/journal.pmed.0040013

Further information on this project and related activities, can be found at: BMJ-funded scientific film: http://www.youtube.com/watch?v=_JNogEk-pnM ; Can Twitter predict disease outbreaks? http://www.bmj.com/content/344/bmj.e2353 ; 1st International Workshop on Public Health in the Digital Age: Social Media, Crowdsourcing and Participatory Systems (PHDA 2013): http://www.digitalhealth.ws/ ; Social networks and big data meet public health @ WWW 2013: http://www2013.org/2013/04/25/social-networks-and-big-data-meet-public-health/


Patty Kostkova was talking to blog editor David Sutcliffe.

Dr Patty Kostkova is a Principal Research Associate in eHealth at the Department of Computer Science, University College London (UCL) and held a Research Scientist post at the ISI Foundation in Italy. Until 2012, she was the Head of the City eHealth Research Centre (CeRC) at City University, London, a thriving multidisciplinary research centre with expertise in computer science, information science and public health. In recent years, she was appointed a consultant at WHO responsible for the design and development of information systems for international surveillance.

Researchers who were instrumental in this project include Ed de Quincey, Martin Szomszor and Connie St Louis.

]]>
Who represents the Arab world online? https://ensr.oii.ox.ac.uk/arab-world/ Tue, 01 Oct 2013 07:09:58 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2190
Editors from all over the world have played some part in writing about Egypt; in fact, only 13% of all edits actually originate in the country (38% are from the US). More: Who edits Wikipedia? by Mark Graham.

Ed: In basic terms, what patterns of ‘information geography’ are you seeing in the region?

Mark: The first pattern that we see is that the Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less content than would be expected. Second, of the content that exists, a lot of it is in European languages such as French rather than in Arabic (or Farsi or Hebrew). In other words, there is even less in local languages.

And finally, if we look at contributions (or edits), not only do we see a relatively small number of edits originating in the region, but many of those edits are being used to write about other parts of the world rather than their own region. What this broadly seems to suggest is that the participatory potentials of Wikipedia aren’t yet being harnessed in order to even out the differences between the world’s informational cores and peripheries.

Ed: How closely do these online patterns in representation correlate with regional (offline) patterns in income, education, language, access to technology (etc.)? Can you map one to the other?

Mark: Population and broadband availability alone explain a lot of the variance that we see. Other factors like income and education also play a role, but it is population and broadband that have the greatest explanatory power here. Interestingly, it is most countries in the MENA region that fail to fit well to those predictors.

Ed: How much do you think these patterns result from the systematic imposition of a particular view point – such as official editorial policies – as opposed to the (emergent) outcome of lots of users and editors acting independently?

Mark: Particular modes of governance in Wikipedia likely do play a factor here. The Arabic Wikipedia, for instance, to combat vandalism has a feature whereby changes to articles need to be reviewed before being made public. This alone seems to put off some potential contributors. Guidelines around sourcing in places where there are few secondary sources also likely play a role.

Ed: How much discussion (in the region) is there around this issue? Is this even acknowledged as a fact or problem?

Mark: I think it certainly is recognised as an issue now. But there are few viable alternatives to Wikipedia. Our goal is hopefully to identify problems that lead to solutions, rather than simply discouraging people from even using the platform.

Ed: This work has been covered by the Guardian, Wired, the Huffington Post (etc.). How much interest has there been from the non-Western press or bloggers in the region?

Mark: There has been a lot of coverage from the non-Western press, particularly in Latin America and Asia. However, I haven’t actually seen that much coverage from the MENA region.

Ed: As an academic, do you feel at all personally invested in this, or do you see your role to be simply about the objective documentation and analysis of these patterns?

Mark: I don’t believe there is any such thing as ‘objective documentation.’ All research has particular effects in and on the world, and I think it is important to be aware of the debates, processes, and practices surrounding any research project. Personally, I think Wikipedia is one of humanity’s greatest achievements. No previous single platform or repository of knowledge has ever even come close to Wikipedia in terms of its scale or reach. However, that is all the more reason to critically investigate what exactly is, and isn’t, contained within this fantastic resource. By revealing some of the biases and imbalances in Wikipedia, I hope that we’re doing our bit to improve it.

Ed: What factors do you think would lead to greater representation in the region? For example: is this a matter of voices being actively (or indirectly) excluded, or are they maybe just not all that bothered?

Mark: This is certainly a complicated question. I think the most important step would be to encourage participation from the region, rather than just representation of the region. Some of this involves increasing some of the enabling factors that are the prerequisites for participation; factors like: increasing broadband access, increasing literacy, encouraging more participation from women and minority groups.

Some of it is then changing perceptions around Wikipedia. For instance, many people that we spoke to in the region framed Wikipedia as an American or outside project rather than something that is locally created. Unfortunately we seem to be currently stuck in a vicious cycle in which few people from the region participate, therefore fulfilling the very reason why some people think that they shouldn’t participate. There is also the issue of sources. Not only does Wikipedia require all assertions to be properly sourced, but secondary sources themselves can be a great source of raw informational material for Wikipedia articles. However, if few sources about a place exist, then it adds an additional burden to creating content about that place. Again, a vicious cycle of geographic representation.

My hope is that by both working on some of the necessary conditions to participation, and engaging in a diverse range of initiatives to encourage content generation, we can start to break out of some of these vicious cycles.

Ed: The final moonshot question: How would you like to extend this work; time and money being no object?

Mark: Ideally, I’d like us to better understand the geographies of representation and participation outside of just the MENA region. This would involve mixed-methods (large scale big data approaches combined with in-depth qualitative studies) work focusing on multiple parts of the world. More broadly, I’m trying to build a research program that maintains a focus on a wide range of Internet and information geographies. The goal here is to understand participation and representation through a diverse range of online and offline platforms and practices and to share that work through a range of publicly accessible media: for instance the ‘Atlas of the Internet’ that we’re putting together.


Mark Graham was talking to blog editor David Sutcliffe.

Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

]]>
Responsible research agendas for public policy in the era of big data https://ensr.oii.ox.ac.uk/responsible-research-agendas-for-public-policy-in-the-era-of-big-data/ Thu, 19 Sep 2013 15:17:01 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2164 Last week the OII went to Harvard. Against the backdrop of a gathering storm of interest around the potential of computational social science to contribute to the public good, we sought to bring together leading social science academics with senior government agency staff to discuss its public policy potential. Supported by the OII-edited journal Policy and Internet and its owners, the Washington-based Policy Studies Organization (PSO), this one-day workshop facilitated a thought-provoking conversation between leading big data researchers such as David Lazer, Brooke Foucault-Welles and Sandra Gonzalez-Bailon, e-government experts such as Cary Coglianese, Helen Margetts and Jane Fountain, and senior agency staff from US federal bureaus including Labor Statistics, Census, and the Office of Management and Budget.

It’s often difficult to appreciate the impact of research beyond the ivory tower, but what this productive workshop demonstrated is that policy-makers and academics share many similar hopes and challenges in relation to the exploitation of ‘big data’. Our motivations and approaches may differ, but insofar as the youth of the ‘big data’ concept explains the lack of common language and understanding, there is value in mutual exploration of the issues. Although it’s impossible to do justice to the richness of the day’s interactions, some of the most pertinent and interesting conversations arose around the following four issues.

Managing a diversity of data sources. In a world where our capacity to ask important questions often exceeds the availability of data to answer them, many participants spoke of the difficulties of managing a diversity of data sources. For agency staff this issue comes into sharp focus when available administrative data that is supposed to inform policy formulation is either incomplete or inadequate. Consider, for example, the challenge of regulating an economy in a situation of fundamental data asymmetry, where private sector institutions track, record and analyse every transaction, whilst the state only has access to far more basic performance metrics and accounts. Such asymmetric data practices also affect academic research, where once again private sector tech companies such as Google, Facebook and Twitter often offer access only to portions of their data. In both cases participants gave examples of creative solutions using merged or blended data sources, which raise significant methodological and also ethical difficulties which merit further attention. The Berkman Center’s Rob Faris also noted the challenges of combining ‘intentional’ and ‘found’ data, where the former allow far greater certainty about the circumstances of their collection.

Data dictating the questions. If participants expressed the need to expend more effort on getting the most out of available but diverse data sources, several also cautioned against the dangers of letting data availability dictate the questions that could be asked. As we’ve experienced at the OII, for example, the availability of Wikipedia or Twitter data means that questions of unequal digital access (to political resources, knowledge production etc.) can often be addressed through the lens of these applications or platforms. But these data can provide only a snapshot, and large questions of great social or political importance may not easily be answered through such proxy measurements. Similarly, big data may be very helpful in providing insights into policy-relevant patterns or correlations, such as identifying early indicators of seasonal diseases or neighbourhood decline, but seems ill-suited to answering difficult questions regarding, say, the efficacy of small-scale family interventions. Just because the latter are harder to answer using currently vogue-ish tools doesn’t mean we should cease to ask these questions.

Ethics. Concerns about privacy are frequently raised as a significant limitation of the usefulness of big data. Given that with two or more data sets even supposedly anonymous data subjects may be identified, the general consensus seems to be that ‘privacy is dead’. Whilst all participants recognised the importance of public debate around this issue, several academics and policy-makers expressed a desire to get beyond this discussion to a more nuanced consideration of appropriate ethical standards. Accountability and transparency are often held up as more realistic means of protecting citizens’ interests, but one workshop participant also suggested it would be helpful to encourage more public debate about acceptable and unacceptable uses of our data, to determine whether some uses might simply be deemed ‘off-limits’, whilst other uses could be accepted as offering few risks.
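The re-identification risk mentioned above, where combining two data sets exposes supposedly anonymous data subjects, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (record_id, postcode, birth_year, name), the function link_records and the idea of a public register are hypothetical, and the workshop did not discuss any particular linkage method.

```python
def link_records(anonymous_rows, public_rows, keys):
    """Match 'anonymous' rows to named rows that share the same quasi-identifiers.

    A match is reported only when exactly one named person shares the
    combination of values, which is what makes that person identifiable.
    """
    # Index the named (public) rows by their quasi-identifier values.
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    # Look up each anonymous row; keep it only if the match is unique.
    matches = {}
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:
            matches[row["record_id"]] = names[0]
    return matches
```

Neither data set on its own links a sensitive record to a name; it is the shared combination of otherwise innocuous attributes that does, which is why removing names alone offers limited protection.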

Accountability. Following on from this debate about the ethical limits of our uses of big data, discussion exposed the starkly differing standards to which government and academics (to say nothing of industry) are held accountable. As agency officials noted on several occasions, it matters less what they actually do with citizens’ data than what they are perceived to do with it, or even what it’s feared they might do. One of the greatest hurdles to be overcome here concerns the fundamental complexity of big data research, and the sheer difficulty of communicating to the public how it informs policy decisions. Quite apart from the opacity of the algorithms underlying big data analysis, the explicit focus on correlation rather than causation or explanation presents a new challenge for the justification of policy decisions, and consequently, for public acceptance of their legitimacy. As Greg Elin of Gitmachines emphasised, policy decisions are still the result of explicitly normative political discussion, but the justifiability of such decisions may be rendered more difficult given the nature of the evidence employed.

We could not resolve all these issues over the course of the day, but they served as pivot points for honest and productive discussion amongst the group. If nothing else, they demonstrate the value of interaction between academics and policy-makers in a research field where the stakes are set very high. We plan to reconvene in Washington in the spring.

*We are very grateful to the Policy Studies Organization (PSO) and the American Public University for their generous support of this workshop. The workshop “Responsible Research Agendas for Public Policy in the Era of Big Data” was held at the Harvard Faculty Club on 13 September 2013.

Also read: Big Data and Public Policy Workshop by Eric Meyer, workshop attendee and PI of the OII project Accessing and Using Big Data to Advance Social Science Knowledge.


Victoria Nash received her M.Phil in Politics from Magdalen College in 1996, after completing a First Class BA (Hons) Degree in Politics, Philosophy and Economics, before going on to complete a D.Phil in Politics from Nuffield College, Oxford University in 1999. She was a Research Fellow at the Institute of Public Policy Research prior to joining the OII in 2002. As Research and Policy Fellow at the OII, her work seeks to connect OII research with policy and practice, identifying and communicating the broader implications of OII’s research into Internet and technology use.

]]>
Predicting elections on Twitter: a different way of thinking about the data https://ensr.oii.ox.ac.uk/predicting-elections-on-twitter-a-different-way-of-thinking-about-the-data/ Sun, 04 Aug 2013 11:43:52 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1498
GOP presidential nominee Mitt Romney, centre, waving to crowd, after delivering his acceptance speech on the final night of the 2012 Republican National Convention. Image by NewsHour.

Recently, there has been a lot of interest in the potential of social media as a means to understand public opinion. Driven by an interest in the potential of so-called “big data”, this development has been fuelled by a number of trends. Governments have been keen to create techniques for what they term “horizon scanning”, which broadly means searching for the indications of emerging crises (such as runs on banks or emerging natural disasters) online, and reacting before the problem really develops. Governments around the world are already committing massive resources to developing these techniques. In the private sector, big companies’ interest in brand management has fitted neatly with the potential of social media monitoring. A number of specialised consultancies now claim to be able to monitor and quantify reactions to products, interactions or bad publicity in real time.

It should therefore come as little surprise that, like other research methods before, these new techniques are now crossing over into the competitive political space. Social media monitoring, which in theory can extract information from tweets and Facebook posts and quantify positive and negative public reactions to people, policies and events, has an obvious utility for politicians seeking office. Broadly, the process works like this: vast datasets relating to an election, often running into millions of items, are gathered from social media sites such as Twitter. These data are then analysed using natural language processing software, which automatically identifies qualities relating to candidates or policies and attributes a positive or negative sentiment to each item. Finally, these sentiments and other properties mined from the text are aggregated to produce an overall figure for public reaction on social media.
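The three steps just described (gather items, attribute a sentiment to each, aggregate per candidate) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the tiny word lists stand in for real natural language processing software, and the names POSITIVE, NEGATIVE, score_item and aggregate_sentiment are hypothetical, not taken from any tool used in the studies mentioned here.

```python
from collections import defaultdict

# Toy sentiment lexicon: an assumption for illustration, not a real NLP model.
POSITIVE = {"great", "strong", "win", "love"}
NEGATIVE = {"weak", "gaffe", "lose", "awful"}


def score_item(text):
    """Return +1, -1 or 0 depending on which kind of word dominates the item."""
    words = text.lower().split()
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return (balance > 0) - (balance < 0)


def aggregate_sentiment(items, candidates):
    """Sum item-level sentiment for every candidate named in each item."""
    totals = defaultdict(int)
    for text in items:
        lowered = text.lower()
        for name in candidates:
            if name.lower() in lowered:
                totals[name] += score_item(text)
    return dict(totals)
```

Even this toy version shows why the final figure needs careful interpretation: it counts items, not voters, so a handful of prolific accounts can move the total.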

These techniques have already been employed by the mainstream media to report on the 2010 British general election (when the country had its first leaders’ debate, an event ripe for this kind of research) and also in the 2012 US presidential election. This growing prominence led my co-author Mike Jensen of the University of Canberra and me to ask: exactly how useful are these techniques for predicting election results? In order to answer this question, we carried out a study on the Republican nomination contest in 2012, focused on the Iowa Caucus and Super Tuesday. Our findings are published in the current issue of Policy and Internet.

There are definite merits to this endeavour. US candidate selection contests are notoriously hard to predict with traditional public opinion measurement methods. This is because of the unusual and unpredictable make-up of the electorate. Voters are likely (to greater or lesser degrees depending on circumstances in a particular contest and election laws in the state concerned) to share a broadly similar outlook, so the electorate is harder for pollsters to model. Turnout can also vary greatly from one cycle to the next, adding an additional layer of unpredictability to the proceedings.

However, as any professional opinion pollster will quickly tell you, there is a big problem with trying to predict elections using social media. The people who use it are simply not like the rest of the population. In the case of the US, research from Pew suggests that only 16 per cent of internet users use Twitter, and while that figure goes up to 27 per cent of those aged 18-29, only 2 per cent of over 65s use the site. The proportion of the electorate voting within those categories, however, is the inverse: over 65s vote at a relatively high rate compared to the 18-29 cohort. Furthermore, given that we know (from research such as Matthew Hindman’s The Myth of Digital Democracy) that only a very small proportion of people online actually create content on politics, those who are commenting on elections become an even more unusual subset of the population.

Thus (and I can say this as someone who does use social media to talk about politics!) we are looking at an unrepresentative sub-set (those interested in politics) of an unrepresentative sub-set (those using social media) of the population. This is hardly a good omen for election prediction, which relies on modelling the voting population as closely as possible. As such, it seems foolish to suggest that a simple accumulation of individual preferences can be equated to voting intentions.

However, in our article we suggest a different way of thinking about social media data, more akin to James Surowiecki’s idea of The Wisdom of Crowds. The idea here is that citizens commenting on social media should not be treated like voters, but rather as commentators, seeking to understand and predict emerging political dynamics. As such, the method we operationalized was more akin to an electoral prediction market, such as the Iowa Electronic Markets, than a traditional opinion poll.

We looked for two things in our dataset: sudden changes in the number of mentions of a particular candidate and also words that indicated momentum for a particular candidate, such as “surge”. Our ultimate finding was that this turned out to be a strong predictor. We found that the former measure had a good relationship with Rick Santorum’s sudden surge in the Iowa caucus, although it did also tend to disproportionately emphasise a lot of the less successful candidates, such as Michele Bachmann. The latter method, on the other hand, picked up the Santorum surge without generating false positives, a finding certainly worth further investigation.
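The two signals can be illustrated with a short Python sketch. It is not the method used in the paper: the doubling threshold, the word list beyond “surge”, and the function names mention_spikes and momentum_mentions are all assumptions chosen to keep the example small.

```python
from collections import Counter

# Only "surge" comes from the post; the other words are assumed examples.
MOMENTUM_WORDS = {"surge", "surging", "momentum"}


def mention_spikes(daily_counts, ratio=2.0):
    """Return the indices of days whose mention count is at least
    `ratio` times the previous day's count (an assumed threshold)."""
    spikes = []
    for day in range(1, len(daily_counts)):
        previous, current = daily_counts[day - 1], daily_counts[day]
        if previous > 0 and current >= ratio * previous:
            spikes.append(day)
    return spikes


def momentum_mentions(tweets, candidates):
    """Count tweets that name a candidate and also contain a momentum word."""
    counts = Counter()
    for text in tweets:
        words = set(text.lower().split())
        if words & MOMENTUM_WORDS:
            for name in candidates:
                if name.lower() in words:
                    counts[name] += 1
    return dict(counts)
```

The first function flags any sudden jump in volume, whatever its cause, which is one plausible reason a volume-based measure would also pick up candidates who attract attention without gaining support. The second counts only items in which commentators themselves describe momentum.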

Our aim in the paper was to present new ways of thinking about election prediction through social media, going beyond the paradigm established by the dominance of opinion polling. Our results indicate that there may be some value in this approach.


Read the full paper: Michael J. Jensen and Nick Anstead (2013) Psephological investigations: Tweets, votes, and unknown unknowns in the republican nomination process. Policy and Internet 5 (2) 161–182.

Dr Nick Anstead was appointed as a Lecturer in the LSE’s Department of Media and Communication in September 2010, with a focus on Political Communication. His research focuses on the relationship between existing political institutions and new media, covering such topics as the impact of the Internet on politics and government (especially e-campaigning), electoral competition and political campaigns, the history and future development of political parties, and political mobilisation and encouraging participation in civil society.

Dr Michael Jensen is a Research Fellow at the ANZSOG Institute for Governance (ANZSIG), University of Canberra. His research spans the subdisciplines of political communication, social movements, political participation, and political campaigning and elections. In the last few years, he has worked particularly with the analysis of social media data and other digital artefacts, contributing to the emerging field of computational social science.

]]>
Seeing like a machine: big data and the challenges of measuring Africa’s informal economies https://ensr.oii.ox.ac.uk/seeing-like-a-machine-big-data-and-the-challenges-of-measuring-africas-informal-economies/ Mon, 22 Jul 2013 12:01:11 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1878
The Juba Archives
State research capacity has been weakened since the 1980s. It is now hoped that the ‘big data’ generated by mobile phone use can shed light on African economic and social issues, but we must pay attention to what new technologies are doing to the bigger research environment. Image by Nicki Kindersley.

As Linnet Taylor’s recent post on this blog has argued, researchers are gaining interest in Africa’s big data. Linnet’s excellent post focused on what the profusion of big data might mean for privacy concerns and frameworks for managing personal data. My own research focuses on the implications of big (and open) data on knowledge about Africa; specifically, economic knowledge.

As an introduction, it might be helpful to reflect on the French colonial concepts of l’Afrique utile and l’Afrique inutile (concepts most recently re-invoked by William Reno in 1999 and James Ferguson in 2005). L’Afrique utile, or usable Africa, represented parts of Africa over which private actors felt they could exercise a degree of governance and control, and therefore extract profit. L’Afrique inutile, on the other hand, was the no-go area: places deemed too risky, too opaque and too wild for commercial profit. Until recently, it was difficult to convince multinationals to view Africa as usable and profitable because much economic activity took place in the unaccounted informal economy. With the exception of a few oil, gas and mineral installations and some export commodities like cocoa, cotton, tobacco, rubber, coffee, and tea, multinationals stayed out of the continent. Likewise, within the accounts of national public policy-making institutions, it was only the very narrow formal and recordable parts of the economy that were recorded. In a similar way that economists have traditionally excluded unpaid domestic labour from national accounts, most African states only scratched the surface of their populations’ true economic lives.

The mobile phone has undoubtedly changed the way private companies and public bodies view African economies. Firstly, the mobile phone has demonstrated that Africans can be voracious consumers at the bottom of the pyramid (paving the way for the distribution of other low-cost items such as soap, sanitary pads, soft drinks, etc.). While the colonial scramble for Africa focused on what lay in Africa’s lands and landscapes, the new scramble is focused on its people and markets (and workers; as the growing interest in business process outsourcing demonstrates).

Secondly, mobile phones (and other kinds of information and communication technologies) have created new channels of information about Africans and African markets, particularly in the informal sector. In an era where so much of the apparatus for measuring Africa’s economies has been weakened, this kind of data holds enormous potential. One might say that the mobile phone and the internet have made former parts of l’Afrique inutile into l’Afrique utile: open for business, profit, analysis, and perhaps, control.

The ‘scramble for Africa’s data’ is taking place within a particular historical trajectory of knowledge production. Africa has always been a laboratory for Western scientists and researchers, with local knowledge production often influenced by foreign powers and foreign ideas (think back to the early reliance on primary products for export, to which the entire colonial system of economic measurement and development planning was geared). Within the contemporary context of ever-expanding higher education and dwindling finances for local research, African academics and researchers have been forced to take on more and more consultancies and private contracts.

This ‘extraversion’ of African institutions of higher education has contributed to a re-orientation of the apparatus for academic research towards questions posed from outside. Within state bodies, similar processes are underway. Weakened by corruption, Structural Adjustment Policies (SAP), and pervasive informal economic activity, management of the economy has migrated from state institutions into the better paid offices of NGOs, consultancies and private companies. State capacity to measure and model is presently very weak, and African governments are therefore being encouraged to ‘open’ up their own records to non-state researchers. It is into this research context that big data emerges as a new source of ‘legibility’.

ICTs offer obvious benefits to economic researchers. They have often been heralded as offering potentially more democratic and participatory kinds of ‘legibility’. Their potential partly lies in the way that ICTs activate ‘social networks’ into infrastructures through which external actors can deliver and extract information. This ‘sociability’ makes them particularly suitable for studying informal economic networks. ICTs also offer the potential to modernise existing streams of data collection and broaden intra-institutional coordination, leading to better collaboration and more targeted public policy. In our project on the economic impacts of fibre optic broadband in East Africa, we have seen how institutions such as the Kenya Tea Board and the Rwandan Health Ministry are better integrating their information systems in order to gain a better national picture, and thereby contribute to industrial upgrading in the case of tea or better public services in the case of health. Nevertheless, big data is not accessible to all, and researchers must often prove commercial or strategic value in order to gain access.

Use of ‘big data’ is still a growing field, born within the discipline of computer science. My initial interviews with big data researchers working on Africa indicate they are still figuring out what kinds of questions can be answered with big data and how they might justify themselves and their methodologies to mainstream economics. Big data’s potential for hypothesis-building is somewhat at odds with the tradition of hypothesis-testing in economics. Big data researchers start with the question, ‘Where can this data lead me?’ There is also the question of how restricted access might frame research design. To date, the researchers that have been most successful in gaining access to African big data have worked with private companies, banks and financial institutions. It is therefore the incorporation and integration of poor people into private sector understandings that big data currently seems to offer.

This vision of development fits into a broader trend. Just as Hernando de Soto has argued that development is hampered by the exclusion of poor people from formalised property rights, proponents of microcredit have likewise argued it is the poor’s exclusion from financial institutions that limits their ability to develop self-sustaining enterprises. Researchers are therefore encouraged to use big data to model poor people’s actions and credit worthiness to incorporate them into financial systems, thereby transforming them from invisible selves into visible selves.

Critics of microfinance have cautioned that incorporating poor people into globalised structures of finance makes them more vulnerable to state interference in the form of taxes, and to debt and international financial crises. It is also unclear what the drift into the private sector might do to wider understandings of poverty. While national measures situate citizens as members of national or collective groups, mobile financial innovations often focus on the individual’s financial records and credit worthiness. It remains to be seen whether this change of focus might move us away from more social definitions of poverty towards more individual or private explanations.

Likewise the flow of digital information across geographical space has the potential to change the nature of collaboration. As Mahmood Mamdani has cautioned, “The global market tends to relegate Africa to providing raw material (“data”) to outside academics who process it and then re-export their theories back to Africa. Research proposals are increasingly descriptive accounts of data collection and the methods used to collate data, collaboration is reduced to assistance, and there is a general impoverishment of theory and debate”. This problem could potentially be exacerbated by open data initiatives that seek to get more people using publicly collected data. As Morten Jerven writes in his recent book, Poor Numbers, interactions between African data producers and users are currently limited, with users often unable to effectively assess the source and methods used to collect the original data. Nevertheless, such numbers are often taken at face value, with dubious policy recommendations formed as a result. While multiple sources of data (from the public and private sector) can help increase the precision of research and lead to better conclusions, we do not understand how big data (and open data) will impact the overall research environment in Africa.

My next project will examine these issues in relation to economic studies of unemployment in Egypt and financial inclusion in Uganda. The key objectives will be to improve our understanding of how data is being collected, how data is being communicated across groups and within systems, how new models of the economy are being formed, and what these changes are doing to political and economic relationships on the ground. Specifically, the project poses six interrelated questions: Where is economic intelligence and expertise currently located? What is being measured by whom, and how, and why? How do different tools of measurement change the way researchers understand economic truth and construct their models? How does more ‘legibility’ over African economies change power relations? What resistance or critical thinking exists within these new configurations of expertise? How can we combine approaches to assemble a fuller picture of economic understanding? The project will emphasise how economics, as a discipline, does not merely measure external reality, but helps to shape and influence that reality.

How we measure economies matters, particularly in the context of ever increasing evidence-based policy-making and with increasing interest from the private sector in Africa. Measurement often changes and shapes our realities of the external world. As Timothy Mitchell writes: “the practices that form the economy operate, in part, to establish equivalences, contain circulations, identify social actors or agents, make quantities and performances measurable, and designate relations of control and command”. In other words, researchers cannot make sense of an economy without first establishing a research infrastructure through which subjects are measured and incorporated. The particular shape, tools and technologies of that research infrastructure help frame and construct economic models and truth.

Such frames also have political implications, as control over information often strengthens one group over others. Indeed, as James C. Scott’s work Seeing Like a State has shown, the struggle to establish legibility over societies is inherently political. Elites have always attempted to standardise and regularise more marginal groups in an effort to draw them into dominant political and economic orders. However, legibility does not have to be ‘top-down’. Weaker groups suffer most from illegible societies, and can benefit from more legibility. As information and trust become more deeply embedded within stronger ties and within transnational networks of skill and expertise, marginalised ‘out groups’ are particularly disadvantaged.

While James C. Scott’s work highlighted the dangers of a high modernist ‘legibility’, the very absence of legibility can also disempower marginal groups. It is the kind of legibility at stake that is important. While big data offers enormous potential for economists to better understand what is going on in Africa’s informal economies, economic sociologists, anthropologists and historians must remind them how our tools and measurements influence systems of knowledge production and change our understandings and beliefs about the external world. Africa might be becoming ‘more usable’ and ‘more legible,’ but we need to ask, for whom, by whom, and for what purpose?


Dr Laura Mann is a Postdoctoral Researcher at the Oxford Internet Institute, University of Oxford. Her research focuses on the political economy of markets and value chains in Africa. Her current research examines the effects of broadband internet on the tea, tourism and outsourcing value chains of Kenya and Rwanda. From January 2014 she will be based at the African Studies Centre at Leiden University. Read Laura’s blog.

]]>
The scramble for Africa’s data https://ensr.oii.ox.ac.uk/the-scramble-for-africas-data/ https://ensr.oii.ox.ac.uk/the-scramble-for-africas-data/#comments Mon, 08 Jul 2013 09:21:02 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1230 Mobile phone advert in Zomba, Malawi
Africa is in the midst of a technological revolution, and the current wave of digitisation has the potential to make the continent’s citizens a rich mine of data. Intersection in Zomba, Malawi. Image by john.duffell.

After the last decade’s exponential rise in ICT use, Africa is fast becoming a source of big data. Africans are increasingly emitting digital information with their mobile phone calls, internet use and various forms of digitised transactions, while on a state level e-government starts to become a reality. As Africa goes digital, the challenge for policymakers becomes what the WRR, a Dutch policy organisation, has identified as ‘i-government’: moving from digitisation to managing and curating digital data in ways that keep people’s identities and activities secure.

On one level, this is an important development for African policymakers, given that accurate information on their populations has been notoriously hard to come by and, where it exists, has not been shared. On another, however, it represents a tremendous challenge. The WRR has pointed out the unpreparedness of European governments, who have been digitising for decades, for the age of i-government. How are African policymakers, as relative newcomers to digital data, supposed to respond?

There are two possible scenarios. One is that systems will develop for the release and curation of Africans’ data by corporations and governments, and that it will become possible, in the words of the UN’s Global Pulse initiative, to use it as a ‘public good’ – an invaluable tool for development policies and crisis response. The other is that there will be a new scramble for Africa: a digital resource grab that may have implications as great as the original scramble amongst the colonial powers in the late 19th century.

We know that African data is not only valuable to Africans. The current wave of digitisation has the potential to make the continent’s citizens a rich mine of data about health interventions, human mobility, conflict and violence, technology adoption, communication dynamics and financial behaviour, with the default mode being for this to happen without their consent or involvement, and without ethical and normative frameworks to ensure data protection or to weigh the risks against the benefits. Orange’s recent release of call data from Côte d’Ivoire illustrates both the emerging potential of African digital data and the difficulty of understanding the anonymisation and ethical challenges that it poses.

I have heard various arguments as to why data protection is not a problem for Africans. One is that people in African countries don’t care about their privacy because they live in a ‘collective society’. (Whatever that means.) Another is that they don’t yet have any privacy to protect because they are still disconnected from the kinds of system that make data privacy important. Another, more convincing and evidence-based, argument is that the ends may justify the means (as made by the ICRC in a thoughtful post by Patrick Meier about data privacy in crisis situations), and that if significant benefits can be delivered using African big data these outweigh potential or future threats to privacy. The same argument is being made by Global Pulse, a UN initiative which aims to convince corporations to release data on developing countries as a public good for use in devising development interventions.

There are three main questions: what can incentivise African countries’ citizens and policymakers to address privacy in parallel with the collection of massive amounts of personal data, rather than after abuses occur? What are the models that might be useful in devising privacy frameworks for groups with restricted technological access and sophistication? And finally, how can such a system be participatory enough to be relevant to the needs of particular countries or populations?

Regarding the first question, this may be a lost cause. The WRR’s i-government work suggests that only public pressure due to highly publicised breaches of data security may spur policymakers to act. The answer to the second question is being pursued, among others, by John Clippinger and Alex Pentland at MIT (with their work on the social stack); by the World Economic Forum, which is thinking about the kinds of rules that should govern personal data worldwide; by the aforementioned Global Pulse, which has a strong interest in building frameworks which make it safe for corporations to share people’s data; by Microsoft, which is doing some serious thinking about differential privacy for large datasets; by independent researchers such as Patrick Meier, who is looking at how crowdsourced data about crises and human rights abuses should be handled; and by the Oxford Internet Institute’s new M-Data project which is devising privacy guidelines for collecting and using mobile connectivity data.

Regarding the last question, participatory systems will require activists, scientists and policymakers in African countries to build them. To be relevant, they will also need to be made enforceable, which may be an even greater challenge. Privacy frameworks are only useful if they are made a living part of both governance and citizenship: there must be the institutional power to hold offenders accountable (in this case extremely large and powerful corporations, governments and international institutions), and awareness amongst ordinary people about the existence and use of their data. This, of course, has not really been achieved in developed countries, so doing it in Africa may not exactly be a piece of cake.

Notwithstanding these challenges, the region offers an opportunity to push researchers and policymakers – local and worldwide – to think clearly about the risks and benefits of big data, and to make solutions workable, enforceable and accessible. In terms of data privacy, if it works in Burkina Faso, it will probably work in New York, but the reverse is unlikely to be true. This makes a strong argument for figuring it out in Burkina Faso.

Some may contend that this discussion only points out the massive holes in the governance of technology that prevail in Africa – and in fact a whole other level of problems regarding accountability and power asymmetries. My response: Yes. Absolutely.


Linnet Taylor’s research focuses on social and economic aspects of the diffusion of the internet in Africa, and human mobility as a factor in technology adoption (read her blog). Her doctoral research was on Ghana, where she looked at mobility’s influence on the formation and viability of internet cafes in poor and remote areas, networking amongst Ghanaian technology professionals and ICT4D policy. At the OII she works on a Sloan Foundation-funded project on Accessing and Using Big Data to Advance Social Science Knowledge.

]]>
https://ensr.oii.ox.ac.uk/the-scramble-for-africas-data/feed/ 1
Uncovering the structure of online child exploitation networks https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/ https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/#comments Thu, 07 Feb 2013 10:11:17 +0000 http://blogs.oii.ox.ac.uk/policy/?p=661 The Internet has provided the social, individual, and technological circumstances needed for child pornography to flourish. Sex offenders have been able to utilize the Internet for dissemination of child pornographic content, for social networking with other pedophiles through chatrooms and newsgroups, and for sexual communication with children. A 2009 United Nations estimate puts the number of websites containing child pornography at more than four million, with 35 percent of them depicting serious sexual assault [1]. Even if this report or others exaggerate the true prevalence of those websites by a wide margin, such websites remain pervasive on the World Wide Web.

Despite large investments of law enforcement resources, online child exploitation is nowhere near under control, and while there are numerous technological products to aid in finding child pornography online, they still require substantial human intervention. Even so, steps can be taken to automate more of these searches, reducing the amount of content police officers have to examine and increasing the time they can spend on investigating individuals.

While law enforcement agencies will aim for maximum disruption of online child exploitation networks by targeting the most connected players, there is a general lack of research on the structural nature of these networks. We aimed to address this in our study by developing a method to extract child exploitation networks, map their structure, and analyze their content. Our custom-written Child Exploitation Network Extractor (CENE) automatically crawls the Web from a user-specified seed page, collecting information about the pages it visits by recursively following the links out of the page; the result of the crawl is a network structure containing information about the content of the websites, and the linkages between them [2].
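
As a rough illustration of the crawl-and-link idea only, the sketch below walks a tiny in-memory link map from a seed page and records the directed links it finds. It is not CENE: the function names, the `TOY_WEB` data and the `max_pages` cap are hypothetical, it uses an iterative breadth-first traversal rather than recursion, and it performs no network requests.

```python
from collections import deque


def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl from `seed`.

    `get_links(page)` is a caller-supplied function returning the pages
    that `page` links to. Returns (visited pages, set of directed edges).
    """
    visited = {seed}
    queue = deque([seed])
    edges = set()
    while queue:
        page = queue.popleft()
        for target in get_links(page):
            edges.add((page, target))  # record the linkage even if not followed
            if target not in visited and len(visited) < max_pages:
                visited.add(target)
                queue.append(target)
    return visited, edges


# Hypothetical in-memory link map standing in for real pages (an assumption).
TOY_WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}

pages, links = crawl("a", lambda page: TOY_WEB.get(page, []))
```

The returned edge set is the kind of network structure that later analysis steps, such as ranking nodes by connectivity, would operate on.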

We chose ten websites as starting points for the crawls; four were selected from a list of known child pornography websites while the other six were selected and verified through Google searches using child pornography search terms. To guide the network extraction process we defined a set of 63 keywords, which included words commonly used by the Royal Canadian Mounted Police to find illegal content, most of them code words used by pedophiles. Websites included in the analysis had to contain at least seven of the 63 unique keywords on a given web page; manual verification showed us that a threshold of seven keywords distinguished well between child exploitation web pages and regular web pages. Ten sports networks were analyzed as a control.
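
The inclusion rule described above (at least seven of the 63 unique keywords on a page) can be sketched as a simple set intersection. This is a minimal sketch, not the study's code: the keyword list is a set of neutral placeholders (`kw01` to `kw63`), the tokenizer is an assumption, and it treats every keyword as a single word.

```python
import re


def count_unique_keywords(text, keywords):
    """Count how many distinct keywords appear in `text` (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & keywords)


def page_qualifies(text, keywords, threshold=7):
    """Apply the inclusion rule: at least `threshold` unique keywords on the page."""
    return count_unique_keywords(text, keywords) >= threshold


# 63 placeholder keywords; the real list is not reproduced here (assumption).
KEYWORDS = {f"kw{i:02d}" for i in range(1, 64)}

sample = "kw01 kw02 kw03 KW04 kw05, kw06; kw07 kw07 unrelated"
included = page_qualifies(sample, KEYWORDS)
```

Counting unique keywords rather than total occurrences means that a page repeating one term many times does not pass the threshold on that basis alone.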

The web crawler was found to be able to properly identify child exploitation websites, with a clear difference found in the hardcore content hosted by child exploitation and non-child exploitation websites. Our results further suggest that a ‘network capital’ measure — which takes into account network connectivity, as well as severity of content — could aid in identifying the key players within online child exploitation networks. These websites are the main concern of law enforcement agencies, making the web crawler a time saving tool in target prioritization exercises. Interestingly, while one might assume that website owners would find ways to avoid detection by a web crawler of the type we have used, these websites — despite the fact that much of the content is illegal — turned out to be easy to find. This fits with previous research that has found that only 20-25 percent of online child pornography arrestees used sophisticated tools for hiding illegal content [3,4].
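
The post does not give the formula behind the ‘network capital’ measure, so the following is only a hypothetical stand-in showing how connectivity and severity of content might be combined into a single ranking. The equal weighting, the use of simple degree as the connectivity measure, and the toy data are all assumptions.

```python
def network_capital(edges, severity, weight=0.5):
    """Hypothetical score: weighted mix of normalised degree and severity.

    `edges` is an iterable of (source, target) pairs; `severity` maps each
    node to a non-negative content-severity value. Not the study's formula.
    """
    degree = {node: 0 for node in severity}
    for src, dst in edges:
        if src in degree:
            degree[src] += 1
        if dst in degree:
            degree[dst] += 1
    # Avoid division by zero when all degrees or severities are zero.
    max_degree = max(degree.values(), default=0) or 1
    max_severity = max(severity.values(), default=0) or 1
    return {
        node: weight * degree[node] / max_degree
        + (1 - weight) * severity[node] / max_severity
        for node in severity
    }


# Toy network and severity values, invented for illustration only.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
toy_severity = {"a": 1.0, "b": 4.0, "c": 2.0, "d": 0.0}

scores = network_capital(toy_edges, toy_severity)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case the most connected node is not the top-ranked one once severity is taken into account, which is the kind of difference a combined measure is meant to surface.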

As mentioned earlier, the huge amount of content found on the Internet means that the likelihood of eradicating the problem of online child exploitation is nil. As the decentralized nature of the Internet makes combating child exploitation difficult, it becomes more important to introduce new methods to address this. Social network analysis measurements, in general, can be of great assistance to law enforcement investigating all forms of online crime, including online child exploitation. By creating a web crawler that reduces the number of hours officers need to spend examining possible child pornography websites, and determining whom to target, we believe that we have touched on a method to maximize the current efforts by law enforcement. An automated process has the added benefit of helping to keep officers in the department longer, as they would not be subjected to as much traumatic content.

There are still areas for further research, the first step being to refine the web crawler further. Although it is a considerable improvement over a manual analysis of 300,000 web pages, it could be improved to allow for efficient analysis of larger networks, bringing us closer to the true size of the full online child exploitation network and also, we expect, to some of the more hidden (e.g., password- or membership-protected) websites. This does not negate the value of researching publicly accessible websites, given that they may be used as starting locations for most individuals.

Much of the law enforcement effort to date has focused on investigating images, primarily because databases of hash values (used to authenticate the content) exist for images and not for videos. Our web crawler did not distinguish between the image content, but utilizing known hash values would help improve the validity of our severity measurement. Although it would be naïve to suggest that online child exploitation can be completely eradicated, the sorts of social network analysis methods described in our study provide a means of understanding the structure (and therefore key vulnerabilities) of online networks, which in turn can greatly improve the effectiveness of law enforcement.

[1] Engeler, E. 2009. September 16. UN Expert: Child Porn on Internet Increases. The Associated Press.

[2] Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

[3] Carr, J. 2004. Child Abuse, Child Pornography and the Internet. London: NCH.

[4] Wolak, J., D. Finkelhor, and K.J. Mitchell. 2005. “Child Pornography Possessors Arrested in Internet-Related Crimes: Findings from the National Juvenile Online Victimization Study (NCMEC 06–05–023).” Alexandria, VA: National Center for Missing and Exploited Children.


Read the full paper: Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

]]>
https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/feed/ 2