Methods – The Policy and Internet Blog (https://ensr.oii.ox.ac.uk): Understanding public policy online

P-values are widely used in the social sciences, but often misunderstood: and that’s a problem. (Mon, 7 Mar 2016)

P-values are widely used in the social sciences, especially ‘big data’ studies, to calculate statistical significance. Yet they are widely criticized for being easily hacked, and for not telling us what we want to know. Many have argued that, as a result, research is wrong far more often than we realize. In their recent article “P-values: Misunderstood and Misused”, OII Research Fellow Taha Yasseri and doctoral student Bertie Vidgen argue that we need to make standards for interpreting p-values more stringent, and also improve transparency in the academic reporting process, if we are to maximise the value of statistical analysis.

“Significant”: an illustration of selective reporting and statistical significance from XKCD. Available online at http://xkcd.com/882/

In an unprecedented move, the American Statistical Association recently released a statement (7 March 2016) warning against how p-values are currently used. This reflects a growing concern in academic circles that while a lot of attention is paid to the huge impact of big data and algorithmic decision-making, considerably less focus is given to the crucial role played by statistics in enabling effective analysis of big data sets, and in making sense of the complex relationships contained within them. Much as datafication has created huge social opportunities, it has also brought to the fore many problems and limitations of current statistical practices. In particular, the deluge of data has made it crucial that we can work out whether studies are ‘significant’. In our paper, published three days before the ASA’s statement, we argued that the most commonly used tool in the social sciences for calculating significance – the p-value – is misused, misunderstood and, most importantly, doesn’t tell us what we want to know.

The basic problem of ‘significance’ is straightforward: it is impractical to repeat an experiment an infinite number of times to make sure that what we observe is “universal”. The same applies to our sample size: we are often unable to analyse a “whole population” sample and so have to generalize from our observations on a sample of limited size to the whole population. The obvious problem here is that what we observe is based on a limited number of experiments (sometimes only one experiment) and on a limited sample, and as such could have been generated by chance rather than by an underlying universal mechanism! We might find it impossible to make the same observation if we were to replicate the same experiment multiple times or analyse a larger sample. If this is the case then we will mischaracterise what is happening – which is a really big problem given the growing importance of ‘evidence-based’ public policy. If our evidence is faulty or unreliable then we will create policies, or intervene in social settings, in an equally faulty way.

The way that social scientists have got round this problem (that samples might not be representative of the population) is the ‘p-value’. The p-value tells you the probability of making an observation at least as extreme as yours, in a sample of the same size and in the same number of experiments, by pure chance. In other words, it tells you how likely it is that you would see the same relationship between X and Y even if no relationship existed between them. On the face of it this is pretty useful, and in the social sciences we normally say that a p-value below 0.05 (a 1-in-20 chance) means the results are significant. Yet as the American Statistical Association has just noted, even though p-values are incredibly widespread, many researchers misinterpret what they really mean.
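To see what a p-value does (and does not) tell you, consider a simulation in which no real effect exists by construction: we repeatedly flip a fair coin and test whether it is biased. This sketch is illustrative only (it is not from the paper, and the helper function and numbers are our own):

```python
import math
import random

def binom_p_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: 2 * min(tail probabilities), capped at 1."""
    def cdf(x):
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x + 1))
    lower = cdf(k)               # P(X <= k)
    upper = 1.0 - cdf(k - 1)     # P(X >= k)
    return min(1.0, 2 * min(lower, upper))

random.seed(42)
n_flips, n_experiments = 100, 2000
false_positives = 0
for _ in range(n_experiments):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))  # a fair coin: no real effect
    if binom_p_two_sided(heads, n_flips) < 0.05:
        false_positives += 1

rate = false_positives / n_experiments
print(f"'Significant' results despite no real effect: {rate:.1%}")
```

Even though every coin is fair, a few percent of the 2,000 experiments cross the p < 0.05 threshold: exactly the kind of chance finding that replication, or a stricter threshold, would catch.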

In our paper we argued that p-values are misunderstood and misused because people think the p-value tells you much more than it really does. In particular, people think the p-value tells you (i) how likely it is that a relationship between X and Y really exists and (ii) the percentage of all findings that are false (which is actually something different called the False Discovery Rate). As a result, we are far too confident that academic studies are correct. Some commentators have argued that at least 30% of studies are wrong because of problems related to p-values: a huge figure. One of the main problems is that p-values can be ‘hacked’ and as such easily manipulated to show significance when none exists.
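The gap between the p-value and the False Discovery Rate can be made concrete with a back-of-the-envelope calculation (the prior and power figures below are illustrative assumptions, not taken from the paper):

```python
# A back-of-the-envelope False Discovery Rate calculation (illustrative numbers).
alpha = 0.05   # significance threshold: chance a null effect looks 'significant'
power = 0.8    # chance that a real effect is detected
prior = 0.1    # assumed share of tested hypotheses that are actually true

true_positives = power * prior          # real effects correctly detected
false_positives = alpha * (1 - prior)   # null effects crossing p < 0.05 by chance
fdr = false_positives / (false_positives + true_positives)
print(f"Share of 'significant' findings that are actually false: {fdr:.0%}")
```

Under these assumptions more than a third of ‘significant’ findings would be false, even with everything done honestly at p < 0.05, which is consistent with claims that at least 30% of published results are wrong.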

If we are going to base public policy (and as such public funding) on ‘evidence’ then we need to make sure that the evidence used is reliable. P-values need to be used far more rigorously, with significance levels of 0.01 or 0.001 seen as standard. We also need to start being more open and transparent about how results are recorded. There is a fine line between data exploration (a legitimate academic exercise) and ‘data dredging’ (where results are manipulated in order to find something noteworthy). Only if researchers are honest about what they are doing will we be able to maximise the potential benefits offered by Big Data. Luckily there are some great initiatives – like the Open Science Framework – which improve transparency around the research process, and we fully endorse researchers making use of these platforms.

Scientific knowledge advances through corroboration and incremental progress, and it is crucial that we use and interpret statistics appropriately to ensure this progress continues. As our knowledge and use of big data methods increase, we need to ensure that our statistical tools keep pace.

Read the full paper: Vidgen, B. and Yasseri, T., (2016) P-values: Misunderstood and Misused, Frontiers in Physics, 4:6. http://dx.doi.org/10.3389/fphy.2016.00006


Bertie Vidgen is a doctoral student at the Oxford Internet Institute researching far-right extremism in online contexts. He is supervised by Dr Taha Yasseri, a research fellow at the Oxford Internet Institute interested in how Big Data can be used to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

Facts and figures or prayers and hugs: how people with different health conditions support each other online (Mon, 7 Mar 2016)

Online support groups are being used increasingly by individuals who suffer from a wide range of medical conditions. OII DPhil student Ulrike Deetjen’s recent article with John Powell, “Informational and emotional elements in online support groups: a Bayesian approach to large-scale content analysis”, uses machine learning to examine the role of online support groups in the healthcare process. They categorise 40,000 online posts from one of the most well-used forums to show how users with different conditions receive different types of support.

Online forums are an important means for people living with health conditions to obtain both emotional and informational support from others in a similar situation. Pictured: The Alzheimer Society of B.C. unveiled three life-size ice sculptures depicting important moments in life. The ice sculptures will melt, representing the fading of life memories on the dementia journey. Image: bcgovphotos (Flickr)

Online support groups are one of the major ways in which the Internet has fundamentally changed how people experience health and health care. They provide a platform for health discussions formerly restricted by time and place, enable individuals to connect with others in similar situations, and facilitate open, anonymous communication.

Previous studies have identified that individuals primarily obtain two kinds of support from online support groups: informational (for example, advice on treatments, medication, symptom relief, and diet) and emotional (for example, receiving encouragement, being told they are in others’ prayers, receiving “hugs”, or being told that they are not alone). However, existing research has been limited as it has often used hand-coded qualitative approaches to contrast both forms of support, thereby only examining relatively few posts (<1,000) for one or two conditions.

In contrast, our research employed a machine-learning approach suitable for uncovering patterns in “big data”. Using this method a computer (which initially has no knowledge of online support groups) is given examples of informational and emotional posts (2,000 examples in our study). It then “learns” what words are associated with each category (emotional: prayers, sorry, hugs, glad, thoughts, deal, welcome, thank, god, loved, strength, alone, support, wonderful, sending; informational: effects, started, weight, blood, eating, drink, dose, night, recently, taking, side, using, twice, meal). The computer then uses this knowledge to assess new posts, and decide whether they contain more emotional or informational support.
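The learning step described above can be sketched as a small multinomial Naive Bayes classifier. This is a from-scratch illustration in the spirit of the paper’s Bayesian approach, not the authors’ code: the toy posts and helper names are invented, and the real study used 2,000 labelled examples rather than four.

```python
import math
from collections import Counter, defaultdict

# Toy labelled posts standing in for the study's 2,000 training examples.
train = [
    ("sending hugs and prayers you are not alone", "emotional"),
    ("glad you found support stay strong wonderful thoughts", "emotional"),
    ("started a new dose twice daily watch for side effects", "informational"),
    ("taking the medication with a meal helps blood sugar", "informational"),
]

# "Learning": count how often each word appears in each category.
word_counts = defaultdict(Counter)
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the category with the highest posterior (Laplace-smoothed)."""
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("sorry to hear that sending thoughts and hugs"))    # emotional
print(classify("what dose of medication helps with side effects")) # informational
```

New posts are scored against each category’s word frequencies, so a post full of “hugs” and “thoughts” lands in the emotional category even when it contains words the classifier has never seen.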

With this approach we were able to determine the emotional or informational content of 40,000 posts across 14 different health conditions (breast cancer, prostate cancer, lung cancer, depression, schizophrenia, Alzheimer’s disease, multiple sclerosis, cystic fibrosis, fibromyalgia, heart failure, diabetes type 2, irritable bowel syndrome, asthma, and chronic obstructive pulmonary disease) on the international support group forum Dailystrength.org.

Our research revealed a slight overall tendency towards emotional posts (58% of posts were emotionally oriented). Across all diseases, those who write more also tend to write more emotional posts—we assume that as people become more involved and build relationships with other users they tend to provide more emotional support, instead of simply providing information in one-off interactions. At the same time, we also observed that older people write more informational posts. This may be explained by the fact that older people more generally use the Internet to find information, that they become experts in their chronic conditions over time, and that with increasing age health conditions may have less emotional impact as they are relatively more expected.

The demographic prevalence of the condition may also be enmeshed with the disease-related tendency to write informational or emotional posts. Our analysis suggests that content differs across the 14 conditions: mental health or brain-related conditions (such as depression, schizophrenia, and Alzheimer’s disease) feature more emotionally oriented posts, with around 80% of posts primarily containing emotional support. In contrast, non-terminal physical conditions (such as irritable bowel syndrome, diabetes, and asthma) focus more on informational support, with around 70% of posts providing advice about symptoms, treatments, and medication.

Finally, there was no gender difference across conditions with respect to the proportion of posts that were informational versus emotional. That said, prostate cancer forums are oriented towards informational support, whereas breast cancer forums feature more emotional support. Apart from the generally different nature of the two conditions, one explanation may lie in the nature of single-gender versus mixed-gender groups: an earlier meta-study found that women write more emotional content than men when talking among others of the same gender – but interestingly, in mixed-gender discussions, these differences nearly disappeared.

Our research helped to identify factors that determine whether online content is informational or emotional, and demonstrated how posts differ across conditions. In addition to theoretical insights about patient needs, this research will help practitioners to better understand the role of online support groups for different patients, and to provide advice to patients about the value of online support.

The results also suggest that online support groups should be integrated into the digital health strategies of the UK and other nations. At present the UK plan for “Personalised Health and Care 2020” is centred around digital services provided within the health system, and does not yet reflect the value of person-generated health data from online support groups to patients. Our research substantiates that it would benefit from considering the instrumental role that online support groups can play in the healthcare process.

Read the full paper: Deetjen, U. and J. A. Powell (2016) Informational and emotional elements in online support groups: a Bayesian approach to large-scale content analysis. Journal of the American Medical Informatics Association. http://dx.doi.org/10.1093/jamia/ocv190


Ulrike Deetjen (née Rauer) is a doctoral student at the Oxford Internet Institute researching the influence of the Internet on healthcare provision and health outcomes.

What explains the worldwide patterns in user-generated geographical content? (Mon, 8 Sep 2014)

The geographies of codified knowledge have always been uneven, affording some people and places greater voice and visibility than others. While the rise of the geosocial Web seemed to promise a greater diversity of voices, opinions, and narratives about places, many regions remain largely absent from the websites and services that represent them to the rest of the world. These highly uneven geographies of codified information matter because they shape what is known and what can be known. As geographic content and geospatial information become increasingly integral to our everyday lives, places that are left off the ‘map of knowledge’ will be absent from our understanding of, and interaction with, the world.

We know that Wikipedia is important to the construction of geographical imaginations of place, and that it has immense power to augment our spatial understandings and interactions (Graham et al. 2013). In other words, the presences and absences in Wikipedia matter. If a person’s primary free source of information about the world is the Persian or Arabic or Hebrew Wikipedia, then the world will look fundamentally different from the world presented through the lens of the English Wikipedia. The capacity to represent oneself to outsiders is especially important in those parts of the world that are characterized by highly uneven power relationships: Brunn and Wilson (2013) and Graham and Zook (2013) have already demonstrated the power of geospatial content to reinforce power in a South African township and Jerusalem, respectively.

Until now, there has been no large-scale empirical analysis of the factors that explain information geographies at the global scale; this is something we have aimed to address in this research project on Mapping and measuring local knowledge production and representation in the Middle East and North Africa. Using regression models of geolocated Wikipedia data we have identified what are likely to be the necessary conditions for representation at the country level, and have also identified the outliers, i.e. those countries that fare considerably better or worse than expected. We found that a large part of the variation could be explained by just three factors: namely, (1) country population, (2) availability of broadband Internet, and (3) the number of edits originating in that country. [See the full paper for an explanation of the data and the regression models.]
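As an illustration of this kind of country-level regression, the sketch below fits one of the three relationships (geotagged articles as a function of local edits) on a log-log scale. The country figures are invented for demonstration; the paper’s actual models used real data and all three predictors together.

```python
import math

# Hypothetical country-level data (invented for illustration): geotagged
# article counts against the number of edits originating in each country.
edits    = [500, 2_000, 10_000, 40_000, 150_000, 600_000]
articles = [120,   480,  2_100,  7_900,  31_000, 110_000]

# Fit log(articles) = a + b * log(edits) by ordinary least squares.
x = [math.log(v) for v in edits]
y = [math.log(v) for v in articles]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

def predict(n_edits):
    """Expected number of geotagged articles for a country with n_edits edits."""
    return math.exp(a + b * math.log(n_edits))

# Countries whose observed/expected ratio falls well below 1 are the model's
# negative outliers, the role Qatar and the UAE play in the paper.
for e, obs in zip(edits, articles):
    print(f"edits={e:>7,}  observed={obs:>7,}  expected={predict(e):>9,.0f}  ratio={obs/predict(e):.2f}")
```

Fitting in log space means the coefficient b reads as an elasticity: a 1% increase in local edits is associated with roughly a b% increase in geotagged articles, and the residuals identify countries doing better or worse than expected.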

But how do we explain the significant inequalities in the geography of user-generated information that remain after adjusting for differing conditions using our regression model? While these three variables help to explain the sparse amount of content written about much of Sub-Saharan Africa, most of the Middle East and North Africa have quantities of geographic information below their expected values. For example, despite high levels of wealth and connectivity, Qatar and the United Arab Emirates have far fewer articles than we might expect from the model.

These three factors independently matter, but they will also be subject to a number of constraints. A country’s population will probably affect the number of human sites, activities, and practices of interest; ie the number of things one might want to write about. The size of the potential audience might also be influential, encouraging editors in more densely populated regions and those writing in major languages. However, societal attitudes towards learning and information sharing will probably also affect the propensity of people in some places to contribute content. Factors discouraging the number of edits to local content might include a lack of local Wikimedia chapters, the attractiveness of writing content about other (better-represented) places, or contentious disputes in local editing communities that divert time into edit wars and away from content generation.

We might also be seeing a principle of increasing informational poverty. Not only is a broader base of traditional source material (such as books, maps, and images) needed for the generation of any Wikipedia article, but it is likely that the very presence of content itself is a generative factor behind the production of further content. This makes information produced about information-sparse regions most useful for people in informational cores — who are used to integrating digital information into their everyday practices — rather than those in informational peripheries.

Various practices and procedures of Wikipedia editing likely amplify this effect. There are strict guidelines on how knowledge can be created and represented in Wikipedia, including a ban on original research, and the need to source key assertions. Editing incentives and constraints probably also encourage work around existing content (which is relatively straightforward to edit) rather than creation of entirely new material. In other words, the very policies and norms that govern the encyclopedia’s structure make it difficult to populate the white space with new geographic content. In addressing these patterns of increasing informational poverty, we need to recognize that no one of these three conditions can ever be sufficient for the generation of geographic knowledge. As well as highlighting the presences and absences in user-generated content, we also need to ask what factors encourage or limit production of that content.

In interpreting our model, we have come to a stark conclusion: increasing representation doesn’t occur in a linear fashion, but it accelerates in a virtuous cycle, benefitting those with strong editing cultures in local languages. For example, Britain, Sweden, Japan and Germany are extensively georeferenced on Wikipedia, whereas much of the MENA region has not kept pace, even accounting for their levels of connectivity, population, and editors. Thus, while some countries are experiencing the virtuous cycle of more edits and broadband begetting more georeferenced content, those on the periphery of these information geographies might fail to reach a critical mass of editors, or even dismiss Wikipedia as a legitimate site for user-generated geographic content: a problem that will need to be addressed if Wikipedia is indeed to be considered as the “sum of all human knowledge”.

Read the full paper: Graham, M., Hogan, B., Straumann, R.K., and Medhat, A. (2014) Uneven Geographies of User-Generated Information: Patterns of Increasing Informational Poverty. Annals of the Association of American Geographers.

References

Brunn, S. D., and M. W. Wilson. 2013. Cape Town’s million plus black township of Khayelitsha: Terrae incognitae and the geographies and cartographies of silence. Habitat International 39: 284–294.

Graham, M., and M. Zook. 2013. Augmented Realities and Uneven Geographies: Exploring the Geolinguistic Contours of the Web. Environment and Planning A 45(1): 77–99.

Graham, M., M. Zook, and A. Boulton. 2013. Augmented Reality in the Urban Environment: Contested Content and the Duplicity of Code. Transactions of the Institute of British Geographers 38(3): 464–479.


Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

What is stopping greater representation of the MENA region? (Wed, 6 Aug 2014)

Negotiating the wider politics of Wikipedia can be a daunting task, particularly when it comes to content about the MENA region. Image of the Dome of the Rock (Qubbat As-Sakhrah), Jerusalem, by 1yen

Wikipedia has famously been described as a project that “works great in practice and terrible in theory”. One of the ways in which it succeeds is through its extensive consensus-based governance structure. While this has led to spectacular success – over 4.5 million articles in the English Wikipedia alone – the governance structure is neither obvious nor immediately accessible, and can present a barrier for those seeking entry. Editing Wikipedia can be a tough challenge – an often draining and frustrating task, involving heated disputes and arguments where it is often the most tenacious, belligerent, or connected editor who wins out in the end.

Broadband access and literacy are not the only pre-conditions for editing Wikipedia; ‘digital literacy’ is also crucial. This includes the ability to obtain and critically evaluate online sources, locate Wikipedia’s editorial and governance policies, master Wiki syntax, and confidently articulate and assert one’s views about an article or topic. Experienced editors know how to negotiate the rules, build a consensus with some editors to block others, and how to influence administrators during dispute resolution. This strict adherence to the word (if not the spirit) of Wikipedia’s ‘law’ can lead to marginalization or exclusion of particular content, particularly when editors are scared off by unruly mobs who ‘weaponize’ policies to fit a specific agenda.

Governing such a vast collaborative platform as Wikipedia obviously presents a difficult balancing act between being open enough to attract volume of contributions, and moderated enough to ensure their quality. Many editors consider Wikipedia’s governance structure (which varies significantly between the different language versions) essential to ensuring the quality of its content, even if it means that certain editors can (for example) arbitrarily ban other users, lock down certain articles, and exclude moderate points of view. One of the editors we spoke to noted that: “A number of articles I have edited with quality sources, have been subjected to editors cutting information that doesn’t fit their ideas […] I spend a lot of time going back to reinstate information. Today’s examples are in the ‘Battle of Nablus (1918)’ and the ‘Third Transjordan attack’ articles. Bullying does occur from time to time […] Having tried the disputes process I wouldn’t recommend it.” Community building might help support MENA editors faced with discouragement or direct opposition as they try to build content about the region, but easily locatable translations of governance materials would also help. Few of the extensive Wikipedia policy discussions have been translated into Arabic, leading to replication of discussions or ambiguity surrounding correct dispute resolution.

Beyond arguments with fractious editors over minutiae (something that comes with the platform), negotiating the wider politics of Wikipedia can be a daunting task, particularly when it comes to content about the MENA region. It would be an understatement to say that the Middle East is a politically sensitive region, with more than its fair share of apparently unresolvable disputes, competing ideologies (it’s the birthplace of three world religions…), repressive governments, and ongoing and bloody conflicts. Editors shared stories with us about meddling from state actors (eg Tunisia, Iran) and a lack of trust in a platform that is generally considered to be a foreign, and sometimes explicitly American, tool. Rumors abound that several states (eg Israel, Iran) have concerted efforts to work on Wikipedia content, creating a chilling effect for new editors who might feel that editing certain pages might prove dangerous, or simply frustrating or impossible. Some editors spoke of being asked by Syrian government officials for advice on how to remove critical content, or how to identify the editors responsible for putting it there. Again: the effect is chilling.

A lack of locally produced and edited content about the region clearly can’t be blamed entirely on ‘outsiders’. Many editors in the Arabic Wikipedia have felt snubbed by the creation of an explicitly “Egyptian Arabic” Wikipedia, which has not only forked the content and editorial effort, but also stymied any ‘pan-Arab’ identity on the platform. There is a culture of administrators deleting articles they do not think are locally appropriate; often relating to politically (or culturally) sensitive topics. Due to Arabic Wikipedia’s often vicious edit wars, it is heavily moderated (unlike for example the English version), and anonymous edits do not appear instantly.

Some editors at the workshops noted other systemic and cultural issues, for example complaining of an education system that encourages rote learning, reinforcing the notion that only experts should edit (or moderate) a topic, rather than amateurs with local familiarity. Editors also noted the notable gender disparities on the site; a longstanding issue for other Wikipedia versions as well. None of these discouragements are helped by what some editors noted as a larger ‘image problem’ with editing in the Arabic Wikipedia, given it would always be overshadowed by the dominant English Wikipedia, one editor commenting that: “the English Wikipedia is vastly larger than its Arabic counterpart, so it is not unthinkable that there is more content, even about Arab-world subjects, in English. From my (unscientific) observation, many times, content in Arabic about a place or a tribe is not very encyclopedic, but promotional, and lacks citations”. Translating articles into Arabic might be seen as menial and unrewarding work, when the exciting debates about an article are happening elsewhere.

When we consider the coming-together of all of these barriers, it might be surprising that Wikipedia is actually as large as it is. However, the editors we spoke with were generally optimistic about the site, considering it an important activity that serves the greater good. Wikipedia is without doubt one of the most significant cultural and political forces on the Internet. Wikipedians are remarkably generous with their time, and it’s their efforts that are helping to document, record, and represent much of the world – including places where documentation is scarce. Most of the editors at our workshop ultimately considered Wikipedia a path to a more just society, through consensus, voting, and an aspiration to record certain truths – seeing it not just as a site of conflict, but also a site of regional (and local) pride. When asked why he writes geographic content, one editor simply replied: “It’s my own town”.


Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

How well represented is the MENA region in Wikipedia? (Tue, 22 Jul 2014)
There are more Wikipedia articles in English than Arabic about almost every Arabic speaking country in the Middle East. Image of rock paintings in the Tadrart Acacus region of Libya by Luca Galuzzi.
Wikipedia is often seen to be both an enabler and an equalizer. Every day hundreds of thousands of people collaborate on an (encyclopaedic) range of topics; writing, editing and discussing articles, and uploading images and video content. This structural openness combined with Wikipedia’s tremendous visibility has led some commentators to highlight it as “a technology to equalize the opportunity that people have to access and participate in the construction of knowledge and culture, regardless of their geographic placing” (Lessig 2003). However, despite Wikipedia’s openness, there are also fears that the platform is simply reproducing worldviews and knowledge created in the Global North at the expense of Southern viewpoints (Graham 2011; Ford 2011). Indeed, there are indications that global coverage in the encyclopaedia is far from ‘equal’, with some parts of the world heavily represented on the platform, and others largely left out (Hecht and Gergle 2009; Graham 2011, 2013, 2014).

These second-generation digital divides are not merely divides of Internet access (much discussed in the late 1990s), but gaps in representation and participation (Hargittai and Walejko 2008). Whereas most Wikipedia articles about European and East Asian countries are written in their dominant languages, for much of the Global South we see a dominance of articles written in English. These geographic differences in the coverage of different language versions of Wikipedia matter, because fundamentally different narratives can be (and are) created about places and topics in different languages (Graham and Zook 2013; Graham 2014).

If we undertake a ‘global analysis’ of this pattern by examining the number of geocoded articles (ie about a specific place) across Wikipedia’s main language versions (Figure 1), the first thing we can observe is the incredible human effort that has gone into describing ‘place’ in Wikipedia. The second is the clear and highly uneven geography of information, with Europe and North America home to 84% of all geolocated articles. Almost all of Africa is poorly represented in the encyclopaedia — remarkably, there are more Wikipedia articles written about Antarctica (14,959) than any country in Africa, and more geotagged articles relating to Japan (94,022) than the entire MENA region (88,342). In Figure 2 it is even more obvious that Europe and North America lead in terms of representation on Wikipedia.

Figure 1. Total number of geotagged Wikipedia articles across all 44 surveyed languages.
Figure 1. Total number of geotagged Wikipedia articles across all 44 surveyed languages.
Figure 2. Number of regional geotagged articles and population.

Knowing how many articles describe a place only tells a part of the ‘representation story’. Figure 3 adds the linguistic element, showing the dominant language of Wikipedia articles per country. The broad pattern is that some countries largely define themselves in their own languages, and others appear to be largely defined from outside. For instance, almost all European countries have more articles about themselves in their dominant language; that is, most articles about the Czech Republic are written in Czech. Most articles about Germany are written in German (not English).

Figure 3. Language with the most geocoded articles by country (across 44 top languages on Wikipedia).

We do not see this pattern across much of the South, where English dominates across much of Africa, the Middle East, South and East Asia, and even parts of South and Central America. French dominates in five African countries, and German is dominant in one former German colony (Namibia) and a few other countries (e.g. Uruguay, Bolivia, East Timor).

The scale of these differences is striking. Not only are there more Wikipedia articles in English than Arabic about almost every Arabic speaking country in the Middle East, but there are more English articles about North Korea than there are Arabic articles about Saudi Arabia, Libya, and the UAE. Not only do we see most of the world’s content written about global cores, but it is largely dominated by a relatively few languages.

Figure 4 shows the total number of geotagged Wikipedia articles in English per country. The sheer density of this layer of information over some parts of the world is astounding: there are 928,542 geotagged articles about places in English. Nonetheless, only 3.23% of these articles are about Africa, and just 1.67% are about the MENA region.

Figure 4. Number of geotagged articles in the English Wikipedia by country.

We see a somewhat different pattern when looking at the global geography of the 22,548 geotagged articles of the Arabic Wikipedia (Figure 5). Algeria and Syria are both defined by a relatively high number of articles in Arabic (as are the US, Italy, Spain, Russia and Greece). These information densities are substantially greater than what we see for many other MENA countries in which Arabic is an official language (such as Egypt, Morocco, and Saudi Arabia). This is even more surprising when we realise that the Italian and Spanish populations are smaller than the Egyptian, but there are nonetheless far more geotagged articles in Arabic about Italy (2,428) and Spain (1,988) than about Egypt (433).

Figure 5. Total number of geotagged articles in the Arabic Wikipedia by country.

By mapping the geography of Wikipedia articles in both global and regional languages, we can begin to examine the layers of representation that ‘augment’ the world we live in. We have seen that, notable exceptions aside (e.g. ‘Iran’ in Farsi and ‘Israel’ in Hebrew), the MENA region tends to be massively underrepresented — not just in major world languages, but also in its own: Arabic. Clearly, much is being left unsaid about that part of the world. Although we entered the project anticipating that the MENA region would be under-represented in English, we did not anticipate the degree to which it is under-represented in Arabic.

References

Ford, H. (2011) The Missing Wikipedians. In Critical Point of View: A Wikipedia Reader, ed. G. Lovink and N. Tkacz, 258-268. Amsterdam: Institute of Network Cultures.

Graham, M. (2014) The Knowledge Based Economy and Digital Divisions of Labour. In Companion to Development Studies, 3rd edition, eds. V. Desai and R. Potter. Hodder, pp. 189-195.

Graham, M. (2013) The Virtual Dimension. In Global City Challenges: Debating a Concept, Improving the Practice. Eds. Acuto, M. and Steele, W. London: Palgrave.

Graham, M. (2011) Wiki Space: Palimpsests and the Politics of Exclusion. In Critical Point of View: A Wikipedia Reader. Eds. Lovink, G. and Tkacz, N. Amsterdam: Institute of Network Cultures, pp. 269-282.

Graham M., and M. Zook (2013) Augmented Realities and Uneven Geographies: Exploring the Geolinguistic Contours of the Web. Environment and Planning A 45 (1) 77–99.

Hargittai, E. and G. Walejko (2008) The Participation Divide: Content Creation and Sharing in the Digital Age. Information, Communication and Society 11 (2) 239–256.

Hecht B., and D. Gergle (2009) Measuring self-focus bias in community-maintained knowledge repositories. In Proceedings of the 4th International Conference on Communities and Technologies, Penn State University, 2009, pp. 11–20. New York: ACM.

Lessig, L. (2003) An Information Society: Free or Feudal. Talk given at the World Summit on the Information Society, Geneva, 2003.


Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

The sum of (some) human knowledge: Wikipedia and representation in the Arab World https://ensr.oii.ox.ac.uk/the-sum-of-some-human-knowledge-wikipedia-and-representation-in-the-arab-world/ Mon, 14 Jul 2014 09:00:14 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2555
Arabic is one of the least represented major world languages on Wikipedia: few languages have more speakers and fewer articles than Arabic. Image of the Umayyad Mosque (Damascus) by Travel Aficionado

Wikipedia currently contains over 9 million articles in 272 languages, far surpassing any other publicly available information repository. Being the first point of contact for most general topics (therefore an effective site for framing any subsequent representations) it is an important platform from which we can learn whether the Internet facilitates increased open participation across cultures — or reinforces existing global hierarchies and power dynamics. Because the underlying political, geographic and social structures of Wikipedia are hidden from users, and because there have not been any large scale studies of the geography of these structures and their relationship to online participation, entire groups of people (and regions) may be marginalized without their knowledge.

This process is important to understand, for the simple reason that Wikipedia content has begun to form a central part of services offered elsewhere on the Internet. When you look for information about a place on Facebook, the description of that place (including its geographic coordinates) comes from Wikipedia. If you want to “check in” to a museum in Doha to show your friends you were there, the place you check in to was created with Wikipedia data. When you Google “House of Saud” you are presented not only with a list of links (with Wikipedia at the top) but also with a special ‘card’ summarising the House. This data comes from Wikipedia. When you look for people or places, Google now has these terms inside its ‘knowledge graph’, a network of related concepts with data coming directly from Wikipedia. Similarly, on Google Maps, Wikipedia descriptions of landmarks are presented as part of the default information.

Ironically, Wikipedia editorship is actually on a slow and steady decline, even as its content and readership increases year on year. Since 2007 and the introduction of significant devolution of administrative powers to volunteers, Wikipedia has not been able to effectively retain newcomers, something which has been noted as a concern by many at the Wikimedia Foundation. Some think Wikipedia might be levelling off because there’s only so much to write about. This is extremely far from the truth; there are still substantial gaps in geographic content in English and overwhelming gaps in other languages. Wikipedia often brands itself as aspiring to contain “the sum of human knowledge”, but behind this mantra lie policy pitfalls, tedious editor debates and delicate sourcing issues that hamper greater representation of the region. Of course these challenges form part of Wikipedia’s continuing evolution as the de facto source for online reference information, but they also (disturbingly) act to entrench particular ways of “knowing” — and ways of validating what is known.

There are over 260,000 articles in Arabic, receiving 240,000 views per hour. This actually translates as one of the least represented major world languages on Wikipedia: few languages have more speakers and fewer articles than Arabic. This relative lack of MENA voice and representation means that the tone and content of this globally useful resource, in many cases, is being determined by outsiders with a potential misunderstanding of the significance of local events, sites of interest and historical figures. In an area that has seen substantial social conflict and political upheaval, greater participation from local actors would help to ensure balance in content about contentious issues. Unfortunately, most research on MENA’s Internet presence has so far been drawn from anecdotal evidence, and no comprehensive studies currently exist.

In this project we wanted to understand where place-based content comes from, to explain reasons for the relative lack of Wikipedia articles in Arabic and about the MENA region, and to understand which parts of the region are particularly underrepresented. We also wanted to understand the relationship between Wikipedia’s administrative structure and the treatment of new editors; in particular, we wanted to know whether editors from the MENA region have less of a voice than their counterparts from elsewhere, and whether the content they create is considered more or less legitimate, as measured through the number of reverts; ie the overriding of their work by other editors.

Our practical objectives involved a consolidation of Middle Eastern Wikipedians through a number of workshops focusing on how to create more equitable and representative content, with the ultimate goal of making Wikipedia a more generative and productive site for reference information about the region. Capacity building among key Wikipedians can create greater understanding of barriers to participation and representation, and offset much of the (often considerable) emotional labour required to sustain activity on the site in the face of intense arguments and ideological biases. Potential systematic structures of exclusion that could be a barrier to participation include such competitive practices as content deletion, indifference to content produced by MENA authors, and marginalization through bullying and dismissal.

However, a distinct lack of sources — owing both to a lack of legitimacy for MENA journalism and a paucity of open access government documents — is also inhibiting further growth of content about the region. When inclusion of a topic is contested by editors it is typically because there is not enough external source material about it to establish “notability”. As Ford (2011) has already discussed, notability is often culturally mediated. For example, a story in Al Jazeera would not have been considered a sufficient criterion of notability a couple of years ago. However, this has changed dramatically since its central role in reporting on the Arab Spring.

Unfortunately, notability can create a feedback loop: if an area of the world is underreported, there are few sources; and if there are few sources, journalists do not always have enough information to report on that part of the world. ‘Correct’ sourcing trumps personal experience on Wikipedia; even if an author is from a place, and is watching a building being destroyed, their Wikipedia edit will not be accepted by the community unless the event is discussed in another ‘official’ medium. Often the edit will be branded with a ‘citation needed’ tag, eliminated, or discussed in the talk page. Particularly aggressive editors and administrators will nominate the page for ‘speedy deletion’ (ie deletion without discussion), a practice that makes responses from an author difficult.

Why does any of this matter in practical terms? For the simple reason that biases, absences and contestations on Wikipedia spill over into numerous other domains that are in regular and everyday use (Graham and Zook, 2013). If a place is not on Wikipedia, this might have a chilling effect on business and stifle journalism; if a place is represented poorly on Wikipedia this can lead to misunderstandings about the place. Wikipedia is not a legislative body. However, in the court of public opinion, Wikipedia represents one of the world’s strongest forces, as it quietly inserts itself into representations of place worldwide (Graham et. al 2013; Graham 2013).

Wikipedia is not merely a site of reference information, but is rapidly becoming the de facto site for representing the world to itself. We need to understand more about that representation.

Further Reading

Allagui, I., Graham, M., and Hogan, B. 2014. Wikipedia Arabe et la Construction Collective du Savoir. In Wikipedia, objet scientifique non identifie. eds. Barbe, L., and Merzeau, L. Paris: Presses Universitaires de Paris Ouest (in press).

Graham, M., Hogan, B., Straumann, R. K., and Medhat, A. 2014. Uneven Geographies of User-Generated Information: Patterns of Increasing Informational Poverty. Annals of the Association of American Geographers (forthcoming).

Graham, M. 2012. Die Welt in Der Wikipedia Als Politik der Exklusion: Palimpseste des Ortes und selective Darstellung. In Wikipedia. eds. S. Lampe, and P. Bäumer. Bundeszentrale für politische Bildung/bpb, Bonn.

Graham, M. 2011. Wiki Space: Palimpsests and the Politics of Exclusion. In Critical Point of View: A Wikipedia Reader. Eds. Lovink, G. and Tkacz, N. Amsterdam: Institute of Network Cultures, 269-282.

References

Ford, H. (2011) The Missing Wikipedians. In Geert Lovink and Nathaniel Tkacz (eds), Critical Point of View: A Wikipedia Reader, Amsterdam: Institute of Network Cultures, 2011. ISBN: 978-90-78146-13-1.

Graham, M., M. Zook., and A. Boulton. 2013. Augmented Reality in the Urban Environment: contested content and the duplicity of code. Transactions of the Institute of British Geographers. 38(3), 464-479.

Graham, M and M. Zook. 2013. Augmented Realities and Uneven Geographies: Exploring the Geo-linguistic Contours of the Web. Environment and Planning A 45(1) 77-99.

Graham, M. 2013. The Virtual Dimension. In Global City Challenges: debating a concept, improving the practice. eds. M. Acuto and W. Steele. London: Palgrave. 117-139.


Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

Mapping the Local Geographies of Digital Inequality in Britain https://ensr.oii.ox.ac.uk/mapping-the-local-geographies-of-digital-inequality-in-britain/ https://ensr.oii.ox.ac.uk/mapping-the-local-geographies-of-digital-inequality-in-britain/#comments Fri, 27 Jun 2014 11:48:00 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2730 Britain has one of the largest Internet economies in the industrial world. The Internet contributes an estimated 8.3% to Britain’s GDP (Dean et al. 2012), and strongly supports domestic job and income growth by enabling access to new customers, markets and ideas. People benefit from better communications, and businesses are more likely to locate in areas with good digital access, thereby boosting local economies (Malecki & Moriset 2008). While the Internet brings clear benefits, there is also a marked inequality in its uptake and use (the so-called ‘digital divide’). We already know from the Oxford Internet Surveys (OxIS) that Internet use in Britain is strongly stratified by age, by income and by education; and yet we know almost nothing about local patterns of Internet use across the country.

A problem with national sample surveys (the usual source of data about Internet use and non-use) is that the sample sizes become too small to allow accurate generalization for smaller, sub-national areas. No one knows, for example, the proportion of Internet users in Glasgow, because national surveys simply won’t have enough respondents to make reliable city-level estimates. We know that Internet use is not evenly distributed at the regional level; Ofcom reports on broadband speeds and penetration at the county level (Ofcom 2011), and we know that London and the southeast are the most wired parts of the country (Dean et al. 2012). But given the importance of the Internet, the lack of knowledge about local patterns of access and use in Britain is surprising. It is a problem because without detailed information about small areas we can’t identify which areas would benefit most from policy intervention to encourage Internet use and improve access.

We have begun to address this lack of information by combining two important but separate datasets — the 2011 national census, and the 2013 OxIS survey — using the technique of small area estimation. By definition, census data are available for very small areas, and because the census reaches (basically) everyone, there are no sampling issues. Unfortunately, it is extremely expensive to collect these data, so the census includes few variables (it has no data on Internet use, for example). The second dataset, the OII’s Oxford Internet Survey (OxIS), is a very rich dataset covering all kinds of Internet activity, measured with a random sample of more than 2,000 individuals across Britain. Because OxIS is unable to survey everyone in Britain, it is based on a random sample of people living in geographical ‘Output Areas’ (OAs). These areas (generally of 40-250 households) represent the fundamental building block of the national census, being the smallest geographical area for which it reports data.

Because OxIS and the census (happily) use the same OAs, we can combine national-level data on Internet use (from OxIS) with local-level demographic information (from the census) to map estimated Internet use across Britain for the first time. We can do this because we can estimate from OxIS the likelihood of an individual using the Internet just from basic demographic data (age, income, education etc.). And because the census records these demographics for everyone in each OA, we can go on to estimate the likely proportion of Internet users in each of these areas. By combining the richness of OxIS survey data with the comprehensive small area coverage of the census we can use the strengths of one to offset the gaps in the other.

Of course, this procedure assumes that people in small areas will generally match national patterns of Internet use; ie that those who are better educated, employed, and young, are more likely to use the Internet. We assume that this pattern isn’t affected by cultural or social factors (e.g. ‘Northerners just like the Internet more’), or by anything unusual about a particular group of households that makes it buck national trends (eg ‘the young people of Wytham Street, Oxford just prefer not to use the Internet’).
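The two-step logic described above can be sketched in a few lines of code. This is only an illustration, not the authors’ actual model: the choice of logistic regression, the coefficient values, and the toy Output Area data below are all invented for the example.

```python
# Minimal sketch of small area estimation: fit an individual-level model
# nationally (step 1), then apply it to the census demographics of each
# small area and average the predictions (step 2).
from math import exp

# Step 1: hypothetical coefficients from a logistic regression fitted on
# the national survey: P(use) = logistic(b0 + b1*age + b2*has_degree).
B0, B_AGE, B_DEGREE = 4.0, -0.06, 1.2

def p_use(age, has_degree):
    """Predicted probability that one person uses the Internet."""
    z = B0 + B_AGE * age + B_DEGREE * (1 if has_degree else 0)
    return 1.0 / (1.0 + exp(-z))

# Step 2: apply the model to the census demographics of one Output Area
# (each tuple: age, has_degree) and average to estimate local usage.
def estimate_oa_usage(residents):
    return sum(p_use(a, d) for a, d in residents) / len(residents)

oa = [(25, True), (34, True), (52, False), (71, False), (44, True)]
print(round(estimate_oa_usage(oa), 2))
```

Because the census records age, education and so on for everyone in an OA, the averaged prediction stands in for the survey question the census never asked.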

So what do we see when we combine the two datasets? What are the local-level patterns of Internet use across Britain? We can see from the figure that the highest estimated Internet use (88-89%) is concentrated in the south east, with London dominating. Bristol, Southampton, and Nottingham also have high levels of use, as well as the rest of the south (interestingly, including rural Cornwall) with estimated usage levels of 78-83%. Leeds, York and Manchester are also in this category. In the lowest category (59-70% use) we find the entire North East region. Cities show much the same pattern, with southern cities having the highest estimated Internet use, and Newcastle and Middlesbrough having the lowest.

There isn’t room in this post to explore and discuss all the patterns (or to speculate on the underlying reasons), but there are clear policy implications from this work. The Internet has made an enormous difference in our social life, culture, and economy; this is why it is important to bring people online, to encourage them all to participate and benefit. However, despite the importance of the Internet in Britain today, we still know very little about who is, and isn’t connected. We hope this approach (and this data) can help pinpoint the areas of greatest need. For example, the North East is striking — even the cities don’t seem to stand out from the surrounding rural areas. Allocating resources to improve use in the North East would probably be valuable, with rural areas as a secondary priority. Interestingly, Cornwall (despite being very rural) is actually above average in terms of likely Internet users, and is also the recipient of a major European Regional Development Fund effort to extend their broadband.

Actually getting access via fibre-optic cable is just one part of the story of Internet use (and one we don’t cover in this post); but this is the first time we have been able to estimate likely use at a local level, based on the known characteristics of the people who live there. Using these small area estimation techniques opens a whole new area for social media research and policy-making around local patterns of digital participation. Going forward, we intend to expand the model to include urban-rural differences, the index of multiple deprivation, occupation, and socio-economic status. But there’s already much more we can do with these data.

References

Dean, D., DiGrande, S., Field, D., Lundmark, A., O’Day, J., Pineda, J., Zwillenberg, P. (2012) The connected world: The Internet economy in the G-20. Boston: Boston Consulting Group.

Malecki, E.J. & Moriset, B. (2008) The digital economy: Business organization, production processes and regional developments. London: Routledge.

Ofcom (2011) Communications infrastructure report: Fixed broadband data. [accessed on 23/9/2013 from http://stakeholders.ofcom.org.uk/binaries/research/broadband-research/Fixed_Broadband_June_2011.pdf ]

Read the full paper: Blank, G., Graham, M., and Calvino, C. (2014) Mapping the Local Geographies of Digital Inequality. [contact the authors for the paper and citation details]


Grant Blank is a Survey Research Fellow at the OII. He is a sociologist who studies the social and cultural impact of the Internet and other new communication media. He is principal investigator on the OII’s Geography of Digital Inequality project, which combines OxIS and census data to produce the first detailed geographic estimates of Internet use across the UK.

How easy is it to research the Chinese web? https://ensr.oii.ox.ac.uk/how-easy-is-it-to-research-the-chinese-web/ Tue, 18 Feb 2014 11:05:57 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2418
Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediary and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?

Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediaries and tools. All these factors directly affect the quality and usability of data. Government control takes many forms and serves many intentions, and it increasingly extends beyond websites inside mainland China under Chinese jurisdiction to the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to the “American dream”); determining the extent and level of government control and intervention therefore requires case-by-case research. Based on my own research on Chinese user-generated encyclopaedias and on Chinese-language Twitter and Weibo, the expectation is that control and intervention by Beijing are most likely on political and cultural topics, and unlikely on economic or entertainment ones.

This observation is linked to how various forms of government control and intervention are executed, which often requires massive data and human operations to filter, categorise and produce content, often based on keywords. This is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.

Of course there is an extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu at Hong Kong University manage to scrape data samples from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done in previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).

It follows that the availability of data intermediaries and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ), with annual revenue near 200 million RMB. Hence, in addition to the ongoing considerations around big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).

Given the fact that the government’s control of information on the Chinese Web involves not only the marginalization (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.

Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?

Han-Teng: Identifying geographic information in Chinese Web data, like other Web data, can largely be done by geo-IP (a straightforward IP-to-geographic-location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used in Hong Kong and Taiwan). Again, as with the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, openness, and availability of data intermediaries and tools.
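As a rough illustration of two of these signals, the sketch below classifies a page by its country-code domain suffix, falling back to counting a handful of simplified-only versus traditional-only characters. This is a simplified stand-in: real geo-IP lookup needs an external database (so it is omitted here), and real script detection uses much larger character mapping tables; the domains and labels are invented.

```python
# Two of the three signals discussed above: ccTLD and script preference.
CCTLD_REGION = {".cn": "mainland China", ".hk": "Hong Kong", ".tw": "Taiwan"}

# Tiny character lists standing in for a real simplified/traditional
# detector (e.g. 国/國, 书/書 are simplified/traditional pairs).
SIMPLIFIED_ONLY = set("国语时书东")
TRADITIONAL_ONLY = set("國語時書東")

def guess_region(domain, text):
    """Prefer the domain signal; fall back to a script-count heuristic."""
    for suffix, region in CCTLD_REGION.items():
        if domain.endswith(suffix):
            return region
    simp = sum(c in SIMPLIFIED_ONLY for c in text)
    trad = sum(c in TRADITIONAL_ONLY for c in text)
    if simp > trad:
        return "mainland China (script heuristic)"
    if trad > simp:
        return "Hong Kong/Taiwan (script heuristic)"
    return "unknown"

print(guess_region("news.example.hk", ""))       # domain signal wins
print(guess_region("example.com", "全国图书"))    # simplified characters
```

In practice the signals are combined, since each one alone (a .com domain, a quoted passage in another script) can mislead.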

Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of Chinese blogsphere research with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).

I myself found distinct geographic focus and linguistic preferences between the online citations in Baidu Baike and Chinese Wikipedia (Liao 2008). Other research based on a sample of traffic data shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of geographic and linguistic differences (Taneja and Wu 2013). Although I am not fully convinced by their argument that the Great Firewall has very limited impact on this single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information from Chinese Web data to better understand the dynamics of Chinese online interactions, which are by no means limited to within China or behind the Great Firewall.

Ed: In terms of online monitoring of public opinion, is it possible to identify robots / “50 cent party” — that is, what proportion of the “opinion” actually has a government source?

Han-Teng: There exist research efforts in identifying robot comments by analysing the patterns and content of comments, and their profile relationship with other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach such as narrative analysis for well-defined propaganda research (such as the pro- and anti-Falun opinions), it might be easier to categorise and visualise the dynamics and then trace back to the origins of dominant keywords and narratives to identify the sources of loud messages. I personally think such research and analytical efforts require deep knowledge on both technical and cultural-political understanding of Chinese Web data, preferably with an integrated mixed method research design that incorporates both the quantitative and qualitative methods required for the data question at hand.

Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?

Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behaviour mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline is the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties”. Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is thus delivered.

Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially institutions in Hong Kong and Berkeley to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalization of information and/or narratives”. It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalization and prioritisation, treating censorship and data deletion as one mechanism of information marginalization in the age of data abundance and limited attention.

Ed: Has anyone tried to produce a map of censorship: ie mapping absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?

Han-Teng: Mapping censorship has been done through experiments (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here the availability of data intermediaries, such as WeiboScope at Hong Kong University, and of unblocked alternatives such as Chinese Wikipedia, provides direct and indirect points of comparison for seeing what is being, or is most likely to be, deleted. As I am more interested in mapping information marginalization (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extents of information censorship and marginalization. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down undesired messages and voicing up desired ones, researchers do not necessarily have to get hold of the deleted content from websites inside mainland China. They can simply reuse the plentiful Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative studies.

Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?

Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it's only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese-language texts, although several of the major ones become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been major goals and actors for a multi-lingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries designed to process Chinese texts smoothly. For instance, much analytical software depends on or requires space characters as word boundaries, a condition that does not hold for Chinese texts.
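To illustrate the point: a whitespace-based tokeniser sees an unsegmented Chinese sentence as a single token, so Chinese pipelines must first segment text against a lexicon. Below is a minimal, purely illustrative sketch of forward maximum matching with a toy dictionary; real work would use a dedicated segmenter such as jieba with a large lexicon and statistical disambiguation.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word that matches; fall back to one character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Toy dictionary, invented for the example.
toy_dict = {"中文", "维基百科", "没有", "空格"}
print(segment("中文维基百科没有空格", toy_dict))
# → ['中文', '维基百科', '没有', '空格']
# A naive str.split() would return the whole sentence as one "word".
```

The fallback to single characters means the segmenter never stalls on out-of-vocabulary text, at the cost of over-splitting unknown words.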

In addition, there are technical and interpretative challenges in analysing Chinese text datasets with mixed scripts (e.g. simplified and traditional Chinese) or mixed languages. Mandarin Chinese is not the only language inside China; there are indications that Cantonese and Shanghainese have a significant presence. Minority languages such as Tibetan, Mongolian, Uyghur, etc. are also still used on official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell "Chinese stories" in many of the world's major languages, sometimes in direct competition with political opponents such as Falun Gong.

These areas of "Chinese Web" data remain unexplored territory for systematic research, which will require more tools and methods similar to the toolkits of multi-lingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the "Chinese Web", but are rather a general challenge for a Web that is becoming more multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as WeiboScope and others, for a better future for the Chinese Web data ecology.

References

Bamman, D., O’Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3-5).

Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.

Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.

Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.

Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy, Georgetown University.

King, G., Pan, J., & Roberts, M. (2012). How censorship in China allows government criticism but silences collective expression. In APSA 2012 Annual Meeting Paper.

Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.

Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.

Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799

Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.

Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0

Taneja, H., & Wu, A. X. (2013). How Does the Great Firewall of China Affect Online User Behavior? Isolated “Internets” as Culturally Defined Markets on the WWW. Presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

Zhu, T., Bronk, C., & Wallach, D. S. (2011). An Analysis of Chinese Search Engine Filtering. arXiv:1107.3794.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2012). Tracking and Quantifying Censorship on a Chinese Microblogging Site. arXiv:1211.6166.


Han-Teng was talking to blog editor David Sutcliffe.

Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.

]]>
Mapping collective public opinion in the Russian blogosphere https://ensr.oii.ox.ac.uk/mapping-collective-public-opinion-in-the-russian-blogosphere/ Mon, 10 Feb 2014 11:30:05 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2372
Widely reported as fraudulent, the 2011 Russian Parliamentary elections provoked mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia. Image by Nikolai Vassiliev.

Blogs are becoming increasingly important for agenda setting and the formation of collective public opinion on a wide range of issues. In countries like Russia, where the Internet is not technically filtered but the traditional media is tightly controlled by the state, they may be particularly important. The Russian-language blogosphere counts about 85 million blogs – a number far beyond the capacity of any government to control – and the Russian search engine Yandex, with its blog rating service, serves as an important reference point for Russia's educated public in its search for authoritative and independent sources of information. The blogosphere is thereby able to function as a mass medium of "public opinion" and also to exercise influence.

One topic that was particularly salient over the period we studied concerned the Russian Parliamentary elections of December 2011. Widely reported as fraudulent, they provoked immediate and mass street protest action by tens of thousands of people in Moscow and cities and towns across Russia, as well as corresponding activity in the blogosphere. Protesters made effective use of the Internet to organize a movement that demanded cancellation of the parliamentary election results, and the holding of new and fair elections. These protests continued until the following summer, gaining widespread national and international attention.

Most of the political and social discussion blogged in Russia is hosted on the blog platform LiveJournal. Some of these bloggers can claim a certain amount of influence; the top thirty bloggers have over 20,000 "friends" each, representing a good circulation for the average Russian newspaper. Part of the blogosphere may thereby resemble the traditional media; the deeper into the long tail of average bloggers, however, the more it functions as pure public opinion. This "top list" effect may be particularly important in societies (like Russia's) where popularity lists exert a visible influence on bloggers' competitive behavior and on public perceptions of their significance. Given the influence of these top bloggers, it may be claimed that, like the traditional media, they act as filters of issues to be thought about, and as definers of their relative importance and salience.

Gauging public opinion is of obvious interest to governments and politicians, and opinion polls are widely used to do this, but they have been consistently criticized for the imposition of agendas on respondents by pollsters, producing artefacts. Indeed, the public opinion literature has tended to regard opinion as something to be “extracted” by pollsters, which inevitably pre-structures the output. This literature doesn’t consider that public opinion might also exist in the form of natural language texts, such as blog posts, that have not been pre-structured by external observers.

There are two basic ways to detect topics in natural language texts: the first is manual coding of texts (ie traditional content analysis); the second involves the rapidly developing techniques of automatic topic modeling or text clustering. The media studies literature has relied heavily on traditional content analysis; however, these studies are inevitably limited by the volume of data a person can physically process, given there may be hundreds of issues and opinions to track — LiveJournal's 2.8 million blog accounts, for example, generate 90,000 posts daily.

For large text collections, therefore, only the second approach is feasible. In our article we explored how methods for topic modeling developed in computer science may be applied to social science questions – such as how to efficiently track public opinion on particular (and evolving) issues across entire populations. Specifically, we demonstrate how automated topic modeling can identify public agendas, their composition, structure, the relative salience of different topics, and their evolution over time without prior knowledge of the issues being discussed and written about. This automated “discovery” of issues in texts involves division of texts into topically — or more precisely, lexically — similar groups that can later be interpreted and labeled by researchers. Although this approach has limitations in tackling subtle meanings and links, experiments where automated results have been checked against human coding show over 90 percent accuracy.
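As a toy illustration of this "division of texts into lexically similar groups" (not the probabilistic topic model used in the article), the sketch below greedily clusters short posts by the cosine similarity of their word-count vectors; the posts and the similarity threshold are invented for the example.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words representation: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass greedy clustering: assign each document to the first
    cluster whose centroid is lexically similar enough, else start a
    new cluster. Real topic modeling (e.g. LDA) is far more robust."""
    clusters = []
    for doc in docs:
        v = bow(doc)
        for c in clusters:
            if cosine(v, c["centroid"]) >= threshold:
                c["docs"].append(doc)
                c["centroid"] += v  # Counter addition merges word counts
                break
        else:
            clusters.append({"centroid": v, "docs": [doc]})
    return [c["docs"] for c in clusters]

posts = [
    "election fraud protest moscow",
    "protest election results moscow street",
    "new film premiere cinema review",
    "cinema film box office review",
]
print(cluster(posts))  # two clusters: the election posts, the cinema posts
```

The resulting groups are purely lexical; as the article notes, a researcher must still inspect and label them before they can be read as "topics".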

The computer science literature is flooded with methodological papers on the automatic analysis of big textual data. While these methods can't entirely replace manual work with texts, they can help reduce it to the most meaningful and representative areas of the textual space they help to map, and they are the only means of monitoring agendas and attitudes across multiple sources, over long periods, and at scale. They can also help solve problems of insufficient and biased sampling, when entire populations become available for analysis. Due to their novelty, as well as their mathematical and computational complexity, these approaches are rarely applied by social scientists; to our knowledge, topic modeling has not previously been applied to the extraction of agendas from blogs in any social science research.

The natural extension of automated topic or issue extraction is sentiment mining and analysis; as González-Bailón, Kaltenbrunner, and Banchs (2012) have pointed out, public opinion doesn't just involve specific issues, but also encompasses the state of public emotion about these issues, including attitudes and preferences. This involves extracting opinions on the issues/agendas that are thought to be present in the texts, usually by classifying sentences as positive or negative. These techniques are based on human-coded dictionaries of emotive words, on algorithmic construction of sentiment dictionaries, or on machine learning techniques.
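A minimal sketch of the dictionary-based variant, with invented word lists; production systems use large curated lexicons or trained classifiers rather than a handful of words.

```python
# Toy hand-coded emotive dictionaries (invented for illustration).
POSITIVE = {"good", "great", "fair", "support", "hope"}
NEGATIVE = {"fraud", "bad", "angry", "corrupt", "protest"}

def sentiment(sentence):
    """Label a sentence by counting matches against emotive word lists."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("We hope for fair elections"))      # → positive
print(sentiment("The election was corrupt fraud"))  # → negative
print(sentiment("The election happened"))           # → neutral
```

The obvious weakness, which the machine-learning approaches address, is that word lists ignore negation and context ("not good" scores as positive here).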

Both topic modeling and sentiment analysis techniques are required to effectively monitor self-generated public opinion. When methods for tracking attitudes complement methods to build topic structures, a rich and powerful map of self-generated public opinion can be drawn. Of course this mapping can’t completely replace opinion polls; rather, it’s a new way of learning what people are thinking and talking about; a method that makes the vast amounts of user-generated content about society – such as the 65 million blogs that make up the Russian blogosphere — available for social and policy analysis.

Naturally, this approach to public opinion and attitudes is not free of limitations. First, the dataset is only representative of the self-selected population of those who have authored the texts, not of the whole population. Second, like regular polled public opinion, online public opinion only covers those attitudes that bloggers are willing to share in public. Furthermore, there is still a long way to go before the relevant instruments become mature, and this will demand the efforts of the whole research community: computer scientists and social scientists alike.

Read the full paper: Olessia Koltsova and Sergei Koltcov (2013) Mapping the public agenda with topic modeling: The case of the Russian LiveJournal. Policy and Internet 5 (2) 207–227.

Also read on this blog: Can text mining help handle the data deluge in public policy analysis? by Aude Bicquelet.

References

González-Bailón, S., A. Kaltenbrunner, and R.E. Banchs. 2012. "Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions." Human Communication Research 38 (2): 121–43.

]]>
Edit wars! Measuring and mapping society’s most controversial topics https://ensr.oii.ox.ac.uk/edit-wars-measuring-mapping-societys-most-controversial-topics/ Tue, 03 Dec 2013 08:21:43 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2339 Ed: How did you construct your quantitative measure of ‘conflict’? Did you go beyond just looking at content flagged by editors as controversial?

Taha: Yes we did … actually, we have shown that controversy measures based on "controversial" flags are not inclusive at all: although they might have high precision, they have very low recall. Instead, we constructed an automated algorithm to locate and quantify the editorial wars taking place on the Wikipedia platform. Our algorithm is based on reversions, i.e. when editors undo each other's contributions. We focused specifically on mutual reverts between pairs of editors, and we assigned a maturity score to each editor based on the total volume of their previous contributions. When counting the mutual reverts, we gave more weight to those committed by or on editors with higher maturity scores, as a revert between two experienced editors indicates a more serious problem. We always validated our method and compared it with other methods, using human judgement on a random selection of articles.
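The core idea can be sketched in a few lines; note this is a simplified illustration, not the published algorithm (which is more elaborate), and the editor names and edit counts are invented.

```python
def controversy_score(mutual_reverts, edit_counts):
    """Weight each mutually reverting pair by the *less* experienced
    participant's maturity (total prior edits), so a revert war between
    two veterans counts far more than cleanup involving a newcomer."""
    return sum(min(edit_counts[a], edit_counts[b]) for a, b in mutual_reverts)

# Hypothetical article history: two mutual-revert pairs.
counts = {"alice": 5000, "bob": 3000, "newbie": 2}
pairs = [("alice", "bob"), ("alice", "newbie")]
print(controversy_score(pairs, counts))  # → 3002
```

Taking the minimum of the two maturity scores is what makes the measure robust to vandalism: a flood of reverts against throwaway accounts adds almost nothing to the score.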

Ed: Was there any discrepancy between the content deemed controversial by your own quantitative measure, and what the editors themselves had flagged?

Taha: We were able to capture all the flagged content, but not all the articles found to be controversial by our method are flagged. And when you check the editorial history of those articles, you soon realise that they are indeed controversial but for some reason have not been flagged. It’s worth mentioning that the flagging process is not very well implemented in smaller language editions of Wikipedia. Even if the controversy is detected and flagged in English Wikipedia, it might not be in the smaller language editions. Our model is of course independent of the size and editorial conventions of different language editions.

Ed: Were there any differences in the way conflicts arose / were resolved in the different language versions?

Taha: We found the main differences to be in the topics of controversial articles. Although some topics, like religion and politics, are globally debated, many are controversial only in a single language edition. This reflects the local preferences and the importance assigned to topics by different editorial communities. The ways editorial wars start and, more importantly, fade to consensus also differ between language editions. In some languages moderators intervene very early, while in others a war might go on for a long time without any moderation.

Ed: In general, what were the most controversial topics in each language? And overall?

Taha: Generally, religion, politics, and geographical places like countries and cities (sometimes even villages) are the topics of debate. But each language edition also has its own focus: for example, football in Spanish and Portuguese, animations and TV series in Chinese and Japanese, sex- and gender-related topics in Czech, and science and technology in French Wikipedia are very often behind editing wars.

Ed: What other quantitative studies of this sort of conflict (ie over knowledge and points of view) are there?

Taha: My favourite work is one by researchers from Barcelona Media Lab. In their paper Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia they provide quantitative evidence that editors interested in political topics identify themselves more significantly as Wikipedians than as political activists, even though they try hard to reflect their opinions and political orientations in the articles they contribute to. And I think that’s the key issue here. While there are lots of debates and editorial wars between editors, at the end what really counts for most of them is Wikipedia as a whole project, and the concept of shared knowledge. It might explain how Wikipedia really works despite all the diversity among its editors.

Ed: How would you like to extend this work?

Taha: Of course some of the controversial topics change over time. While Jesus might stay a controversial figure for a long time, I’m sure the article on President (W) Bush will soon reach a consensus and most likely disappear from the list of the most controversial articles. In the current study we examined the aggregated data from the inception of each Wikipedia-edition up to March 2010. One possible extension that we are working on now is to study the dynamics of these controversy-lists and the positions of topics in them.

Read the full paper: Yasseri, T., Spoerri, A., Graham, M. and Kertész, J. (2014) The most controversial topics in Wikipedia: A multilingual and geographical analysis. In: P.Fichman and N.Hara (eds) Global Wikipedia: International and cross-cultural issues in online collaboration. Scarecrow Press.


Taha was talking to blog editor David Sutcliffe.

Taha Yasseri is the Big Data Research Officer at the OII. Prior to coming to the OII, he spent two years as a Postdoctoral Researcher at the Budapest University of Technology and Economics, working on the socio-physical aspects of the community of Wikipedia editors, focusing on conflict and editorial wars, along with Big Data analysis to understand human dynamics, language complexity, and popularity spread. He has interests in analysis of Big Data to understand human dynamics, government-society interactions, mass collaboration, and opinion dynamics.

]]>
The physics of social science: using big data for real-time predictive modelling https://ensr.oii.ox.ac.uk/physics-of-social-science-using-big-data-for-real-time-predictive-modelling/ Thu, 21 Nov 2013 09:49:27 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2320 Ed: You are interested in analysis of big data to understand human dynamics; how much work is being done in terms of real-time predictive modelling using these data?

Taha: The socially generated transactional data that we call "big data" have become available only very recently; the amount of data we now produce about human activities in a year is comparable to the amount that used to be produced over decades (or centuries). And this is all due to recent advancements in ICTs. Despite the short period of availability of big data, their use in different sectors, including academia and business, has been significant. However, in many cases the use of big data is limited to monitoring and post hoc analysis of different patterns. Predictive models have rarely been used in combination with big data. Nevertheless, there are very interesting examples of using big data to make predictions about disease outbreaks, financial moves in the markets, social interactions based on human mobility patterns, election results, etc.

Ed: What were the advantages of using Wikipedia as a data source for your study — as opposed to Twitter, blogs, Facebook or traditional media, etc.?

Taha: Our results have shown that the predictive power of Wikipedia page view and edit data outperforms similar box-office prediction models based on Twitter data. This can partially be explained by considering the different nature of Wikipedia compared to social media sites. Wikipedia is now the number one source of online information, and Wikipedia article page view statistics show how much Internet users have been interested in knowing about a specific movie. And the edit counts — even more importantly — indicate the level of interest of the editors in sharing their knowledge about the movies with others. Both indicators are much stronger than what you could measure on Twitter, which is mainly the reaction of the users after watching or reading about the movie. The cost of participation in Wikipedia's editorial process makes the activity data more revealing about the potential popularity of the movies.

Another advantage is the sheer availability of Wikipedia data. Twitter streams, by comparison, are limited in both size and time. Gathering Facebook data is also problematic, whereas all the Wikipedia editorial activities and page views are recorded in full detail — and made publicly available.

Ed: Could you briefly describe your method and model?

Taha: We retrieved two sets of data from Wikipedia relating to our set of 312 movies: the editorial activity and the page views. The former indicates the popularity of a movie among Wikipedia editors, the latter among Wikipedia readers. We then defined different measures based on these two data streams (eg number of edits, number of unique editors, etc.). In the next step we combined these measures into a linear model that assumes that the more popular a movie is, the larger these parameters will be. However, this model needs both training and calibration. We calibrated the model using IMDb data on the financial success of a set of "training" movies. After calibration, we applied the model to a set of "test" movies and (luckily) saw that it worked very well in predicting their financial success.
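The calibration step can be sketched as ordinary least squares on the training movies. For brevity this toy version fits a single predictor (page views) with invented numbers, rather than the article's full multi-parameter model.

```python
def fit_simple(xs, ys):
    """Ordinary least squares for y = w0 + w1*x (one-feature case)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
         / sum((x - mx) ** 2 for x in xs)
    w0 = my - w1 * mx
    return w0, w1

# Hypothetical calibration data: (page views in millions, gross in $m)
# for four "training" movies.
views = [1.0, 2.0, 3.0, 4.0]
gross = [10.0, 21.0, 29.0, 40.0]
w0, w1 = fit_simple(views, gross)

# Predict the gross of a "test" movie with 5.0m page views.
print(round(w0 + w1 * 5.0, 1))  # → 49.5
```

With several activity measures instead of one, the same idea becomes a multiple regression; the training/test split mirrors the calibration-then-validation procedure described above.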

Ed: What were the most significant variables in terms of predictive power; and did you use any content or sentiment analysis?

Taha: The nice thing about this method is that you don’t need to perform any content or sentiment analysis. We deal only with volumes of activities and their evolution over time. The parameter that correlated best with financial success (and which was therefore the best predictor) was the number of page views. I can easily imagine that these days if someone wants to go to watch a movie, they most likely turn to the Internet and make a quick search. Thanks to Google, Wikipedia is going to be among the top results and it’s very likely that the click will go to the Wikipedia article about the movie. I think that’s why the page views correlate to the box office takings so significantly.

Ed: Presumably people are picking up on signals, ie Wikipedia is acting like an aggregator and normaliser of disparate environmental signals — what do you think these signals might be, in terms of box office success? ie is it ultimately driven by the studio media machine?

Taha: This is a very difficult question to answer. There are numerous factors that make a movie (or a product in general) popular. Studio marketing strategies definitely play an important role, but the quality of the movie, the collective mood of the public, herding effects, and many other hidden variables are involved as well. I hope our research serves as a first step in studying popularity in a quantitative framework, letting us answer such questions. To fully understand a system the first thing you need is a tool to monitor and observe it very well quantitatively. In this research we have shown that (for example) Wikipedia is a nice window and useful tool to observe and measure popularity and its dynamics; hopefully leading to a deep understanding of the underlying mechanisms as well.

Ed: Is there similar work / approaches to what you have done in this study?

Taha: There have been other projects using socially generated data to make predictions about the popularity of movies or movements in financial markets; however, to the best of my knowledge, this was the first time that Wikipedia data had been used to feed the models. We were positively surprised when we observed that these data have stronger predictive power than previously examined datasets.

Ed: If you have essentially shown that ‘interest on Wikipedia’ tracks ‘real-world interest’ (ie box office receipts), can this be applied to other things? eg attention to legislation, political scandal, environmental issues, humanitarian issues: ie Wikipedia as “public opinion monitor”?

Taha: I think so. I'm now running two other projects using a similar approach: one to predict election outcomes, and the other to do opinion mining about new policies implemented by governing bodies. In the case of elections, we have observed very strong correlations between changes in the information-seeking rates of the general public and the number of ballots cast. And in the case of new policies, I think Wikipedia could be of great help in understanding the level of public interest in searching for accurate information about the policies, and how this interest is satisfied by the information provided online. And more interestingly, how this changes over time as the new policy is fully implemented.

Ed: Do you think there are / will be practical applications of using social media platforms for prediction, or is the data too variable?

Taha: Although the availability and popularity of social media are recent phenomena, I’m sure that social media data are already being used by different bodies for predictions in various areas. We have seen very nice examples of using these data to predict disease outbreaks or the arrival of earthquake waves. The future of this field is very promising, considering both the advancements in the methodologies and also the increase in popularity and use of social media worldwide.

Ed: How practical would it be to generate real-time processing of this data — rather than analysing databases post hoc?

Taha: Data collection and analysis could be done instantly. However the challenge would be the calibration. Human societies and social systems — similarly to most complex systems — are non-stationary. That means any statistical property of the system is subject to abrupt and dramatic changes. That makes it a bit challenging to use a stationary model to describe a continuously changing system. However, one could use a class of adaptive models or Bayesian models which could modify themselves as the system evolves and more data are available. All these could be done in real time, and that’s the exciting part of the method.
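As a toy example of an estimator that adapts to a non-stationary signal, an exponentially weighted moving average discounts old observations, so the estimate tracks a level shift instead of averaging over the whole history; the data below are invented.

```python
def ewma(stream, alpha=0.5):
    """Exponentially weighted moving average: higher alpha means
    faster forgetting of old observations."""
    estimate = None
    out = []
    for x in stream:
        estimate = x if estimate is None else alpha * x + (1 - alpha) * estimate
        out.append(estimate)
    return out

# The underlying level jumps from 0 to 10 halfway through;
# the adaptive estimate follows it within a few steps.
print(ewma([0, 0, 0, 10, 10, 10]))  # climbs: 5.0, 7.5, 8.75 after the jump
```

A plain mean over the same stream would sit at 5.0 forever; Bayesian variants replace the fixed `alpha` with a posterior that itself updates as evidence of a regime change accumulates.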

Ed: As a physicist; what are you learning in a social science department? And what does physicist bring to social science and the study of human systems?

Taha: Looking at complicated phenomena in a simple way is the art of physics. As Einstein said, a physicist always tries to "make things as simple as possible, but not simpler". And that works very well in describing natural phenomena, ranging from sub-atomic interactions all the way up to cosmology. However, studying social systems with the tools of the natural sciences can be very challenging, and sometimes too much simplification makes it very difficult to understand the real underlying mechanisms. Working with social scientists, I'm learning a lot about the importance of the individual attributes of (and variations between) the elements of the systems under study, outliers, self-awareness, ethical issues related to data, agency and self-adaptation, and many other details that are mostly overlooked when a physicist studies a social system.

At the same time, I try to contribute the methodological approaches and quantitative skills that physicists have gained during two centuries of studying complex systems. I think statistical physics is an amazing example where statistical techniques can be used to describe the macro-scale collective behaviour of billions and billions of atoms with a single formula. I should admit here that humans are way more complicated than atoms — but the dialogue between natural scientists and social scientists could eventually lead to multi-scale models which could help us to gain a quantitative understanding of social systems, thereby facilitating accurate predictions of social phenomena.

Ed: What database would you like access to, if you could access anything?

Taha: I daydream about a database of search queries from all Internet users worldwide, at the individual level. These data are being collected continuously by search engines and could technically be accessed, but due to privacy policies it's impossible to get hold of them, even if only for research purposes. This is another difference between social systems and natural systems. An atom never gets upset at being watched through a microscope all the time, but working on social systems and human-related data requires a lot of care with respect to privacy and ethics.

Read the full paper: Mestyán, M., Yasseri, T., and Kertész, J. (2013) Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. PLoS ONE 8 (8) e71226.


Taha Yasseri was talking to blog editor David Sutcliffe.


Can text mining help handle the data deluge in public policy analysis? https://ensr.oii.ox.ac.uk/can-text-mining-help-handle-data-deluge-public-policy-analysis/ Sun, 27 Oct 2013 12:29:01 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2273 Policy makers today must contend with two inescapable phenomena. On the one hand, there has been a major shift in the policies of governments concerning participatory governance – that is, engaged, collaborative, and community-focused public policy. At the same time, a significant proportion of government activities have now moved online, bringing about “a change to the whole information environment within which government operates” (Margetts 2009, 6).

Indeed, the Internet has become the main medium of interaction between government and citizens, and numerous websites offer opportunities for online democratic participation. The Hansard Society, for instance, regularly runs e-consultations on behalf of UK parliamentary select committees. For example, e-consultations have been run on the Climate Change Bill (2007), the Human Tissue and Embryo Bill (2007), and on domestic violence and forced marriage (2008). Councils and boroughs also regularly invite citizens to take part in online consultations on issues affecting their area. The London Borough of Hammersmith and Fulham, for example, recently asked its residents for their views on Sex Entertainment Venues and Sex Establishment Licensing policy.

However, citizen participation poses certain challenges for the design and analysis of public policy. In particular, governments and organizations must demonstrate that all opinions expressed through participatory exercises have been duly considered and carefully weighted before decisions are reached. One method for partly automating the interpretation of large quantities of online content typically produced by public consultations is text mining. Software products currently available range from those primarily used in qualitative research (integrating functions like tagging, indexing, and classification), to those integrating more quantitative and statistical tools, such as word frequency and cluster analysis (more information on text mining tools can be found at the National Centre for Text Mining).

While these methods have certainly attracted criticism and skepticism in terms of the interpretability of the output, they offer four important advantages for the analyst: namely categorization, data reduction, visualization, and speed.

1. Categorization. When analyzing the results of consultation exercises, analysts and policymakers must make sense of the high volume of disparate responses they receive; text mining supports the structuring of large amounts of this qualitative, discursive data into predefined or naturally occurring categories by storage and retrieval of sentence segments, indexing, and cross-referencing. Analysis of sentence segments from respondents with similar demographics (eg age) or opinions can itself be valuable, for example in the construction of descriptive typologies of respondents.

2. Data Reduction. Data reduction techniques include stemming (reduction of a word to its root form), combining of synonyms, and removal of non-informative “tool” or stop words. Hierarchical classifications, cluster analysis, and correspondence analysis methods allow the further reduction of texts to their structural components, highlighting the distinctive points of view associated with particular groups of respondents.

3. Visualization. Important points and interrelationships are easy to miss when read by eye, and rapid generation of visual overviews of responses (eg dendrograms, 3D scatter plots, heat maps, etc.) make large and complex datasets easier to comprehend in terms of identifying the main points of view and dimensions of a public debate.

4. Speed. Speed depends on whether a special dictionary or vocabulary needs to be compiled for the analysis, and on the amount of coding required. Coding is usually relatively fast and straightforward, and the succinct overview of responses provided by these methods can reduce the time needed to analyse consultation responses.
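The data-reduction step in point 2 above can be pictured with a minimal sketch: stop-word removal plus crude suffix stripping ("stemming"). The stop-word and suffix lists below are invented illustrations, not a real linguistic resource; production tools use full stemming dictionaries.

```python
# A minimal sketch of the data-reduction steps in point 2 above: stop-word
# removal and crude suffix stripping ("stemming"). The stop-word and suffix
# lists are invented illustrations, not a real linguistic resource.

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "that"}
SUFFIXES = ("ing", "ness", "ed", "es", "s")  # checked in this order

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three root characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def reduce_text(text):
    """Lower-case, drop stop words and non-alphabetic tokens, then stem."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(reduce_text("The council is considering licensing of venues"))
# ['council', 'consider', 'licens', 'venu']
```

Note how naive suffix stripping produces non-words like "licens" – exactly the kind of semantic blurring (cf. "ill+ness" vs "ill+defined") that the limitations section below warns about.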

Despite the above advantages of automated approaches to consultation analysis, text mining methods present several limitations. Automatic classification of responses runs the risk of missing or miscategorising distinctive or marginal points of view if sentence segments are too short, or if they rely on a rare vocabulary. Stemming can also generate problems if important semantic variations are overlooked (eg lumping together ‘ill+ness’, ‘ill+defined’, and ‘ill+ustration’). Other issues applicable to public e-consultation analysis include the danger that analysts distance themselves from the data, especially when converting words to numbers. This is quite apart from the issues of inter-coder reliability and data preparation, missing data, and insensitivity to figurative language, meaning and context, which can also result in misclassification when not human-verified.

However, when responding to criticisms of specific tools, we need to remember that different text mining methods are complementary, not mutually exclusive. A single solution to the analysis of qualitative or quantitative data would be very unlikely; and at the very least, exploratory techniques provide a useful first step that could be followed by a theory-testing model, or by triangulation exercises to confirm results obtained by other methods.

Apart from these technical issues, policy makers and analysts employing text mining methods for e-consultation analysis must also consider certain ethical issues in addition to those of informed consent, privacy, and confidentiality. First (of relevance to academics), respondents may not expect to end up as research subjects. They may simply be expecting to participate in a general consultation exercise, interacting exclusively with public officials and not indirectly with an analyst post hoc; much less ending up as a specific, traceable data point.

This has been a particularly delicate issue for healthcare professionals. Sharf (1999, 247) describes various negative experiences of following up online postings: one woman, on being contacted by a researcher seeking consent to gain insights from breast cancer patients about their personal experiences, accused the researcher of behaving voyeuristically and “taking advantage of people in distress.” Statistical interpretation of responses also presents its own issues, particularly if analyses are to be returned or made accessible to respondents.

Respondents might also be confused about or disagree with text mining as a method applied to their answers; indeed, it could be perceived as dehumanizing – reducing personal opinions and arguments to statistical data points. In a public consultation, respondents might feel somewhat betrayed that their views and opinions eventually result in just a dot on a correspondence analysis with no immediate, apparent meaning or import, at least in lay terms. Obviously the consultation organizer needs to outline clearly and precisely how qualitative responses can be collated into a quantifiable account of a sample population’s views.

This is an important point; in order to reduce both technical and ethical risks, researchers should ensure that their methodology combines both qualitative and quantitative analyses. While many text mining techniques provide useful statistical output, the UK Government’s prescribed Code of Practice on public consultation is quite explicit on the topic: “The focus should be on the evidence given by consultees to back up their arguments. Analyzing consultation responses is primarily a qualitative rather than a quantitative exercise” (2008, 12). This suggests that the perennial debate between quantitative and qualitative methodologists needs to be updated and better resolved.

References

Margetts, H. 2009. “The Internet and Public Policy.” Policy & Internet 1 (1).

Sharf, B. 1999. “Beyond Netiquette: The Ethics of Doing Naturalistic Discourse Research on the Internet.” In Doing Internet Research, ed. S. Jones, London: Sage.


Read the full paper: Bicquelet, A., and Weale, A. (2011) Coping with the Cornucopia: Can Text Mining Help Handle the Data Deluge in Public Policy Analysis? Policy & Internet 3 (4).

Dr Aude Bicquelet is a Fellow in LSE’s Department of Methodology. Her main research interests include computer-assisted analysis, Text Mining methods, comparative politics and public policy. She has published a number of journal articles in these areas and is the author of a forthcoming book, “Textual Analysis” (Sage Benchmarks in Social Research Methods, in press).

Can Twitter provide an early warning function for the next pandemic? https://ensr.oii.ox.ac.uk/can-twitter-provide-an-early-warning-function-for-the-next-flu-pandemic/ Mon, 14 Oct 2013 08:00:41 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1241
Communication of risk in any public health emergency is a complex task for healthcare agencies; a task made more challenging when citizens are bombarded with online information. Mexico City, 2009. Image by Eneas.


Ed: Could you briefly outline your study?

Patty: We investigated the role of Twitter during the 2009 swine flu pandemic from two perspectives. Firstly, we demonstrated the ability of the social network to detect an upcoming spike in an epidemic before the official surveillance systems – up to a week in the UK and up to 2-3 weeks in the US – by investigating users who “self-diagnosed” themselves, posting tweets such as “I have flu / swine flu”. Secondly, we illustrated how online resources reporting the WHO declaration of a “pandemic” on 11 June 2009 were propagated through Twitter during the 24 hours after the official announcement [1,2,3].
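The "self-diagnosis" filter Patty describes can be pictured as a simple pattern match over tweet text. The sketch below is a hypothetical illustration of the idea only – the study's actual pipeline also involved spam and duplicate removal before counting:

```python
import re

# A hypothetical sketch of the "self-diagnosis" filter: matching first-person
# "I have (swine) flu" phrasings. The pattern is invented for illustration;
# the actual study also removed spam and duplicates before counting tweets.

SELF_DIAGNOSIS = re.compile(
    r"\bi\s+have\s+(?:the\s+)?(?:swine\s+)?flu\b", re.IGNORECASE)

def is_self_diagnosis(tweet: str) -> bool:
    """True if the tweet looks like a first-person flu report."""
    return bool(SELF_DIAGNOSIS.search(tweet))

print(is_self_diagnosis("Ugh, I have swine flu :("))       # True
print(is_self_diagnosis("reading about swine flu cases"))  # False
```

Counting the daily volume of such matches gives the raw signal that is then compared against official surveillance curves.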

Ed: Disease control agencies already routinely follow media sources; are public health agencies aware of social media as another valuable source of information?

Patty:  Social media are providing an invaluable real-time data signal complementing well-established epidemic intelligence (EI) systems monitoring online media, such as MedISys and GPHIN. While traditional surveillance systems will remain the pillars of public health, online media monitoring has added an important early-warning function, with social media bringing  additional benefits to epidemic intelligence: virtually real-time information available in the public domain that is contributed by users themselves, thus not relying on the editorial policies of media agencies.

Public health agencies (such as the European Centre for Disease Prevention and Control) are interested in social media early warning systems, but more research is required to develop robust social media monitoring solutions that are ready to be integrated with agencies’ EI services.

Ed: How difficult is this data to process? Eg: is this a full sample, processed in real-time?

Patty:  No, obtaining all Twitter search query results is not possible. In our 2009 pilot study we were accessing data from Twitter using a search API interface querying the database every minute (the number of results was limited to 100 tweets). Currently, only 1% of the ‘Firehose’ (massive real-time stream of all public tweets) is made available using the streaming API. The searches have to be performed in real-time as historical Twitter data are normally available only through paid services. Twitter analytics methods are diverse; in our study, we used frequency calculations, developed algorithms for geo-location, automatic spam and duplication detection, and applied time series and cross-correlation with surveillance data [1,2,3].
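The cross-correlation step Patty mentions – checking how far the Twitter signal leads the official surveillance series – can be sketched as a lag search over a correlation measure. All numbers below are invented illustrations, not the study's actual data or parameters:

```python
# A sketch of the lag analysis described above: cross-correlating a daily
# tweet-count series against an official surveillance series to estimate
# how many days the social-media signal leads. All numbers are invented.

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy)

def best_lead(tweets, surveillance, max_lag=21):
    """Lag (in days) at which tweet counts best match the surveillance
    series; a positive result means the Twitter signal leads."""
    n = len(tweets)
    return max(range(max_lag + 1),
               key=lambda lag: pearson(tweets[:n - lag], surveillance[lag:]))

# Synthetic example: surveillance repeats the tweet curve seven days later.
tweets = [1, 2, 5, 12, 30, 55, 40, 22, 10, 6, 4, 3, 2, 2,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
surveillance = [1] * 7 + tweets[:21]
print(best_lead(tweets, surveillance, max_lag=10))  # 7
```

In the synthetic example the best-fitting lag recovers the built-in seven-day lead, mirroring the "up to a week in the UK" finding in spirit.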

Ed: What’s the relationship between traditional and social media in terms of diffusion of health information? Do you have a sense that one may be driving the other?

Patty: This is a fundamental question. “Does media coverage of a certain topic cause buzz on social media, or does social media discussion cause a media frenzy?” This was particularly important to investigate for the 2009 swine flu pandemic, which experienced unprecedented media interest. While it could be assumed that disease cases preceded media coverage, or that media discussion sparked public interest causing Twitter debate, neither proved to be the case in our experiment. On some days media coverage of flu was higher, and on others Twitter discussion was higher; but the peaks seemed synchronized – happening on the same days.

Ed: In terms of communicating accurate information, does the Internet make the job easier or more difficult for health authorities?

Patty: The communication of risk in any public health emergency is a complex task for government and healthcare agencies; this task is made more challenging when citizens are bombarded with online information from a variety of sources that vary in accuracy. This has become even more challenging with the increase in users accessing health-related information on their mobile phones (17% in 2010 and 31% in 2012, according to the US Pew Internet study).

Our findings from analyzing Twitter reaction to online media coverage of the WHO declaration of swine flu as a “pandemic” (stage 6) on 11 June 2009, which unquestionably was the most media-covered event during the 2009 epidemic, indicated that Twitter does favour reputable sources (such as the BBC, which was by far the most popular) but also that bogus information can still leak into the network.

Ed: What differences do you see between traditional and social media, in terms of eg bias / error rate of public health-related information?

Patty: Fully understanding the quality of media coverage of health topics such as the 2009 swine flu pandemic in terms of bias and medical accuracy would require a qualitative study (for example, one conducted by Duncan in the EU [4]). However, the main role of social media, in particular Twitter due to the 140 character limit, is to disseminate media coverage by propagating links rather than creating primary health information about a particular event. In our study around 65% of tweets analysed contained a link.

Ed: Google flu trends (which monitors user search terms to estimate worldwide flu activity) has been around a couple of years: where is that going? And how useful is it?

Patty: Search companies such as Google have demonstrated that online search queries for keywords relating to flu and its symptoms can serve as a proxy for the number of individuals who are sick (Google Flu Trends). However, in 2013 the system “drastically overestimated peak flu levels”, as reported by Nature. Most importantly, unlike Twitter, Google search queries remain proprietary and are therefore not useful for research or the construction of non-commercial applications.

Ed: What are implications of social media monitoring for countries that may want to suppress information about potential pandemics?

Patty: Event-based surveillance and monitoring of social media for epidemic intelligence are particularly important in countries with sub-optimal surveillance systems and those lacking the capacity for outbreak preparedness and response. The role of user-generated information on social media is also particularly important in countries with limited freedom of the press, or those that actively try to suppress information about potential outbreaks.

Ed: Would it be possible with this data to follow spread geographically, ie from point sources, or is population movement too complex to allow this sort of modelling?

Patty: Spatio-temporal modelling is technically possible, as tweets are time-stamped and there is support for geo-tagging. However, the location of all tweets can’t be precisely identified; early warning systems will improve in accuracy as geo-tagging of user-generated content becomes widespread. Mathematical modelling of the spread of diseases and of population movements are very topical research challenges (undertaken, for example, by Colizza et al. [5]), but modelling social media user behaviour during health emergencies to provide a robust baseline for early disease detection remains a challenge.

Ed: A strength of monitoring social media is that it follows what people do already (eg search / Tweet / update statuses). Are there any mobile / SNS apps to support collection of epidemic health data? eg a sort of ‘how are you feeling now’ app?

Patty: The strength of early warning systems using social media lies exactly in the ability to piggy-back on existing user behaviour rather than having to recruit participants. However, there are a growing number of participatory surveillance systems that ask users to provide their symptoms (web-based, such as Flusurvey in the UK, and “Flu Near You” in the US, which also exists as a mobile app). While interest in self-reporting systems is growing, challenges include their reliability, user recruitment and long-term retention, and integration with public health services; these remain open research questions for the future. There is also potential for public health services to use social media in two ways – providing information over the networks rather than only collecting user-generated content. Social media could be used to provide evidence-based advice and personalized health information directly to affected citizens where they need it and when they need it, thus effectively engaging them in active management of their health.

References

[1.] M Szomszor, P Kostkova, C St Louis: Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemics, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, WI-IAT, Vol. 1, pp.320-323.

[2.]  Szomszor, M., Kostkova, P., de Quincey, E. (2010). #swineflu: Twitter Predicts Swine Flu Outbreak in 2009. M Szomszor, P Kostkova (Eds.): ehealth 2010, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 69, pages 18-26, 2011.

[3.] Ed de Quincey, Patty Kostkova Early Warning and Outbreak Detection Using Social Networking Websites: the Potential of Twitter, P Kostkova (Ed.): ehealth 2009, Springer Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering LNICST 27, pages 21-24, 2010.

[4.] B Duncan. How the Media reported the first day of the pandemic H1N1) 2009: Results of EU-wide Media Analysis. Eurosurveillance, Vol 14, Issue 30, July 2009

[5.] Colizza V, Barrat A, Barthelemy M, Valleron AJ, Vespignani A (2007) Modeling the worldwide spread of pandemic influenza: Baseline case and containment interventions. PLoS Med 4(1): e13. doi:10.1371/journal.pmed.0040013

Further information on this project and related activities can be found at: BMJ-funded scientific film: http://www.youtube.com/watch?v=_JNogEk-pnM ; Can Twitter predict disease outbreaks? http://www.bmj.com/content/344/bmj.e2353 ; 1st International Workshop on Public Health in the Digital Age: Social Media, Crowdsourcing and Participatory Systems (PHDA 2013): http://www.digitalhealth.ws/ ; Social networks and big data meet public health @ WWW 2013: http://www2013.org/2013/04/25/social-networks-and-big-data-meet-public-health/


Patty Kostkova was talking to blog editor David Sutcliffe.

Dr Patty Kostkova is a Principal Research Associate in eHealth at the Department of Computer Science, University College London (UCL) and held a Research Scientist post at the ISI Foundation in Italy. Until 2012, she was the Head of the City eHealth Research Centre (CeRC) at City University, London, a thriving multidisciplinary research centre with expertise in computer science, information science and public health. In recent years, she was appointed a consultant at WHO responsible for the design and development of information systems for international surveillance.

Researchers who were instrumental in this project include Ed de Quincey, Martin Szomszor and Connie St Louis.

Who represents the Arab world online? https://ensr.oii.ox.ac.uk/arab-world/ Tue, 01 Oct 2013 07:09:58 +0000 http://blogs.oii.ox.ac.uk/policy/?p=2190
Editors from all over the world have played some part in writing about Egypt; in fact, only 13% of all edits actually originate in the country (38% are from the US). More: Who edits Wikipedia? by Mark Graham.

Ed: In basic terms, what patterns of ‘information geography’ are you seeing in the region?

Mark: The first pattern that we see is that the Middle East and North Africa are relatively under-represented in Wikipedia. Even after accounting for factors like population, Internet access, and literacy, we still see less content than would be expected. Second, of the content that exists, a lot of it is in English and French rather than in Arabic (or Farsi or Hebrew). In other words, there is even less in local languages.

And finally, if we look at contributions (or edits), not only do we also see a relatively small number of edits originating in the region, but many of those edits are being used to write about other parts of the world rather than their own region. What this broadly seems to suggest is that the participatory potentials of Wikipedia aren’t yet being harnessed in order to even out the differences between the world’s informational cores and peripheries.

Ed: How closely do these online patterns in representation correlate with regional (offline) patterns in income, education, language, access to technology (etc.) Can you map one to the other?

Mark: Population and broadband availability alone explain a lot of the variance that we see. Other factors like income and education also play a role, but it is population and broadband that have the greatest explanatory power here. Interestingly, it is mostly countries in the MENA region that fail to fit those predictors well.

Ed: How much do you think these patterns result from the systematic imposition of a particular view point – such as official editorial policies – as opposed to the (emergent) outcome of lots of users and editors acting independently?

Mark: Particular modes of governance in Wikipedia likely do play a factor here. The Arabic Wikipedia, for instance, to combat vandalism has a feature whereby changes to articles need to be reviewed before being made public. This alone seems to put off some potential contributors. Guidelines around sourcing in places where there are few secondary sources also likely play a role.

Ed: How much discussion (in the region) is there around this issue? Is this even acknowledged as a fact or problem?

Mark: I think it certainly is recognised as an issue now. But there are few viable alternatives to Wikipedia. Our goal is hopefully to identify problems that lead to solutions, rather than simply discouraging people from even using the platform.

Ed: This work has been covered by the Guardian, Wired, the Huffington Post (etc.) How much interest has there been from the non-Western press or bloggers in the region?

Mark: There has been a lot of coverage from the non-Western press, particularly in Latin America and Asia. However, I haven’t actually seen that much coverage from the MENA region.

Ed: As an academic, do you feel at all personally invested in this, or do you see your role to be simply about the objective documentation and analysis of these patterns?

Mark: I don’t believe there is any such thing as ‘objective documentation.’ All research has particular effects in and on the world, and I think it is important to be aware of the debates, processes, and practices surrounding any research project. Personally, I think Wikipedia is one of humanity’s greatest achievements. No previous single platform or repository of knowledge has ever even come close to Wikipedia in terms of its scale or reach. However, that is all the more reason to critically investigate what exactly is, and isn’t, contained within this fantastic resource. By revealing some of the biases and imbalances in Wikipedia, I hope that we’re doing our bit to improve it.

Ed: What factors do you think would lead to greater representation in the region? For example: is this a matter of voices being actively (or indirectly) excluded, or are they maybe just not all that bothered?

Mark: This is certainly a complicated question. I think the most important step would be to encourage participation from the region, rather than just representation of the region. Some of this involves increasing some of the enabling factors that are the prerequisites for participation; factors like: increasing broadband access, increasing literacy, encouraging more participation from women and minority groups.

Some of it is then changing perceptions around Wikipedia. For instance, many people that we spoke to in the region framed Wikipedia as an American or outside project rather than something that is locally created. Unfortunately we seem to be currently stuck in a vicious cycle in which few people from the region participate, therefore fulfilling the very reason why some people think that they shouldn’t participate. There is also the issue of sources. Not only does Wikipedia require all assertions to be properly sourced, but secondary sources themselves can be a great source of raw informational material for Wikipedia articles. However, if few sources about a place exist, then it adds an additional burden to creating content about that place. Again, a vicious cycle of geographic representation.

My hope is that by both working on some of the necessary conditions to participation, and engaging in a diverse range of initiatives to encourage content generation, we can start to break out of some of these vicious cycles.

Ed: The final moonshot question: How would you like to extend this work; time and money being no object?

Mark: Ideally, I’d like us to better understand the geographies of representation and participation outside of just the MENA region. This would involve mixed-methods (large scale big data approaches combined with in-depth qualitative studies) work focusing on multiple parts of the world. More broadly, I’m trying to build a research program that maintains a focus on a wide range of Internet and information geographies. The goal here is to understand participation and representation through a diverse range of online and offline platforms and practices and to share that work through a range of publicly accessible media: for instance the ‘Atlas of the Internet’ that we’re putting together.


Mark Graham was talking to blog editor David Sutcliffe.

Mark Graham is a Senior Research Fellow at the OII. His research focuses on Internet and information geographies, and the overlaps between ICTs and economic development.

Predicting elections on Twitter: a different way of thinking about the data https://ensr.oii.ox.ac.uk/predicting-elections-on-twitter-a-different-way-of-thinking-about-the-data/ Sun, 04 Aug 2013 11:43:52 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1498 GOP presidential nominee Mitt Romney
GOP presidential nominee Mitt Romney, centre, waving to crowd, after delivering his acceptance speech on the final night of the 2012 Republican National Convention. Image by NewsHour.

Recently, there has been a lot of interest in the potential of social media as a means to understand public opinion. Driven by an interest in the potential of so-called “big data”, this development has been fuelled by a number of trends. Governments have been keen to create techniques for what they term “horizon scanning”, which broadly means searching for the indications of emerging crises (such as runs on banks or emerging natural disasters) online, and reacting before the problem really develops. Governments around the world are already committing massive resources to developing these techniques. In the private sector, big companies’ interest in brand management has fitted neatly with the potential of social media monitoring. A number of specialised consultancies now claim to be able to monitor and quantify reactions to products, interactions or bad publicity in real time.

It should therefore come as little surprise that, like other research methods before, these new techniques are now crossing over into the competitive political space. Social media monitoring, which in theory can extract information from tweets and Facebook posts and quantify positive and negative public reactions to people, policies and events, has an obvious utility for politicians seeking office. Broadly, the process works like this: vast datasets relating to an election, often running into millions of items, are gathered from social media sites such as Twitter. These data are then analysed using natural language processing software, which automatically identifies qualities relating to candidates or policies and attributes a positive or negative sentiment to each item. Finally, these sentiments and other properties mined from the text are totalised, to produce an overall figure for public reaction on social media.
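The pipeline described above – score each item, then totalise per candidate – can be illustrated with a toy lexicon-based scorer. The lexicon and tweets below are invented for illustration; real systems use trained natural language processing models rather than a hand-written word list:

```python
# A toy version of the pipeline described above: each item is scored
# against a sentiment lexicon and the scores are totalised per candidate.
# The lexicon and tweets are invented; real systems use trained NLP models.

from collections import defaultdict

LEXICON = {"great": 1, "win": 1, "strong": 1, "weak": -1, "fail": -1, "bad": -1}

def score(text):
    """Sum of lexicon values for the words in one item."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def totalise(items):
    """items: iterable of (candidate, text) pairs -> total score per candidate."""
    totals = defaultdict(int)
    for candidate, text in items:
        totals[candidate] += score(text)
    return dict(totals)

tweets = [("Romney", "strong debate a great win"),
          ("Romney", "weak answer on the economy"),
          ("Santorum", "great ground game")]
print(totalise(tweets))  # {'Romney': 2, 'Santorum': 1}
```

Even this toy shows why the method is contested: the final figure compresses every nuance of each message into a single signed number per candidate.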

These techniques have already been employed by the mainstream media to report on the 2010 British general election (when the country had its first leaders debate, an event ripe for this kind of research) and also in the 2012 US presidential election. This growing prominence led my co-author Mike Jensen of the University of Canberra and myself to question: exactly how useful are these techniques for predicting election results? In order to answer this question, we carried out a study on the Republican nomination contest in 2012, focused on the Iowa Caucus and Super Tuesday. Our findings are published in the current issue of Policy and Internet.

There are definite merits to this endeavour. US candidate selection contests are notoriously hard to predict with traditional public opinion measurement methods. This is because of the unusual and unpredictable make-up of the electorate. Voters are likely (to greater or lesser degrees depending on circumstances in a particular contest and election laws in the state concerned) to share a broadly similar outlook, so the electorate is harder for pollsters to model. Turnout can also vary greatly from one cycle to the next, adding an additional layer of unpredictability to the proceedings.

However, as any professional opinion pollster will quickly tell you, there is a big problem with trying to predict elections using social media. The people who use it are simply not like the rest of the population. In the case of the US, research from Pew suggests that only 16 per cent of internet users use Twitter, and while that figure goes up to 27 per cent of those aged 18-29, only 2 per cent of over 65s use the site. The proportion of the electorate voting within those categories, however, is the inverse: over 65s vote at a relatively high rate compared to the 18-29 cohort. Furthermore, given that we know (from research such as Matthew Hindman’s The Myth of Digital Democracy) that only a very small proportion of people online actually create content on politics, those who are commenting on elections become an even more unusual subset of the population.

Thus (and I can say this as someone who does use social media to talk about politics!) we are looking at an unrepresentative sub-set (those interested in politics) of an unrepresentative sub-set (those using social media) of the population. This is hardly a good omen for election prediction, which relies on modelling the voting population as closely as possible. As such, it seems foolish to suggest that a simple accumulation of individual preferences can be equated to voting intentions.

However, in our article we suggest a different way of thinking about social media data, more akin to James Surowiecki’s idea of The Wisdom of Crowds. The idea here is that citizens commenting on social media should not be treated like voters, but rather as commentators, seeking to understand and predict emerging political dynamics. As such, the method we operationalized was more akin to an electoral prediction market, such as the Iowa Electronic Markets, than a traditional opinion poll.

We looked for two things in our dataset: sudden changes in the number of mentions of a particular candidate, and words indicating momentum for a particular candidate, such as “surge”. We found that the former measure tracked Rick Santorum’s sudden surge in the Iowa caucus well, although it also tended to disproportionately emphasise many of the less successful candidates, such as Michele Bachmann. The latter method, on the other hand, picked up the Santorum surge without generating false positives, a finding certainly worth further investigation.
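The two signals described above can be sketched in a few lines of Python. This is purely illustrative: the daily counts, the spike threshold, and the momentum word list are invented for this example, not taken from the paper.

```python
# Illustrative sketch of the two signals: a sudden jump in a candidate's
# mention count, and tweets pairing a candidate with a "momentum" word.
# All counts, thresholds, and word lists here are invented examples.

daily_mentions = {
    "santorum": [120, 130, 125, 640],   # hypothetical daily mention counts
    "bachmann": [300, 310, 905, 280],
}
MOMENTUM_WORDS = {"surge", "surging", "momentum"}

def mention_spike(series, factor=3.0):
    """Flag a spike when the latest count exceeds `factor` times the mean of earlier days."""
    baseline = sum(series[:-1]) / len(series[:-1])
    return series[-1] > factor * baseline

def momentum_score(tweets, candidate):
    """Count tweets that mention the candidate alongside a momentum word."""
    return sum(
        1 for t in tweets
        if candidate in t.lower() and MOMENTUM_WORDS & set(t.lower().split())
    )

spiking = [name for name, series in daily_mentions.items() if mention_spike(series)]
```

With these invented numbers, only the final-day jump for Santorum clears the threshold; a spike on an earlier day (as in the Bachmann series) would need a rolling window rather than a single end-of-series check.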

Our aim in the paper was to present new ways of thinking about election prediction through social media, going beyond the paradigm established by the dominance of opinion polling. Our results indicate that there may be some value in this approach.


Read the full paper: Michael J. Jensen and Nick Anstead (2013) Psephological investigations: Tweets, votes, and unknown unknowns in the republican nomination process. Policy and Internet 5 (2) 161–182.

Dr Nick Anstead was appointed as a Lecturer in the LSE’s Department of Media and Communication in September 2010, with a focus on Political Communication. His research focuses on the relationship between existing political institutions and new media, covering such topics as the impact of the Internet on politics and government (especially e-campaigning), electoral competition and political campaigns, the history and future development of political parties, and political mobilisation and encouraging participation in civil society.

Dr Michael Jensen is a Research Fellow at the ANZSOG Institute for Governance (ANZSIG), University of Canberra. His research spans the subdisciplines of political communication, social movements, political participation, and political campaigning and elections. In the last few years, he has worked particularly with the analysis of social media data and other digital artefacts, contributing to the emerging field of computational social science.

]]>
Mapping the uneven geographies of information worldwide https://ensr.oii.ox.ac.uk/mapping-the-uneven-geographies-of-information-worldwide/ Tue, 11 Jun 2013 16:06:15 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1451 Map of Flickr activity worldwide
Images are an important form of knowledge that allow us to develop understandings about our world; the global geographic distribution of geotagged images on Flickr reveals the density of visual representations and locally depicted knowledge of all places on our planet. Map by M.Graham, M.Stephens, S.Hale.

Information is the raw material for much of the work that goes on in the contemporary global economy, and visibility and voice in this information ecosystem is a prerequisite for influence and control. As Hand and Sandywell (2002: 199) have argued, “digitalised knowledge and its electronic media are indeed synonymous with power.” As such, it is important to understand who produces and reproduces information, who has access to it, and who and where are represented by it.

Traditionally, information and knowledge about the world have been geographically constrained. The transmission of information required either the movement of people or the availability of some other medium of communication. However, up until the late 20th century, almost all media of information – books, newspapers, academic journals, patents and the like – were characterised by huge geographic inequalities. The global north produced, consumed and controlled much of the world’s codified knowledge, while the global south was largely left out.

Today, the movement of information is, in theory, rarely constrained by distance. Very few parts of the world remain disconnected from the grid, and over 2 billion people are now online (most of them in the Global South). Unsurprisingly, many believe we now have the potential to access what Wikipedia’s founder Jimmy Wales refers to as “the sum of all human knowledge”. Theoretically, parts of the world that have been left out of flows and representations of knowledge can be quite literally put back on the map.

However, “potential” has too often been confused with actual practice, and stark digital divisions of labour are still evident in all open platforms that rely on user-generated content. Google Map’s databases contain more indexed user-generated content about the Tokyo metropolitan region than the entire continent of Africa. On Wikipedia, there is more written about Germany than about South America and Africa combined. In other words, there are massive inequalities that cannot simply be explained by uneven Internet penetration. A range of other physical, social, political and economic barriers are reinforcing this digital divide, amplifying the informational power of the already powerful and visible.

That’s not to say that the Internet doesn’t have important implications for the developing world. People use it not just to connect with friends and family, but to learn, share information, trade, and represent their communities. However, it’s important to be aware of the Internet’s highly uneven geographies of information. These inequalities matter to the south, because connectivity – despite being a clear prerequisite for access to most 21st-century platforms of knowledge sharing – by no means guarantees knowledge production and digital participation.

How do we move towards encouraging participation from (and about) parts of the world that are currently left out of virtual representations? The first step is to allow people to see what is, and isn’t, represented; something we are planning with this project. After that, there’s also a clear need for plans like Kenya’s strategy to boost local digital content, or Wikimedia’s Arabic Catalyst project, which aims to encourage the creation of content in Arabic and provide information about the Middle East.

It remains to be seen how effective such strategies will be in changing the highly uneven digital division of labour. As we rely increasingly on user-generated platforms, there is a real possibility that we will see the widening of divides between “digital cores” and “peripheries”. It’s therefore crucial to keep asking where visibility, voice and power reside in our increasingly networked world.

References

Graham, M. and M. Zook. 2013. Augmented Realities and Uneven Geographies: Exploring the Geo-linguistic Contours of the Web. Environment and Planning A 45(1) 77-99.

Graham, M. 2013. The Virtual Dimension. In Global City Challenges: debating a concept, improving the practice. eds. M. Acuto and W. Steele. London: Palgrave.

Graham, M., M. Zook., and A. Boulton. 2012. Augmented Reality in the Urban Environment: contested content and the duplicity of code. Transactions of the Institute of British Geographers. DOI: 10.1111/j.1475-5661.2012.00539.x

Graham, M. 2013. The Knowledge Based Economy and Digital Divisions of Labour. In Companion to Development Studies, 3rd edition, eds V. Desai, and R. Potter. Hodder.

Hand, M. and B. Sandywell. 2002. E-topia as Cosmopolis or Citadel On the Democratizing and De-democratizing Logics of the Internet, or, Toward a Critique of the New Technological Fetishism. Theory, Culture & Society


Mark Graham‘s research focuses on Internet and information geographies, and the overlaps between ICTs and economic development. His work on the geographies of the Internet examines how people and places are ever more defined by, and made visible through, not only their traditional physical locations and properties, but also their virtual attributes and digital shadows.

Read Mark’s blog.

]]>
Investigating the structure and connectivity of online global protest networks https://ensr.oii.ox.ac.uk/investigating-the-structure-and-connectivity-of-online-global-protest-networks/ Mon, 10 Jun 2013 12:04:26 +0000 http://blogs.oii.ox.ac.uk/policy/?p=1275 How have online technologies reconfigured collective action? It is often assumed that the rise of social networking tools, accompanied by the mass adoption of mobile devices, has strengthened the impact and broadened the reach of today’s political protests. Enabling massive self-communication allows protesters to write their own interpretation of events – free from a mass media often seen as adversarial – and emerging protests may also benefit from the cheaper, faster transmission of information and more effective mobilization made possible by online tools such as Twitter.

The new networks of political protest, which harness these online technologies, are often described in theoretical terms as being ‘fluid’ and ‘horizontal’, in contrast to the rigid and hierarchical structure of earlier protest organization. Yet such theoretical assumptions have seldom been tested empirically. This new language of networks may be useful as a shorthand to describe protest dynamics, but does it accurately reflect how protest networks mediate communication and coordinate support?

The global protests against austerity and inequality which took place on May 12, 2012 provide an interesting case study to test the structure and strength of a transnational online protest movement. The ‘indignados’ movement emerged as a response to the Spanish government’s politics of austerity in the aftermath of the global financial crisis. The movement flared in May 2011, when hundreds of thousands of protesters marched in Spanish cities, and many set up camps ahead of municipal elections a week later.

These protests contributed to the emergence of the worldwide Occupy movement. After the original plan to occupy New York City’s financial district mobilised thousands of protesters in September 2011, the movement spread to other cities in the US and worldwide, including London and Frankfurt, before winding down as the camp sites were dismantled weeks later. Interest in these movements was revived, however, as the first anniversary of the ‘indignados’ protests approached in May 2012.

To test whether the fluidity, horizontality and connectivity often claimed for online protest networks hold true in reality, tweets referencing these protest movements during May 2012 were collected. These tweets were then classified as relating either to the ‘indignados’ or to the Occupy movement, using hashtags as a proxy for content. Many tweets, however, contained hashtags relevant to both movements, creating bridges across the two streams of information. The users behind those bridges acted as information ‘brokers’, and are fundamentally important to the global connectivity of the two movements: they joined the two streams of information and their audiences on Twitter. Once all the tweets were classified by content and author, it emerged that around 6.5% of all users posted at least one message relevant to both movements by using hashtags from both sides jointly.
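The broker-identification step can be sketched as follows. The hashtag sets and tweets below are invented placeholders; the study's actual classification drew on each movement's full hashtag vocabulary.

```python
from collections import defaultdict

# Hypothetical hashtag sets and tweets, for illustration only.
OCCUPY_TAGS = {"#ows", "#occupy", "#occupywallstreet"}
INDIGNADOS_TAGS = {"#15m", "#12m15m", "#spanishrevolution"}

tweets = [
    {"user": "alice", "text": "Solidarity across the Atlantic #ows #15m"},
    {"user": "bob", "text": "Marching downtown today #occupywallstreet"},
    {"user": "carol", "text": "One year on #spanishrevolution"},
]

def movements(text):
    """Return the set of movements whose hashtags appear in the tweet."""
    tags = {w for w in text.lower().split() if w.startswith("#")}
    found = set()
    if tags & OCCUPY_TAGS:
        found.add("occupy")
    if tags & INDIGNADOS_TAGS:
        found.add("indignados")
    return found

# A user is a broker if their tweets, taken together, use hashtags from both movements.
user_movements = defaultdict(set)
for tweet in tweets:
    user_movements[tweet["user"]] |= movements(tweet["text"])

brokers = {user for user, m in user_movements.items() if len(m) == 2}
```

In this toy dataset only "alice" bridges the two streams, either within a single tweet (as here) or across several.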

Analysis of the Twitter data shows that this small minority of ‘brokers’ play an important role connecting users to a network that would otherwise be disconnected. Brokers are significantly more active in the contribution of messages and more visible in the stream of information, being re-tweeted and mentioned more often than other users. The analysis also shows that these brokers play an important role in the global network, by helping to keep the network together and improving global connectivity. In a simulation, the removal of brokers fragmented the network faster than the removal of random users at the same rate.
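The robustness simulation can be illustrated with a toy network in pure Python: count connected components after removing the brokers versus removing the same number of other users. The chain graph below, with two hypothetical "broker" nodes bridging otherwise separate clusters, is a deliberate simplification of the real retweet/mention network.

```python
from collections import deque

# Toy network: two clusters joined only through hypothetical broker nodes.
edges = [
    ("a", "b"), ("b", "c"), ("c", "broker1"), ("broker1", "d"),
    ("d", "e"), ("e", "f"), ("f", "broker2"), ("broker2", "g"),
]
nodes = {n for edge in edges for n in edge}

def component_count(keep, edges):
    """Number of connected components among the `keep` nodes, via BFS."""
    adj = {n: set() for n in keep}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, count = set(), 0
    for start in keep:
        if start in seen:
            continue
        count += 1
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            queue.extend(adj[node] - seen)
    return count

# Removing the brokers fragments the network more than removing other users.
frag_brokers = component_count(nodes - {"broker1", "broker2"}, edges)
frag_random = component_count(nodes - {"a", "g"}, edges)
```

Here removing the two brokers leaves three disconnected pieces, while removing two peripheral users leaves the network intact, mirroring in miniature the faster fragmentation observed in the simulation.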

What does this tell us about global networks of protest? Firstly, it is clear that global networks are more vulnerable and fragile than is often assumed. Only a small percentage of users disseminate information across transnational divides, and if any of these users cease to perform this role, they are difficult to immediately replace, thus limiting the assumed fluidity of such networks. The decentralized nature of online networks, with no central authority imposing order or even suggesting a common strategy, make the role of ‘brokers’ all the more vital to the survival of networks which cross national borders.

Secondly, the central role performed by brokers suggests that global networks of online protest lack the ‘horizontal’ structure that is often described in the literature. Talking about horizontal structures can be useful as shorthand to refer to decentralised organisations, but not to analyse the process by which these organisations materialise in communication networks. The distribution of users in those networks reveals a strong hierarchy in terms of connections and the ability to communicate effectively.

Future research into online networks, then, should keep in mind that the language of protest networks in the digital age, particularly terms like horizontality and fluidity, do not necessarily stand up to empirical scrutiny. The study of contentious politics in the digital age should be evaluated, first and foremost, through the lens of what protesters actually reveal through their actions.


Read the paper: Sandra Gonzalez-Bailon and Ning Wang (2013) The Bridges and Brokers of Global Campaigns in the Context of Social Media.

]]>
How accessible are online legislative data archives to political scientists? https://ensr.oii.ox.ac.uk/how-accessible-are-online-legislative-data-archives-to-political-scientists/ https://ensr.oii.ox.ac.uk/how-accessible-are-online-legislative-data-archives-to-political-scientists/#comments Mon, 03 Jun 2013 12:07:40 +0000 http://blogs.oii.ox.ac.uk/policy/?p=654 House chamber of the Utah State Legislature
A view inside the House chamber of the Utah State Legislature. Image by deltaMike.

Public demands for transparency in the political process have long been a central feature of American democracy, and recent technological improvements have considerably facilitated the ability of state governments to respond to such public pressures. With online legislative archives, state legislatures can make available a large number of public documents. In addition to meeting the demands of interest groups, activists, and the public at large, these websites enable researchers to conduct single-state studies, cross-state comparisons, and longitudinal analysis.

While online legislative archives are, in theory, rich sources of information that save researchers valuable time as they gather data across the states, in practice, government agencies are rarely completely transparent, often do not provide clear instructions for accessing the information they store, seldom use standardized norms, and can overlook user needs. These obstacles to state politics research are longstanding: Malcolm Jewell noted almost three decades ago the need for “a much more comprehensive and systematic collection and analysis of comparative state political data.” While the growing availability of online legislative resources helps to address the first problem of collection, the limitations of search and retrieval functions remind us that the latter remains a challenge.

The fifty state legislative websites are quite different; few of them are intuitive or adequately transparent, and there is no standardized or systematic process to retrieve data. For many states, it is not possible to identify issue-specific bills that are introduced and/or passed during a specific period of time, let alone the sponsors or committees, without reading the full text of each bill. For researchers who are interested in certain time periods, policy areas, committees, or sponsors, the inability to set filters or immediately see relevant results limits their ability to efficiently collect data.

Frustrated by the obstacles we faced in undertaking a study of state-level immigration legislation before and after September 11, 2001, we decided to instead  evaluate each state legislative website — a “state of the states” analysis — to help scholars who need to understand the limitations of the online legislative resources they may want to use. We evaluated three main dimensions on an eleven-point scale: (1) the number of searchable years; (2) the keyword search filters; and (3) the information available on the immediate results pages. The number of searchable sessions is crucial for researchers interested in longitudinal studies, before/after comparisons, other time-related analyses, and the activity of specific legislators across multiple years. The “search interface” helps researchers to define, filter, and narrow the scope of the bills—a particularly important feature when keywords can generate hundreds of possibilities. The “results interface” allows researchers to determine if a given bill is relevant to a research project.
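As a rough sketch, the eleven-point rubric might be coded like this. The three dimensions match those described above, but the per-dimension caps (3 + 4 + 4 = 11) are our assumption for illustration, not necessarily the paper's exact weighting.

```python
# Hypothetical coding of the eleven-point rubric: searchable years,
# search-filter features, and results-page information. The caps of
# 3 + 4 + 4 = 11 are an illustrative assumption.

def usability_score(searchable_years, search_filter_features, results_page_features):
    """Combine the three dimensions into a 0-11 usability score."""
    return (
        min(searchable_years, 3)
        + min(search_filter_features, 4)
        + min(results_page_features, 4)
    )

# Example: a site searchable back many sessions, but with crude keyword-only search.
example = usability_score(searchable_years=10, search_filter_features=1,
                          results_page_features=2)
```

Capping each dimension keeps a site strong on one axis (e.g. deep archives) from masking weakness on the others, which matches the spirit of a multi-dimensional rubric.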

Our paper builds on the work of other scholars and organizations interested in state policy. To help begin a centralized space for data collection, Kevin Smith and Scott Granberg-Rademacker publicly invited “researchers to submit descriptions of data sources that were likely to be of interest to state politics and policy scholars,” calling for “centralized, comprehensive, and reliable datasets” that are easy to download and manipulate. In this spirit, Jason Sorens, Fait Muedini, and William Ruger introduced a free database that offered a comprehensive set of variables involving over 170 public policies at the state and local levels in order to “reduce reduplication of scholarly effort.” The National Conference of State Legislatures (NCSL) provides links to state legislatures, bill lists, constitutions, reports, and statutes for all fifty states. The State Legislative History Research Guides compiled by the Indiana University School of Law also include links to legislative and historical resources for the states, such as the Legislative Reference Library of Texas. However, to our knowledge, no existing resource assesses usability across all state websites.

So, what did we find during our assessment of the state websites? In general, we observed that the archival records as well as the search and results functions leave considerable room for improvement. The maximum possible score was 11 in each year, and the average was 3.87 in 2008 and 4.25 in 2010. For researchers interested in certain time periods, policy areas, committees, or sponsors, the inability to set filters, immediately see relevant results, and access past legislative sessions limits their ability to complete projects in a timely manner (or at all). We also found a great deal of variation in site features, content, and navigation. Greater standardization would improve access to information about state policymaking by researchers and the general public—although some legislators may well see benefits to opacity.

While we noted some progress over the study period, not all change was positive. By 2010, two states had scored 10 points (no state scored the full 11), and fewer states had very low scores, suggesting slow but steady improvement and the provision of a baseline of support for researchers. However, a quarter of the states’ scores dropped over the study period, for the most part reflecting the adoption of “Powered by Google” search tools that used only keywords, some in a very limited manner. If the latter becomes a trend, we could see websites becoming less, not more, user friendly in the future.

In addition, our index may serve as a proxy variable for state government transparency. While the website scores were not statistically associated with Robert Erikson, Gerald Wright, and John McIver’s measure of state ideology, there may nevertheless be promise for future research along these lines; additional transparency determinants worth testing include legislative professionalism and social capital. Moving forward, the states might consider creating a working group to share ideas and best practices, perhaps through an organization like the National Conference of State Legislatures, rather than the national government, as some states might resist leadership from D.C. on federalist grounds.

Helen Margetts (2009) has noted that “The Internet has the capacity to provide both too much (which poses challenges to analysis) and too little data (which requires innovation to fill the gaps).” It is notable, and sometimes frustrating, that state legislative websites illustrate both dynamics. As datasets come online at an increasing rate, it is also easy to forget that websites can vary in terms of user friendliness, hierarchical structure, search terms and functions, terminology, and navigability — causing unanticipated methodological and data capture problems (i.e. headaches) to scholars working in this area.


Read the full paper: Taofang Huang, David Leal, B.J. Lee, and Jill Strube (2012) Assessing the Online Legislative Resources of the American States. Policy and Internet 4 (3-4).

]]>
https://ensr.oii.ox.ac.uk/how-accessible-are-online-legislative-data-archives-to-political-scientists/feed/ 1
Crowdsourcing translation during crisis situations: are ‘real voices’ being excluded from the decisions and policies it supports? https://ensr.oii.ox.ac.uk/crowdsourcing-translation-during-crisis-situations-are-real-voices-being-excluded-from-the-decisions-and-policies-it-supports/ Tue, 07 May 2013 08:58:47 +0000 http://blogs.oii.ox.ac.uk/policy/?p=957 As revolution spread across North Africa and the Middle East in 2011, participants and observers of the events were keen to engage via social media. However, saturation by Arabic-language content demanded a new translation strategy for those outside the region to follow the information flows — and for those inside to reach beyond their domestic audience. Crowdsourcing was seen as the most efficient strategy in terms of cost and time to meet the demand, and translation applications that harnessed volunteers across the internet were integrated with nearly every type of ICT project. For example, as Steve Stottlemyre has already mentioned on this blog, translation played a part in tools like the Libya Crisis Map, and was essential for harnessing tweets from the region’s ‘voices on the ground.’

If you have ever worried about media bias then you should really worry about the impact of translation. Before the revolutions, the translation software for Egyptian Arabic was almost non-existent. Few translation applications were able to handle the different Arabic dialects or supply coding labor and capital to build something that could contend with internet blackouts. Google’s Speak to Tweet became the dominant application used in the Egyptian uprisings, delivering one homogenized source of information that fed the other sources. In 2011, this collaboration helped circumvent the problem of Internet connectivity in Egypt by allowing cellphone users to call their tweet into a voicemail to be transcribed and translated. A crowd of volunteers working for Twitter enhanced translation of Egyptian Arabic after the Tweets were first transcribed by a Mechanical Turk application trained from an initial 10 hours of speech.

The unintended consequence of these crowdsourcing applications was that when the material crossed the language barrier into English, it often became inaccessible to the original contributors. Individuals on the ground essentially ceded authorship to crowds of untrained volunteer translators who stripped the information of context, and then plotted it in categories and on maps without feedback from original sources. Controlling the application meant controlling the information flow, the lens through which the revolutions were conveyed to the outside world.

This flawed system prevented the original sources (e.g. in Libya) from interacting with the information that directly related to their own life-threatening situation, while the information became an unsound basis for decision-making by international actors. As Stottlemyre describes, ceding authorship was sometimes an intentional strategy, but also one imposed by the nature of the language/power imbalance and the failure of the translation applications and the associated projects to incorporate feedback loops or more two-way communication.

The after action report for the Libya Crisis Map project, commissioned by UN OCHA, offers some insight into how sources were disenfranchised from the decision-making process once they had provided information for the end product: the crisis map. In the final ‘best practices’ section reviewing the outcomes, the Standby Task Force, which created the map, described decision-makers and sources, but did not consider or mention the sources’ access to decision-making, the map, or any mechanism by which they could feed back into the decision-making chain. In essence, Libyans were not seen as part of the user group of the product they helped create.

How exactly do translation and crowdsourcing shape our understanding of complex developing crises, or influence subsequent policy decisions? The SMS polling initiative launched by Al Jazeera English in collaboration with Ushahidi, a prominent crowdsourcing platform, illustrates the most common process of visualizing crisis information: translation, categorization, and mapping. In December 2011, Al Jazeera launched Somalia Speaks, with the aim of giving a voice to the people of Somalia and sharing a picture of how violence was impacting everyday lives. The two have since repeated this project in Mali, to share opinions about the military intervention in the north. While Al Jazeera is a news organization, not a research institute or a government actor, it plays an important role in informing electorates who can put political pressure on governments involved in the conflict. Furthermore, this same type of technology is being used on the ground to gather information in crisis situations at the governmental and UN levels.

A call for translators in the diaspora, particularly Somali student groups, was issued online, and phones were distributed on the ground throughout Somalia so multiple users could participate. The volunteers translated the SMSs and categorized the content as either political, social, or economic. The results were color-coded and aggregated on a map.

SMS-translation

The stated goal of the project was to give a voice to the Somali people, but the Somalis who participated had no say in how their voices were categorized or depicted on the map. The SMS poll asked an open question:

How has the Somalia conflict affected your life?

One example response read:

The Bosaso Market fire has affected me. It happened on Saturday.

The response was categorized as ‘social.’ But why didn’t the fact that violence happened in a market, an economic centre, denote ‘economic’ categorization? There was no guidance for maintaining consistency among the translators, nor any indication of how the information would be used later. It was these categories chosen by the translators, represented as bright colorful circles on the map, which were speaking to the world, not the Somalis — whose voices had been lost through a crowdsourcing application that was designed with a language barrier. The primary sources could not suggest another category that better suited the intentions of their responses, nor did they understand the role categories would play in representing and visualizing their responses to the English language audience.

Somalia Crisis Map

An 8 December 2011 comment on the Ushahidi blog described in compelling terms how language and control over information flow impact the power balance during a conflict:

A—-, My friend received the message from you on his phone. The question says “tell us how is conflict affecting your life” and “include your name of location”. You did not tell him that his name will be told to the world. People in Somalia understand that sms is between just two people. Many people do not even understand the internet. The warlords have money and many contacts. They understand the internet. They will look at this and they will look at who is complaining. Can you protect them? I think this project is not for the people of Somalia. It is for the media like Al Jazeera and Ushahidi. You are not from here. You are not helping. It is better that you stay out.

Ushahidi director Patrick Meier, responded to the comment:

Patrick: Dear A—-, I completely share your concern and already mentioned this exact issue to Al Jazeera a few hours ago. I’m sure they’ll fix the issue as soon as they get my message. Note that the question that was sent out does *not* request people to share their names, only the name of their general location. Al Jazeera is careful to map the general location and *not* the exact location. Finally, Al Jazeera has full editorial control over this project, not Ushahidi.

As of 14 January 2012, there were still names featured on the Al Jazeera English website.

The danger is that these categories — economic, political, social — become the framework for aid donations and policy endeavors; the application frames the discussion rather than the words of the Somalis. The simplistic categories become the entry point for policy-makers and citizens alike to understand and become involved with translated material. But decisions and policies developed from the translated information are less connected to ‘real voices’ than we would like to believe.

Developing technologies so that Somalis or Libyans — or any group sharing information via translation — are themselves directing the information flow about the future of their country should be the goal, rather than perpetual simplification into the client / victim that is waiting to be given a voice.

]]>
Did Libyan crisis mapping create usable military intelligence? https://ensr.oii.ox.ac.uk/did-libyan-crisis-mapping-create-usable-military-intelligence/ Thu, 14 Mar 2013 10:45:22 +0000 http://blogs.oii.ox.ac.uk/policy/?p=817 The Middle East has recently witnessed a series of popular uprisings against autocratic rulers. In mid-January 2011, Tunisian President Zine El Abidine Ben Ali fled his country, and just four weeks later, protesters overthrew the regime of Egyptian President Hosni Mubarak. Yemen’s government was also overthrown in 2011, and Morocco, Jordan, and Oman saw significant governmental reforms leading, if only modestly, toward the implementation of additional civil liberties.

Protesters in Libya called for their own ‘day of rage’ on February 17, 2011, marked by violent protests in several major cities, including the capital, Tripoli. As they transformed from ‘protesters’ into ‘opposition forces’, they began pushing information onto Twitter, Facebook, and YouTube, reporting their firsthand experiences of what had turned into a civil war virtually overnight. The evolving humanitarian crisis prompted the United Nations to request the creation of the Libya Crisis Map, which was made public on March 6, 2011. Other, more focused crisis maps followed, and were widely distributed on Twitter.

While the map was initially populated with humanitarian information pulled from the media and online social networks, as the imposition of an internationally enforced No Fly Zone (NFZ) over Libya became imminent, information began to appear on it that appeared to be of a tactical military nature. While many people continued to contribute conventional humanitarian information to the map, the sudden shift toward information that could aid international military intervention was unmistakable.

How useful was this information, though? Agencies in the U.S. Intelligence Community convert raw data into usable information (incorporated into finished intelligence) by utilizing some form of the Intelligence Process. As outlined in the U.S. military’s joint intelligence manual, this consists of six interrelated steps, all centered on a specific mission. Interestingly, many Twitter users, though perhaps unaware of the intelligence process, replicated each step during the Libyan civil war, producing finished intelligence adequate for consumption by NATO commanders and rebel leadership.

It was clear from the beginning of the Libyan civil war that very few people knew exactly what was happening on the ground. Even NATO, according to one of the organization’s spokesmen, lacked the ground-level informants necessary to get a full picture of the situation in Libya. There is no public information about the extent to which military commanders used information from crisis maps during the Libyan civil war. According to one NATO official, “Any military campaign relies on something that we call ‘fused information’. So we will take information from every source we can… We’ll get information from open source on the internet, we’ll get Twitter, you name any source of media and our fusion centre will deliver all of that into useable intelligence.”

The data in these crisis maps came from a variety of sources, including journalists, official press releases, and civilians on the ground who updated blogs and/or maintained telephone contact. The @feb17voices Twitter feed (translated into English and used to support the creation of The Guardian’s and the UN’s Libya Crisis Map) included accounts of live phone calls from people on the ground in areas where the Internet was blocked, and where there was little or no media coverage. Twitter users began compiling data and information; they tweeted and retweeted data they collected, information they filtered and processed, and their own requests for specific data and clarifications.

Information from various Twitter feeds was then published in detailed maps of major events that contained information pertinent to military and humanitarian operations. For example, as fighting intensified, @LibyaMap’s updates began to provide a general picture of the battlefield, including specific, sourced intelligence about the progress of fighting, humanitarian and supply needs, and the success of some NATO missions. Although it did not explicitly state its purpose as spreading mission-relevant intelligence, the nature of the information renders alternative motivations highly unlikely.

Interestingly, the Twitter users featured in a June 2011 article in The Guardian had already explicitly expressed their intention of affecting military outcomes in Libya by providing NATO forces with specific geographical coordinates to target Qadhafi regime forces. We could speculate at this point about the extent to which the Intelligence Community might have guided Twitter users to participate in the intelligence process; while NATO and the Libyan Opposition issued no explicit intelligence requirements to the public, they tweeted stories about social network users trying to help NATO, likely leading their online supporters to draw their own conclusions.

It appears from similar maps created during the ongoing uprisings in Syria that the creation of finished intelligence products by crisis mappers may become a regular occurrence. Future study should focus on determining the motivations of mappers for collecting, processing, and distributing intelligence, particularly as a better understanding of their motivations could inform research on the ethics of crisis mapping. It is reasonable to believe that some (or possibly many) crisis mappers would be averse to their efforts being used by military commanders to target “enemy” forces and infrastructure.

Indeed, some are already questioning the direction of crisis mapping in the absence of professional oversight (Global Brief 2011): “[If] crisis mappers do not develop a set of best practices and shared ethical standards, they will not only lose the trust of the populations that they seek to serve and the policymakers that they seek to influence, but (…) they could unwittingly increase the number of civilians being hurt, arrested or even killed without knowing that they are in fact doing so.”


Read the full paper: Stottlemyre, S., and Stottlemyre, S. (2012) Crisis Mapping Intelligence Information During the Libyan Civil War: An Exploratory Case Study. Policy and Internet 4 (3-4).

Experiments are the most exciting thing on the UK public policy horizon https://ensr.oii.ox.ac.uk/experiments-are-the-most-exciting-thing-on-the-uk-public-policy-horizon/ Thu, 28 Feb 2013 10:20:29 +0000 http://blogs.oii.ox.ac.uk/policy/?p=392
Iraq War protesters in Trafalgar Square, London
What makes people join political actions? Iraq War protesters crowd Trafalgar Square in February 2007. Image by DavidMartynHunt.
Experiments – or more technically, Randomised Control Trials – are the most exciting thing on the UK public policy horizon. In 2010, the incoming Coalition Government set up the Behavioural Insights Team in the Cabinet Office to find innovative and cost effective (cheap) ways to change people’s behaviour. Since then the team have run a number of exciting experiments with remarkable success, particularly in terms of encouraging organ donation and timely payment of taxes. With Bad Science author Ben Goldacre, they have now published a Guide to RCTs, and plenty more experiments are planned.

This sudden enthusiasm for experiments in the UK government is very exciting. The Behavioural Insights Team is the first of its kind in the world – in the US there are few experiments at federal level, although there have been a few well-publicised ones at local level – and the UK government has always been rather scared of the concept before, there being a number of cultural barriers to the very word ‘experiment’ in British government. Experiments came to the fore in the previous Administration’s Mindspace document. But what made them popular for public policy may well have been the 2008 book Nudge by Thaler and Sunstein, which shows that by knowing how people think, it is possible to design choice environments that make it “easier for people to choose what is best for themselves, their families, and their society.” Since then, the political scientist Peter John has published ‘Nudge, Nudge, Think, Think’, which has received positive coverage in The Economist (‘The use of behavioural economics in public policy shows promise’) and the Financial Times (‘Nudge, nudge. Think, think. Say no more …’), and has been reviewed by the LSE Review of Books (‘Nudge, Nudge, Think, Think: experimenting with ways to change civic behaviour’).

But there is one thing missing here. Very few of these experiments use manipulation of information environments on the internet as a way to change people’s behaviour. The Internet seems to hold enormous promise for ‘Nudging’ by redesigning ‘choice environments’, yet Thaler and Sunstein’s book hardly mentions it, and none of the BIT’s experiments so far have used the Internet, although a new experiment looking at ways of encouraging court attendees to pay fines is based on text messages.

So, at the Oxford Internet Institute we are doing something about that. At OxLab, an experimental laboratory for the social sciences run by the OII and Said Business School, we are running online experiments to test the impact of various online platforms on people’s behaviour. So, for example, two reports for the UK National Audit Office: Government on the Internet (2007) and Communicating with Customers (2009) carried out by a joint OII-LSE team used experiments to see how people search for and find government-internet related information. Further experiments investigated the impact of various types of social influence, particularly social information about the behaviour of others and visibility (as opposed to anonymity), on the propensity of people to participate politically.

And the OII-edited journal Policy and Internet has been a good venue for experimentalists to publicise their work. So, Stephan Grimmelikhuijsen’s paper Transparency of Public Decision-Making: Towards Trust in Local Government? (Policy & Internet 2010; 2:1) reports an experiment to see if transparency (relating to decision-making by local government) actually leads to higher levels of trust. Interestingly, his results indicated that participants exposed to more information (in this case, full council minutes) were significantly more negative regarding the perceived competence of the council compared to those who did not access all the available information. Additionally, participants who received restricted information about the minutes thought the council was less honest compared to those who did not read them.

Uncovering the structure of online child exploitation networks https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/ https://ensr.oii.ox.ac.uk/uncovering-the-structure-of-online-child-exploitation-networks/#comments Thu, 07 Feb 2013 10:11:17 +0000 http://blogs.oii.ox.ac.uk/policy/?p=661 The Internet has provided the social, individual, and technological circumstances needed for child pornography to flourish. Sex offenders have been able to utilize the Internet for dissemination of child pornographic content, for social networking with other pedophiles through chatrooms and newsgroups, and for sexual communication with children. A 2009 United Nations report estimates that there are more than four million websites containing child pornography, with 35 percent of them depicting serious sexual assault [1]. Even if this report or others exaggerate the true prevalence of those websites by a wide margin, the fact of the matter is that such websites are pervasive on the World Wide Web.

Despite large investments of law enforcement resources, online child exploitation is nowhere near under control, and while there are numerous technological products to aid in finding child pornography online, they still require substantial human intervention. Nevertheless, steps can be taken to further automate these searches, reducing the amount of content police officers have to examine and increasing the time they can spend investigating individuals.

While law enforcement agencies will aim for maximum disruption of online child exploitation networks by targeting the most connected players, there is a general lack of research on the structural nature of these networks; something we aimed to address in our study, by developing a method to extract child exploitation networks, map their structure, and analyze their content. Our custom-written Child Exploitation Network Extractor (CENE) automatically crawls the Web from a user-specified seed page, collecting information about the pages it visits by recursively following the links out of the page; the result of the crawl is a network structure containing information about the content of the websites, and the linkages between them [2].

We chose ten websites as starting points for the crawls; four were selected from a list of known child pornography websites, while the other six were selected and verified through Google searches using child pornography search terms. To guide the network extraction process we defined a set of 63 keywords, which included words commonly used by the Royal Canadian Mounted Police to find illegal content; most of them code words used by pedophiles. Websites included in the analysis had to contain at least seven of the 63 unique keywords on a given web page; manual verification showed us that seven keywords distinguished well between child exploitation web pages and regular web pages. Ten sports networks were analyzed as a control.
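CENE itself is not public, but the seven-of-63 inclusion rule described above is simple enough to sketch. The snippet below is an illustration only, with `KEYWORDS` as harmless placeholder strings standing in for the actual (unpublished) 63-term list; the function and variable names are assumptions, not the authors' code.

```python
import re

# Placeholder stand-ins for the real 63-keyword list, which is not public.
KEYWORDS = {f"keyword{i:02d}" for i in range(63)}
THRESHOLD = 7  # minimum number of distinct keywords on a single page

def count_unique_keywords(page_text, keywords=KEYWORDS):
    """Count how many distinct keywords occur anywhere in the page text."""
    words = set(re.findall(r"[a-z0-9']+", page_text.lower()))
    return len(words & keywords)

def include_in_analysis(page_text, threshold=THRESHOLD):
    """Apply the seven-of-63 inclusion rule to a single page."""
    return count_unique_keywords(page_text) >= threshold
```

A crawler would apply `include_in_analysis` to each page it fetches and only follow links out of pages that pass, which is what confines the extracted network to the suspect sites.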

The web crawler proved able to properly identify child exploitation websites, with a clear difference found in the hardcore content hosted by child exploitation and non-child exploitation websites. Our results further suggest that a ‘network capital’ measure — which takes into account network connectivity, as well as severity of content — could aid in identifying the key players within online child exploitation networks. These websites are the main concern of law enforcement agencies, making the web crawler a time saving tool in target prioritization exercises. Interestingly, while one might assume that website owners would find ways to avoid detection by a web crawler of the type we have used, these websites — despite the fact that much of the content is illegal — turned out to be easy to find. This fits with previous research that has found that only 20-25 percent of online child pornography arrestees used sophisticated tools for hiding illegal content [3,4].
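The exact ‘network capital’ formula is given in the paper rather than here, but the underlying idea — connectivity weighted by content severity — can be illustrated with a toy scoring function. The normalisation and names below are assumptions for illustration, not the authors' actual measure.

```python
def network_capital(in_degree, severity, max_in_degree, max_severity):
    """Toy score: a site ranks highly when it is both well connected and
    hosts severe content. Both factors are normalised to [0, 1]."""
    return (in_degree / max_in_degree) * (severity / max_severity)

def rank_sites(sites):
    """Rank {name: (in_degree, severity)} by descending network capital."""
    max_deg = max(d for d, _ in sites.values())
    max_sev = max(s for _, s in sites.values())
    return sorted(sites,
                  key=lambda n: network_capital(*sites[n], max_deg, max_sev),
                  reverse=True)
```

A multiplicative combination of this kind captures the prioritisation logic described above: a highly linked hub with severe content outranks both a well-connected but benign site and a severe but isolated one.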

As mentioned earlier, the huge amount of content found on the Internet means that the likelihood of eradicating the problem of online child exploitation is nil. As the decentralized nature of the Internet makes combating child exploitation difficult, it becomes more important to introduce new methods to address this. Social network analysis measurements, in general, can be of great assistance to law enforcement investigating all forms of online crime—including online child exploitation. By creating a web crawler that reduces the amount of hours officers need to spend examining possible child pornography websites, and determining whom to target, we believe that we have touched on a method to maximize the current efforts by law enforcement. An automated process has the added benefit of helping to keep officers in the department longer, as they would not be subjected to as much traumatic content.

There are still areas for further research; the first step being to further refine the web crawler. Despite being a considerable improvement over a manual analysis of 300,000 web pages, it could be improved to allow for efficient analysis of larger networks, bringing us closer to the true size of the full online child exploitation network, but also, we expect, to some of the more hidden (e.g., password/membership protected) websites. This does not negate the value of researching publicly accessible websites, given that they may be used as starting locations for most individuals.

Much of the law enforcement effort to date has focused on investigating images, primarily because databases of hash values (used to authenticate the content) exist for images but not for videos. Our web crawler did not examine image content itself, but utilizing known hash values would help improve the validity of our severity measurement. Although it would be naïve to suggest that online child exploitation can be completely eradicated, the sorts of social network analysis methods described in our study provide a means of understanding the structure (and therefore key vulnerabilities) of online networks; in turn, greatly improving the effectiveness of law enforcement.
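Hash-matching of this kind is straightforward to sketch. The example below is hypothetical: real law-enforcement databases typically store MD5 or SHA-1 digests (and increasingly perceptual hashes that survive re-encoding), and the byte strings here are obviously placeholders.

```python
import hashlib

def image_hash(image_bytes):
    """Digest of the raw file bytes; identical files yield identical hashes."""
    return hashlib.sha256(image_bytes).hexdigest()

# Hypothetical database of digests of previously authenticated images.
KNOWN_HASHES = {image_hash(b"bytes-of-a-known-image")}

def is_known_image(image_bytes):
    """Check a crawled image against the known-hash database."""
    return image_hash(image_bytes) in KNOWN_HASHES
```

Note that exact digests only match byte-identical files; a re-saved or resized copy would be missed, which is why perceptual hashing is an active area of development.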

[1] Engeler, E. 2009. September 16. UN Expert: Child Porn on Internet Increases. The Associated Press.

[2] Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

[3] Carr, J. 2004. Child Abuse, Child Pornography and the Internet. London: NCH.

[4] Wolak, J., D. Finkelhor, and K.J. Mitchell. 2005. “Child Pornography Possessors Arrested in Internet-Related Crimes: Findings from the National Juvenile Online Victimization Study (NCMEC 06–05–023).” Alexandria, VA: National Center for Missing and Exploited Children.


Read the full paper: Westlake, B.G., Bouchard, M., and Frank, R. 2012. Finding the Key Players in Online Child Exploitation Networks. Policy and Internet 3 (2).

Slicing digital data: methodological challenges in computational social science https://ensr.oii.ox.ac.uk/slicing-digital-data-methodological-challenges-in-computational-social-science/ Wed, 30 May 2012 10:45:26 +0000 http://blogs.oii.ox.ac.uk/policy/?p=337 One of the big social science questions is how our individual actions aggregate into collective patterns of behaviour (think crowds, riots, and revolutions). This question has so far been difficult to tackle due to a lack of appropriate data, and the complexity of the relationship between the individual and the collective. Digital trails are allowing social scientists to understand this relationship better.

Small changes in individual actions can have large effects at the aggregate level; this opens up the potential for drawing incorrect conclusions about generative mechanisms when only aggregated patterns are analysed, as Schelling aimed to show in his classic example of racial segregation. 

Part of the reason why it has been so difficult to explore this connection between the individual and the collective — and the unintended consequences that arise from that connection — is lack of proper empirical data, particularly around the structure of interdependence that links individual actions. This relational information is what digital data is now providing; however, they present some new challenges to the social scientist, particularly those who are used to working with smaller, cross-sectional datasets. Suddenly, we can track and analyse the interactions of thousands (if not millions) of people with a time resolution that can go down to the second. The question is how to best aggregate that data and deal with the time dimension.

Interactions take place in continuous time; however, most digital interactions are recorded as events (i.e. sending or receiving messages), and different network structures emerge when those events are aggregated according to different windows (i.e. days, weeks, months). We still don’t have systematic knowledge on how transforming continuous data into discrete observation windows affects the networks of interaction we analyse. Reconstructing interpersonal networks (particularly longitudinal network data) used to be extremely time consuming and difficult; now it is relatively easy to obtain that sort of network data, but modelling and analysing them is still a challenge.
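The windowing problem can be made concrete: the same stream of timestamped events yields different networks depending on the observation window chosen. A minimal sketch, with the event format `(timestamp, sender, receiver)` assumed for illustration:

```python
from collections import defaultdict

def window_networks(events, window):
    """Aggregate timestamped directed events (t, sender, receiver) into
    one edge set per discrete observation window of length `window`."""
    nets = defaultdict(set)
    for t, sender, receiver in events:
        nets[t // window].add((sender, receiver))
    return dict(nets)
```

Running the same event list through `window_networks` with a short and a long window produces networks with different numbers of ties and components, which is precisely the sensitivity to aggregation choices that the paragraph above describes.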

Another problem faced by social scientists using digital data is that most social networks are multiplex in nature, that is, we belong to many different networks that interact and affect each other by means of feedback effects: How do all these different network structures co-evolve? If we only focus on one network, such as Twitter, we lose information about how activity in other networks (like Facebook, or email, or offline communication) is related to changes in the network we observe. In our study on the Spanish protests, we only track part of the relevant activity: we have a good idea of what was happening on Twitter, but there were obviously lots of other communication networks simultaneously having an influence on people’s behaviour. And while it is exciting as a social scientist to be able to access and analyse huge quantities of detailed data about social movements as they happen, the Twitter network only provides part of the picture.

Finally, when analysing the cascading effects of individual actions there is also the challenge of separating out the effects of social influence and self-selection. Digital data allow us to follow cascading behaviour with better time resolution, but the observational data usually do not help discriminate whether people behave similarly because they influence and follow each other or because they share similar attributes and motivations. Social scientists need to find ways of controlling for this self-selection in online networks; although digital data often lack the demographic information that would allow applying this control, digital technologies are also helping researchers conduct experiments that pin down the effects of social influence.

Digital data is allowing social scientists to pose questions that couldn’t be answered before. However, there are many methodological challenges that need solving. This talk considers a few, emphasising that strong theoretical motivations should still direct the questions we pose to digital data.

Further reading:

Gonzalez-Bailon, S., Borge-Holthoefer, J. and Moreno, Y. (2013) Broadcasters and Hidden Influentials in Online Protest Diffusion. American Behavioral Scientist (forthcoming).

Gonzalez-Bailon, S., Wang, N., Rivero, A., Borge-Holthoefer, J., and Moreno, Y. (2012) Assessing the Bias in Communication Networks Sampled from Twitter. Working Paper.

Gonzalez-Bailon, S., Borge-Holthoefer, J., Rivero, A. and Moreno, Y. (2011) The Dynamics of Protest Recruitment Through an Online Network. Scientific Reports 1, 197. DOI: 10.1038/srep00197

González-Bailón, S., Kaltenbrunner, A. and Banchs, R.E. (2010) The Structure of Political Discussion Networks: A Model for the Analysis of Online Deliberation. Journal of Information Technology 25 (2) 230-243.
