
Postcode sector counts of alcohol points of sale from OpenStreetMap data

I have a new article out in the journal Health & Place entitled OpenStreetMap data for alcohol research: Reliability assessment and quality indicators, written in conjunction with a number of people here at the OII and elsewhere. My colleague David Humphreys at SPI got me interested in the area when he told me about how difficult it was to construct local area indicators of alcohol availability in the UK, and how this was hampering research in the field. I wanted to see whether data in OpenStreetMap could fix the problem, as in general I’m pretty interested in the extent to which web data can be used as a valid proxy measurement for real life quantities of interest. Stefano de Sabbata, Sumin Lee and Bharath Ganesh all contributed to the analysis.

We did a few different things in the article: we conducted a validation of a random sample of 2,000 licenses we knew to exist, we used OSM data to replicate a previous study in the area (E.A. Richardson et al. 2015. Is local alcohol outlet density related to alcohol-related morbidity and mortality in Scottish cities? Health & Place, 33, 172-180), and we used a technique developed by Stefano to measure the ‘quality’ of OSM data in a given area. We showed that OSM is about 50% complete in terms of the amount of data it contains (in the specific case of alcohol licenses), and also that we could use the quality indicators to find areas with more complete alcohol data.

Quality of OpenStreetMap Data in Britain

Alongside the article, we are also releasing more general estimates of alcohol outlet prevalence across Britain, which are drawn from OpenStreetMap. We thought they might be of use to other researchers working in the area of the spatial availability of alcohol. They are a simple count of alcohol points of sale within each postcode sector in the UK, according to the data in OSM (see the paper for details of how they were counted). We’re also releasing an accompanying quality metric with each postcode sector so researchers can determine how much the OSM data should be trusted (again see the paper for details on how it is constructed). The spatial distribution of the quality metric in the UK is mapped above. Feel free to reach out to me if you have any questions!

Get the estimates themselves here: OSM GB Alcohol Outlet Counts and Quality Index

The full reference for the paper:

J Bright, S De Sabbata, S Lee, B Ganesh, DK Humphreys. 2018. OpenStreetMap data for alcohol research: Reliability assessment and quality indicators. Health & Place 50, 130-136

and another related paper using the same dataset:

J Bright, S De Sabbata, S Lee. 2018. Geodemographic biases in crowdsourced knowledge websites: Do neighbours fill in the blanks? GeoJournal, 83, 3, 427–440

This research was partially funded by a grant from the ESRC (Grant no. ES/M010058/1).

By |2018-06-12T09:42:28+01:00June 11th, 2018|Research, Smart Cities, Social Web|0 Comments

Understanding news story chains using information retrieval and network clustering techniques

I have a new draft paper out with my colleague Tom Nicholls, entitled Understanding news story chains using information retrieval and network clustering techniques. In it we address what we perceive as an important technical challenge in news media research, which is how to group together articles that all address the same individual news event. This challenge is unmet by most current approaches in unsupervised machine learning as applied to the news, which tend to focus on the broader (also important!) problem of grouping articles in topic categories. It is in general a difficult problem, as we are looking for what are typically small “chains” of content on the same event (e.g. four or five different articles) amongst a corpus of tens of thousands of articles, most of which are unrelated to each other.

Our approach makes use of algorithms and insights drawn from the fields of both information retrieval [IR] and network clustering to develop a novel unsupervised method of news story chain detection. IR techniques (which are used to build things like search engines) in particular have not been much employed in the social sciences, where the focus has been more on machine learning. But these algorithms were much closer to our problem, as connecting small numbers of news stories is quite similar to the task of searching a huge corpus of documents in response to a specific user query.
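To make the intuition concrete, here is a minimal Python sketch of the general idea (not the algorithm from the paper, whose details are in the draft): represent each article as a bag of words, link pairs whose cosine similarity passes a threshold, and read off the connected components of the resulting graph as candidate story chains. The function names and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def story_chains(docs, threshold=0.5):
    """Link article pairs whose similarity exceeds the threshold,
    then return connected components as candidate story chains."""
    vecs = [Counter(d.lower().split()) for d in docs]
    # union-find over article indices
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(docs)), 2):
        if cosine(vecs[i], vecs[j]) >= threshold:
            parent[find(i)] = find(j)
    chains = {}
    for i in range(len(docs)):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())
```

In practice one would use TF-IDF weighting and restrict comparisons to articles published within a short time window of each other, which also keeps the pairwise comparison tractable on large corpora.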

The resulting algorithm works pretty well, though it is very difficult to validate properly because of the nature of the data! We use it to pull out a couple of interesting first order descriptive statistics about news stories in the UK, for example the graphic above shows the typical evolution of news stories after the publication of an initial article.

Just a draft at the moment so all feedback welcome!

By |2018-01-31T13:28:52+00:00January 31st, 2018|News, Python, Research, Social Science Computing|0 Comments

Does Campaigning on Social Media Make a Difference?

I’ve got a new draft paper out with a host of colleagues here at the OII entitled Does Campaigning on Social Media Make a Difference? Evidence from candidate use of Twitter during the 2015 and 2017 UK Elections. There’s an enormous volume of research on the activities of politicians on social media, especially around election time, but not a lot of it has actually addressed whether this activity ‘makes a difference’, i.e. helps to win votes. Part of the reason for this is that measuring ‘campaign effects’ is quite difficult (unless you can convince campaigns themselves to participate in field experiments), and most of the data is purely cross-sectional, which means a host of causality problems in this type of context.

Our study improves the situation by taking advantage of the fact that the UK has recently had two general elections in quick succession, and a considerable proportion of politicians (around 800 in fact) fought in both of them. This allowed us to create a panel dataset of politician social media use (in particular their Twitter activity) and electoral outcomes, which allows for much stronger causal claims (essentially we look at whether a change in the level of social media use by candidates was correlated with a change in vote share outcomes, controlling for factors such as the party they belong to).
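As a toy illustration of the differencing logic (our actual models also control for party and other factors, so treat this as a sketch with made-up function names, not the paper's specification), the core estimate is a regression of the change in vote share on the change in Twitter activity, which removes any time-invariant candidate characteristics:

```python
def first_difference_slope(tweets_2015, tweets_2017, votes_2015, votes_2017):
    """Bivariate least-squares slope on first-differenced panel data:
    change in vote share regressed on change in tweet count."""
    dx = [b - a for a, b in zip(tweets_2015, tweets_2017)]  # change in activity
    dy = [b - a for a, b in zip(votes_2015, votes_2017)]    # change in vote share
    n = len(dx)
    mx, my = sum(dx) / n, sum(dy) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(dx, dy))
    var = sum((x - mx) ** 2 for x in dx)
    return cov / var
```

Because differencing sweeps out anything constant about a candidate or constituency across the two elections, the slope is identified from within-candidate variation rather than cross-sectional comparisons.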

The results were pretty interesting – we found a large amount of Twitter activity, spread throughout the country (see graphics), which supports the idea that social media use is now a normal part of political campaigns. However, the level of effort did vary quite a lot, and this allowed us to explore our key interest: we did indeed find that increasing Twitter activity correlates with increased vote share, even in this pretty strong panel data design. So – some good supporting evidence that politicians aren’t wasting their time on social media!

By |2018-01-31T13:39:14+00:00January 10th, 2018|Politics and Democracy, Social Media|0 Comments

Estimating local commuting patterns from geolocated Twitter data

Over the last decade or so there has been an explosion of research interest in the area of measuring (and forecasting) traffic and commuting patterns. Part of this is driven by ever increasing human mobility: in 2016 alone, people in the UK travelled a collective 800 billion kilometres [PDF], more than 60% of which was by car, and congestion on these networks costs billions of pounds a year. But also driving the research agenda is the emergence of a wide variety of new forms of data (which have built on and supplemented more traditional magnetic loop technologies): data re-purposed from mobile phone records, collected through IoT-enabled smart sensors, or emerging from freely contributed traces on social media platforms. These data sources offer huge potential to improve on existing methods of data collection, such as the hated transport census (see picture).

As part of a research project entitled NEXUS: Real Time Data Fusion and Network Analysis for Urban Systems (funded by InnovateUK), myself and a team of researchers at the OII have been looking into some of these possibilities. Our first paper on the subject, entitled “Estimating Local Commuting Patterns from Geolocated Twitter Data”, has just been published in EPJ Data Science. The paper addresses the extent to which we can make use of geolocated Twitter data to estimate commuting flows between local authorities (you can have a play with some of the underlying data using the map below, which shows census commuting figures and Twitter based estimates for local authorities around Britain).

We draw two main conclusions from the paper. First we show that, making use of heuristics for mapping individuals making geolocated tweets to home and work areas, we can use Twitter to produce accurate representations of the overall structure of commuting in mainland Great Britain; estimates which improve considerably on other ‘low information’ methods of estimating commuting flows (we compared estimates in particular to the popular radiation model). Second, and probably most importantly, we show that these results are not particularly sensitive to demographic characteristics. When looking at commuting flows broken down by gender, age group and social class, we found that Twitter still offered reasonable estimates for all of these sub-categories. We think this is important because a key concern about using social media data for this type of proxy estimation is the extent to which the ‘demographic bias’ in social media users (who are often younger, better educated and wealthier than the population average) might also result in biased predictions (for example, better prediction of the travel patterns of younger people). We show that, at least in our context, this is not the case.
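As a rough illustration of the kind of heuristic involved (the paper's actual rules are more careful; the hour windows and names here are assumptions for the sketch), a user's home area can be taken as the modal location of their night-time tweets and their work area as the modal location of their daytime tweets, with flows then counted over users:

```python
from collections import Counter

def infer_home_work(tweets):
    """Assign a user a home and work area from the modal location of
    their night-time and daytime tweets.
    `tweets` is a list of (hour, area) pairs for one user."""
    night = Counter(area for hour, area in tweets if hour < 8 or hour >= 20)
    day = Counter(area for hour, area in tweets if 9 <= hour < 17)
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work

def commuting_matrix(users):
    """Count inferred (home, work) flows across a collection of users,
    each given as a list of (hour, area) tweet records."""
    flows = Counter()
    for user_tweets in users:
        home, work = infer_home_work(user_tweets)
        if home and work and home != work:
            flows[(home, work)] += 1
    return flows
```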

What’s next? There is plenty more to explore in this research area: looking at whether predictions can be made more granular, or perhaps whether sentiment from social media can be worked in, or whether other platforms can also contribute. We will also start to work on some other data sources, making use of some of the exciting datasets being made available by places like the ADRN and CDRC.

Graham McNeill, Jonathan Bright and Scott A. Hale (2017) Estimating local commuting patterns from geolocated Twitter data, EPJ Data Science, 6:24.

By |2017-10-25T21:42:15+01:00October 25th, 2017|Research, Smart Cities, Social Media|0 Comments

Predicting elections with Wikipedia data: new article in EPJ Data Science

Taha Yasseri and I have a new article out in EPJ Data Science which looks at the subject of electoral prediction using page view data from Wikipedia. Forecasting electoral results with some form of novel internet data is really a growth area in the literature at the moment, with a large number of research teams trying out different approaches. However I think our paper nevertheless makes a novel contribution, in a couple of respects. First, our model is theory driven rather than taking a machine learning approach, by which I mean that we try to theorise the mechanism generating Wikipedia page view data and how that relates to electoral outcomes, rather than simply looking at a range of indicators to see if any of them offers any predictive power. Second, we test a reasonably large set of electoral results: a group of around 60 parties in the European Parliament elections in 2014, whereas many other studies look at prediction only in the case of one election.

We found a number of things: we are able to show that the majority of online information seeking happens in the couple of days before the election (left hand panel in the figure); we are also able to show that page views do seem to offer indicators of a number of things happening in the election, such as turnout levels (right hand panel in the figure) and overall electoral results. Wikipedia was particularly good at predicting the emergence of small parties which were shooting to prominence (something which has become a feature of European politics in the last decade), even if it did tend to overstate their final result.

In future work, we intend to extend the analysis to more countries and more types of information seeking.

By |2016-08-26T16:48:27+01:00August 26th, 2016|Politics and Democracy, Research, Social Web|0 Comments

The Social News Gap: New article in Journal of Communication

I have a new article out in the Journal of Communication which analyses which types of news get shared the most. Based on articles published on BBC News, the research shows that even though readership drives sharing in general, certain types of article lend themselves more to being shared than others.

Figure 2

The graphic above gives a glimpse of some of the results, by visualising the relationship between reading and sharing for different categories of news article. We can see that reading and sharing are not in a linear relationship: rather, some types of article are well shared but not well read, and vice versa. For example, stories about technology and social welfare seem to be shared more, whilst stories about violent crime and accidents are shared less. This creates a social “news gap” (following Boczkowski and Mitchelstein’s traditional news gap) whereby people’s preferences for sharing and their preferences for reading diverge. I suggest that, as more and more people start to consume news on social media, the implications of this become potentially more profound, as social media starts to filter out certain types of news whilst emphasising others.
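One simple way to see such a gap, sketched here in Python with made-up figures (the paper's analysis is model-based, so treat this purely as an illustration of the concept), is to rank categories by shares per read:

```python
def news_gap_ranking(stats):
    """Rank article categories by shares per read, to show where
    sharing preferences and reading preferences diverge.
    `stats` maps category -> (reads, shares)."""
    ratios = {cat: shares / reads for cat, (reads, shares) in stats.items()}
    return sorted(ratios, key=ratios.get, reverse=True)
```

A category that sits high in this ranking but low in raw readership is one that social media amplifies relative to what people actually read, and vice versa.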


By |2016-07-01T10:41:31+01:00July 1st, 2016|News, Research, Social Media|0 Comments

Getting ggplot2 to work with igraph

One common criticism of the otherwise excellent ggplot2 is that it doesn’t come with network visualisation capability. Network vis is so popular at the moment that this seems like a big omission; but network data is also quite distinctive in terms of structure (and the layout algorithms would need implementing), so I can see why it hasn’t been integrated.

Moritz Marbach has a great post explaining how to easily get ggplot2 up and running with network data. It was still one of the top hits on Google when I checked it out recently for a project. However the post is from 2011 so is getting a little dated – it uses the sna package rather than igraph (which seems to be becoming a standard for network science) and also has a few deprecated ggplot2 commands in it. So I thought I’d add a bit of an update here to the code.

As Marbach explains, the secret to getting ggplot2 to draw networks is quite simple: get a network analysis package to give you a list of nodes and edges, plus node layout information as a series of X,Y coordinates. Then you can simply plot the nodes with geom_point and the edges with geom_segment. Put together it looks something like this:


library(igraph)
library(ggplot2)

# read in the network (GML format in this example)
g <- read.graph("a-network.gml", format = "gml")

# get the node coordinates from a layout algorithm
plotcord <- data.frame(layout.fruchterman.reingold(g))
colnames(plotcord) <- c("X", "Y")

# get the edges, which are pairs of node IDs
edgelist <- get.edgelist(g)

# convert to a four-column data frame with source and destination coordinates
edges <- data.frame(plotcord[edgelist[, 1], ], plotcord[edgelist[, 2], ])
colnames(edges) <- c("X1", "Y1", "X2", "Y2")

ggplot() +
  geom_segment(aes(x = X1, y = Y1, xend = X2, yend = Y2),
               data = edges, size = 0.5, colour = "grey") +
  geom_point(aes(x = X, y = Y), data = plotcord)



OK it still needs some work! But anyone familiar with ggplot2 can do the rest.

By |2015-12-07T16:49:42+00:00December 7th, 2015|ggplot2, igraph, R|2 Comments

The History of Social News

I am giving a presentation tomorrow at the IJPP conference here in Oxford. It’s being hosted by the Reuters Institute who are world leaders in the study of contemporary news organisations, and I’m really excited to be going.

Together with Scott Hale I am giving a presentation on the “history” of social news. We have an 8-year dataset (2002-2010) consisting of links to millions of news articles which we have used to trace the beginnings of social media news sharing. We are interested to know whether the types of news being shared have changed over time as social media platforms have massified; we’re also interested in looking at whether site design changes (such as bringing in sharing buttons) have had a major impact.

Twitter - Facebook Comparison

The project is at an early stage but the results are pretty interesting so far (to me). To give one tidbit, we show that in this large scale dataset there is only a weak correlation between sharing on Twitter and Facebook at the article level, with Twitter tending to share more sports news than Facebook (see image).

By |2015-09-16T12:56:18+01:00September 16th, 2015|News, Research, Social Media|0 Comments

The real component of virtual learning

Monica Bulger, Cristobal Cobo and I have a new paper out in Information, Communication and Society where we investigate real world meetings organised by MOOC users. These meetings are somewhat paradoxical: one of the advantages of MOOCs is, of course, that they are online and can be accessed anywhere without the need to travel; yet lots of users are building in this face-to-face component themselves, all over the world (see the map). We asked whether this was because they felt they were missing something from the MOOC experience (and were therefore recreating the classroom) or whether it was more of an excuse to network and socialise (hence recreating the after-school social experience). We find evidence for both motivations, though the former is stronger.

Meetup - Map

These meetings show important potential to address one of the strongest criticisms of MOOCs, which is that they are only for the really self-motivated and that many people drop out: by creating local learning communities, perhaps motivation can increase. Yet this also cuts against the idea of global learning: it was clear, for obvious reasons, that most meetings take place in big cities in the developed world. Those in rural areas or in developing countries simply have fewer people to meet with.

By |2015-07-28T08:48:14+01:00July 28th, 2015|Research, Social Web|0 Comments

Public Policy, Big Data and Smart Cities

I have just got back from the International Conference on Public Policy in Milan, where I was attending a stream of internet and public policy panels, as well as presenting a paper on explaining open data outcomes which I am currently working on together with some colleagues here at the OII. The conference itself was huge: in only its second year it attracted around 1,300 registrations from across the policy sciences. Our sessions on the internet were quite well attended, though I didn’t feel like we attracted many people beyond those already interested in the internet.


I acted as discussant on a couple of panels on big data, with a particularly interesting one on smart cities. I think the smart city field is where public policy and big data overlap most closely: using big data to govern the city has already captured a lot of attention in both academia and policy itself, with examples of initiatives such as the Mayor’s Office of Data Analytics in New York or the Centro de Operações in Rio de Janeiro. It’s interesting to see the potential these places have for improving existing administration.

It’s also worth highlighting all the challenges to smart city development, from opening data to getting the right skills in place. This is probably the reason why large cities which have created these kinds of data “nerve centres” are leading the way: they can overcome these obstacles in a concentrated way, with direct support from the hierarchy. They raise the interesting possibility, furthermore, that they will become not just supporters of policy execution, but places where policy is set and defined. That would be revolutionary.

By |2015-07-10T08:14:57+01:00July 10th, 2015|Civic Technology, Research|0 Comments