Estimating local commuting patterns from geolocated Twitter data

Over the last decade or so there has been an explosion of research interest in measuring (and forecasting) traffic and commuting patterns. Part of this is driven by ever increasing human mobility: in 2016 alone, people in the UK travelled a collective 800 billion kilometres, more than 60% of which was by car, and congestion on these networks costs billions of pounds a year. But the research agenda is also being driven by the emergence of a wide variety of new forms of data, which build on and supplement more traditional magnetic loop technologies: data re-purposed from mobile phone records, collected through IoT-enabled smart sensors, or emerging from freely contributed traces on social media platforms. These data sources offer huge potential to improve on existing methods of data collection, such as the hated transport census.

As part of a research project entitled NEXUS: Real Time Data Fusion and Network Analysis for Urban Systems (funded by InnovateUK), a team of researchers at the OII and I have been looking into some of these possibilities. Our first paper on the subject, entitled “Estimating Local Commuting Patterns from Geolocated Twitter Data”, has just been published in EPJ Data Science. The paper addresses the extent to which we can make use of geolocated Twitter data to estimate commuting flows between local authorities (you can have a play with some of the underlying data using the map below, which shows census commuting figures and Twitter-based estimates for local authorities around Britain).

We draw two main conclusions from the paper. First, we show that, using heuristics to map individuals who post geolocated tweets to home and work areas, we can use Twitter to produce accurate representations of the overall structure of commuting in mainland Great Britain; these estimates improve considerably on other ‘low information’ methods of estimating commuting flows (we compared them in particular to the popular radiation model). Second, and probably most importantly, we show that the results are not particularly sensitive to demographic characteristics. When looking at commuting flows broken down by gender, age group and social class, we found that Twitter still offered reasonable estimates for all of these sub-categories. We think this is important because a key concern about using social media data for this type of proxy estimation is the extent to which the ‘demographic bias’ of social media users (who are often younger, better educated and wealthier than the population average) might also result in biased predictions (for example, better prediction of the travel patterns of younger people). We show that, at least in our context, this is not the case.
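
To give a flavour of the kind of heuristic involved, here is a heavily simplified sketch (not the exact procedure from the paper; the tweets data frame and its columns user_id, local_authority and hour are hypothetical). The idea is to treat the area where a user most often tweets at night as ‘home’, the area where they most often tweet during working hours as ‘work’, and then count users for each home–work pair:

library(dplyr)

# tweets: one row per geolocated tweet, with hypothetical columns
#   user_id, local_authority, hour (0-23, local time)

# "home" = modal tweeting location at night
home <- tweets %>%
  filter(hour >= 20 | hour < 6) %>%
  count(user_id, local_authority) %>%
  group_by(user_id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(user_id, home_la = local_authority)

# "work" = modal tweeting location during working hours
work <- tweets %>%
  filter(hour >= 9, hour < 17) %>%
  count(user_id, local_authority) %>%
  group_by(user_id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(user_id, work_la = local_authority)

# estimated commuting flows: number of users per home-work pair
flows <- inner_join(home, work, by = "user_id") %>%
  count(home_la, work_la, name = "commuters") %>%
  arrange(desc(commuters))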

What’s next? There is plenty more to explore in this research area: looking at whether predictions can be made more granular, or perhaps whether sentiment from social media can be worked in, or whether other platforms can also contribute. We will also start to work on some other data sources, making use of some of the exciting datasets being made available by places like the ADRN and CDRC.

Graham McNeill, Jonathan Bright and Scott A. Hale (2017) Estimating local commuting patterns from geolocated Twitter data. EPJ Data Science 6:24.
https://doi.org/10.1140/epjds/s13688-017-0120-x

Predicting elections with Wikipedia data: new article in EPJ Data Science

Taha Yasseri and I have a new article out in EPJ Data Science which looks at the subject of electoral prediction using page view data from Wikipedia. Forecasting electoral results with some form of novel internet data is a real growth area in the literature at the moment, with a huge number of research teams trying out different approaches. I think our paper nevertheless makes a novel contribution, in a couple of respects. First, our model is theory driven rather than taking a machine learning approach: by which I mean that we try to theorise the mechanism generating Wikipedia page view data and how it relates to electoral outcomes, rather than simply looking at a range of indicators to see if any of them offers predictive power. Second, we test a reasonably large set of electoral results: a group of around 60 parties in the 2014 European Parliament elections, whereas many other studies look at prediction only in the case of a single election.
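
To give a purely illustrative sense of what that looks like in practice (this is not the model from the paper, and the parties data frame and its columns are hypothetical), the basic idea is to relate changes in a party’s share of Wikipedia attention to changes in its vote share, with an allowance for new parties:

# parties: one row per party, with hypothetical columns
#   vote_share_change - change in vote share relative to the previous EP election
#   view_share_change - change in the party's share of Wikipedia page views
#                       in the week before each election
#   new_party         - indicator for parties contesting for the first time
m <- lm(vote_share_change ~ view_share_change + new_party, data = parties)
summary(m)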

We found a number of things: we are able to show that the majority of online information seeking happens in the couple of days before the election (left-hand panel in the figure); we are also able to show that page views do seem to offer indicators of a number of things happening in the election, such as turnout levels (right-hand panel in the figure) and the overall electoral results. Wikipedia was particularly good at predicting the emergence of small parties shooting to prominence (something which has become a feature of European politics in the last decade), even if it did tend to overstate their final result.

In future work, we intend to extend the analysis to more countries and more types of information seeking.

The Social News Gap: New article in Journal of Communication

I have a new article out in the Journal of Communication which analyses which types of news get shared the most. Based on articles published on BBC News, the research shows that even though readership drives sharing in general, certain types of articles lend themselves more to being shared than others.

[Figure 2: reading versus sharing for different categories of news article]

The graphic above gives a glimpse of some of the results, visualising the relationship between reading and sharing for different categories of news article. We can see that reading and sharing are not in a linear relationship: rather, some types of article are widely shared but not widely read, and vice versa. For example, stories about technology and social welfare seem to be shared more, whilst stories about violent crime and accidents are shared less. This creates a social “news gap” (following Boczkowski and Mitchelstein’s traditional news gap) whereby people’s preferences for sharing and their preferences for reading diverge. I suggest that, as more and more people start to consume news on social media, the implications become potentially more profound, as social media starts to filter out certain types of news whilst emphasising others.
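
One simple way of quantifying the gap (a sketch only: the articles data frame with columns category, reads and shares is hypothetical, not the actual BBC data) is to compare shares per read across categories:

library(dplyr)

# articles: one row per article, with hypothetical columns category, reads, shares
gap <- articles %>%
  group_by(category) %>%
  summarise(reads = sum(reads), shares = sum(shares)) %>%
  mutate(shares_per_read = shares / reads) %>%
  arrange(desc(shares_per_read))  # categories shared more than their readership predicts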

 

Getting ggplot2 to work with igraph

7 December 2015

One common criticism of the otherwise excellent ggplot2 is that it doesn’t come with network visualisation capability. Network vis is so popular at the moment that this seems like quite a big omission; but network data is also quite unusual in terms of structure (and the layout algorithms would need implementing), so I can see why it hasn’t been integrated.

Moritz Marbach has a great post explaining how to easily get ggplot2 up and running with network data. It was still one of the top hits on Google when I checked recently for a project. However, the post is from 2011 so it is getting a little dated: it uses the sna package rather than igraph (which seems to be becoming the standard for network science) and also contains a few deprecated ggplot2 commands. So I thought I’d post an updated version of the code here.

As Marbach explains, the secret to getting ggplot2 to draw networks is quite simple: get a network analysis package to give you a list of nodes, a list of edges, and node layout information as a series of X,Y coordinates. Then you can simply plot the nodes with geom_point and the edges with geom_segment. Put together it looks something like this:

library(igraph)
library(ggplot2)

g <- read.graph("a-network.gml", format = "gml")

# get the node coordinates from a force-directed layout
plotcord <- data.frame(layout.fruchterman.reingold(g))
colnames(plotcord) <- c("X", "Y")

# get edges as pairs of numeric node IDs (names = FALSE avoids trouble
# when the graph has named vertices)
edgelist <- get.edgelist(g, names = FALSE)

# convert to a four-column edge data frame with source (X1, Y1) and
# destination (X2, Y2) coordinates
edges <- data.frame(plotcord[edgelist[, 1], ], plotcord[edgelist[, 2], ])
colnames(edges) <- c("X1", "Y1", "X2", "Y2")

ggplot() +
  geom_segment(aes(x = X1, y = Y1, xend = X2, yend = Y2),
               data = edges, size = 0.5, colour = "grey") +
  geom_point(aes(X, Y), data = plotcord)

Output:

[Figure: the resulting network plot]

OK, it still needs some work! But anyone familiar with ggplot2 can do the rest.
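
For what it’s worth, here is a minimal sketch of the kind of tidying I have in mind, reusing the plotcord and edges data frames from above and stripping out the axes and background with ggplot2’s blank theme:

ggplot() +
  geom_segment(aes(x = X1, y = Y1, xend = X2, yend = Y2),
               data = edges, size = 0.5, colour = "grey") +
  geom_point(aes(X, Y), data = plotcord, size = 2, colour = "steelblue") +
  theme_void()  # drop axes, ticks and background; only the topology matters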

The History of Social News

I am giving a presentation tomorrow at the IJPP conference here in Oxford. It’s being hosted by the Reuters Institute who are world leaders in the study of contemporary news organisations, and I’m really excited to be going.

Together with Scott Hale, I am giving a presentation on the “history” of social news. We have an eight-year dataset (2002-2010) consisting of links to millions of news articles, which we have used to trace the beginnings of social media news sharing. We are interested in whether the types of news being shared have changed over time as social media platforms have grown to a mass scale; we’re also interested in whether site design changes (such as the introduction of sharing buttons) have had a major impact.

[Figure: Twitter vs Facebook sharing comparison]

The project is at an early stage but the results so far are pretty interesting (to me, at least). To give one tidbit: in this large-scale dataset there is only a weak correlation between sharing on Twitter and sharing on Facebook at the article level, with sports news in particular tending to be shared more on Twitter than on Facebook (see image).
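
For the curious, the comparison boils down to something like this (a sketch with a hypothetical shares data frame, one row per article with columns twitter_shares, facebook_shares and section; not our actual pipeline):

# article-level association between the two platforms
cor(shares$twitter_shares, shares$facebook_shares, method = "spearman")

# Twitter's share of all sharing activity, by news section
library(dplyr)
shares %>%
  group_by(section) %>%
  summarise(twitter_prop = sum(twitter_shares) /
              sum(twitter_shares + facebook_shares))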

The real component of virtual learning

Monica Bulger, Cristobal Cobo and I have a new paper out in Information, Communication and Society in which we investigate real-world meetings organised by MOOC users. These meetings are somewhat contradictory: one of the advantages of MOOCs is, of course, that they are online and can be accessed anywhere without the need to travel; yet lots of users are building in this face-to-face component themselves, all over the world (see the map). We asked whether this was because they felt they were missing something from the MOOC experience (and were therefore recreating classrooms, in a sense) or whether it was more of an excuse to network and socialise (hence recreating the after-school social experience). We find evidence for both motivations, though the former is stronger.

[Figure: map of MOOC meetup locations around the world]

These meetings show important potential to address one of the strongest criticisms of MOOCs: that they only work for the really self-motivated and that many people drop out. By creating local learning communities, perhaps motivation can be increased. Yet this also cuts against the idea of global learning: it was clear, for obvious reasons, that most meetings take place in big cities in the developed world. Those outside urban areas, or in developing countries, simply have fewer people to meet with.

Public Policy, Big Data and Smart Cities

I have just got back from the International Conference on Public Policy in Milan, where I was attending a stream of internet and public policy panels, as well as presenting a paper on explaining open data outcomes which I am currently working on with some colleagues here at the OII. The conference itself was huge: in only its second year it attracted around 1,300 registrations from across the policy sciences. Our sessions on the internet were quite well attended, though I didn’t feel we attracted many people beyond those already interested in the internet.


I acted as discussant on a couple of panels on big data, including a particularly interesting one on smart cities. I think the smart city field is where public policy and big data overlap most closely: using big data to govern the city has already captured a lot of attention in both academia and policy itself, with initiatives such as the Mayor’s Office of Data Analytics in New York or the Centro de Operações in Rio de Janeiro. It’s interesting to see the potential these places have for improving existing administration.

It’s also worth highlighting the challenges to smart city development, from opening up data to getting the right skills in place. This is probably the reason why large cities which have created these kinds of data “nerve centres” are leading the way: they can overcome these obstacles in a concentrated way, with direct support from the top of the hierarchy. These centres raise the interesting possibility, furthermore, that they will become not just supporters of policy execution, but places where policy is set and defined. That would be revolutionary.

New Paper in European Union Politics

I have just published a paper in European Union Politics, together with Diego Garzia, Joseph Lacey and Alex Trechsel of the EUI. The paper is the fruit of a long-term research project examining potential ways of changing the European Parliament’s electoral system, focussed in particular on allowing people to vote for parties in any member state. It seems particularly relevant today, when protest parties such as Syriza and Podemos attract support (and criticism) from well outside their own borders.

The paper explores what would happen under such conditions of transnationalisation, examining both what types of people would be likely to vote “transnationally” and the extent to which overall levels of representation would improve. Great to have it in print.

GE2015 on social media

Last week we held a sort of social media hackathon in honour of the UK’s general election, looking at the reaction it generated on social media. We took what I believe was a fairly novel approach to the analysis, looking at social media reactions to individual candidates in constituencies (rather than just general hashtags or party leaders). The map below shows what the election results would have been if @mentions of these local candidates had been votes.
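
The exercise behind the map boils down to something like this (a sketch with a hypothetical mentions data frame, one row per tweet mentioning a candidate, with columns candidate, party and constituency; not our actual code):

library(dplyr)

# the "winner" in each constituency is simply the most-mentioned candidate
twitter_result <- mentions %>%
  count(constituency, party, candidate, name = "mentions") %>%
  group_by(constituency) %>%
  slice_max(mentions, n = 1, with_ties = FALSE) %>%
  ungroup()

# national "seat" totals under the mentions-as-votes assumption
count(twitter_result, party, sort = TRUE)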

[Map: election results if @mentions of local candidates had been votes]

We are still digesting the data, so I’m not yet sure what the main findings are, though we did get some interesting results on the diverging social media “reach” of different candidates, and on the way the relationship between Twitter impact and vote share differs by party.


Check out our full range of work here. More to follow…

TICTEC 2015

A couple of weeks ago I gave a presentation at TICTEC, mySociety‘s inaugural research conference on the impact of civic technology. It was an inspiring event with so many presentations from different organisations trying to make a difference in countries all over the world.


There were a few academics there and hopefully we added some value too. I gave a presentation on a current project we are running with ULB exploring the dynamics of the website lapetition.be.


It was interesting, however, to see how differently academia and civic tech conceptualise research, with us academics coming in for some stick for taking years to produce findings, which makes them difficult to integrate into the development of new tools. But there were also lots of good examples of researchers working with civic tech organisations to try out new ways of reaching people or to do research on impact – this sort of work is the future of political science, in my opinion.