
Understanding news story chains using information retrieval and network clustering techniques

I have a new draft paper out with my colleague Tom Nicholls, entitled Understanding news story chains using information retrieval and network clustering techniques. In it we address what we perceive as an important technical challenge in news media research, which is how to group together articles that all address the same individual news event. This challenge is unmet by most current approaches in unsupervised machine learning as applied to the news, which tend to focus on the broader (also important!) problem of grouping articles into topic categories. It is, in general, a difficult problem, as we are looking for what are typically small "chains" of content on the same event (e.g. four or five different articles) amongst a corpus of tens of thousands of articles, most of which are unrelated to each other.

Our approach makes use of algorithms and insights drawn from both information retrieval (IR) and network clustering to develop a novel unsupervised method of news story chain detection. IR techniques (which are used to build things like search engines) in particular haven't been much employed in the social sciences, where the focus has been more on machine learning. But these algorithms were much closer to our problem: connecting small numbers of related news articles is quite similar to searching a huge corpus of documents in response to a specific user query.
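To give a rough flavour of the general recipe (a simplified sketch of the "similarity plus clustering" idea, not the actual algorithm from the paper): score pairwise similarity between articles with standard IR tools, link pairs that score above a threshold, and treat the connected components of the resulting graph as candidate story chains. In Python it might look something like this, assuming scikit-learn and networkx are available:

# Simplified sketch of the general idea, not the method from the paper.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Fire breaks out at city centre warehouse",
    "Warehouse fire in city centre under control, say firefighters",
    "Parliament debates new housing bill",
]

# IR-style step: TF-IDF vectors and pairwise cosine similarities
tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
sims = cosine_similarity(tfidf)

# Network step: link sufficiently similar pairs, then read off
# connected components as candidate story chains
THRESHOLD = 0.3  # arbitrary illustrative value
g = nx.Graph()
g.add_nodes_from(range(len(articles)))
for i in range(len(articles)):
    for j in range(i + 1, len(articles)):
        if sims[i, j] >= THRESHOLD:
            g.add_edge(i, j)

chains = [c for c in nx.connected_components(g) if len(c) > 1]
print(chains)  # [{0, 1}]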

The resulting algorithm works pretty well, though it is very difficult to validate properly because of the nature of the data! We use it to pull out a couple of interesting first-order descriptive statistics about news stories in the UK; for example, the graphic above shows the typical evolution of news stories after the publication of an initial article.

Just a draft at the moment so all feedback welcome!

January 31st, 2018 | News, Python, Research, Social Science Computing

Measuring Ministerial Career Dynamics

I have a new article out in West European Politics with my colleagues Holger Döring and Conor Little which looks at the career dynamics of ministers in seven European countries over the last 50 or so years. We were interested to a large extent in factors relating to their stability in the job, but also in more general things such as how power gradually turns over in most democracies. We find an important divergence in the career trajectories of senior and junior ministers in most countries, with a small core of senior ministers staying in power for long periods of time whilst a larger and more fluid mass of junior ministers moves in and out of power more frequently.


Ministerial careers aren't a core area of substantive research for me, but there was a fairly extensive computational element to the project which did get me interested. It makes use of the wonderful ParlGov political data project, which was really useful both for organising collaborative data collection and for storing the data.

This project was also my first foray into using SQL seriously for academic research. An SQL database is a wonderfully neat format for organising a research project if you've got lots of different types of data which only need to be mashed together at the analysis stage, and it saves time on the recombining as well. But it does create a bit of overhead, and I'm still not sure it belongs in the core computational social science toolkit (unless you are in Hadoop territory with the size of your dataset, in which case the SQL equivalent Hive really comes into its own).
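To make that concrete (a minimal sketch with invented table and column names, not the actual ParlGov schema), the pattern is simply to keep each kind of data in its own table and only join things up when you come to the analysis:

# Minimal sketch: table and column names are invented for illustration,
# not the actual ParlGov schema. sqlite3 is in the Python standard library.
import sqlite3

conn = sqlite3.connect("careers.db")
cur = conn.cursor()

# Each type of data lives in its own table...
cur.execute("CREATE TABLE IF NOT EXISTS ministers "
            "(id INTEGER PRIMARY KEY, name TEXT, party TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS spells "
            "(minister_id INTEGER, cabinet TEXT, rank TEXT, "
            "start_date TEXT, end_date TEXT)")

# ...and only gets mashed together when it is needed for analysis
cur.execute("""
    SELECT m.name, COUNT(*) AS n_spells
    FROM ministers m
    JOIN spells s ON s.minister_id = m.id
    WHERE s.rank = 'senior'
    GROUP BY m.name
    ORDER BY n_spells DESC
""")
print(cur.fetchall())
conn.close()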

January 19th, 2015 | Research, Social Science Computing

Ideology and Social Structure on Twitter

Last week I was at the VOXPOL conference at King's College London. The vast majority of researchers there were talking about terrorists and extremists, so I was a bit out of my field; though interestingly they were also all talking about big data and computational social science, which seems to be a staple at every social science conference these days. There is an ongoing debate about whether we need more teams of social scientists plus computer scientists, or whether social scientists need to up their computing skills. I think both approaches are fine in the short term, but in the long run social scientists need to skill up, as computer scientists won't always be interested in our questions (we will want to use automatic content analysis in social science long after it becomes a boring topic in computer science, in the same way that we are still using the t-test).


I gave a presentation on the relationship between ideology and social structure on Twitter, arguing that political groups at the ideological extremes are more likely to exhibit closed and centralising communication patterns than those in the middle, which is an early result from a joint project between myself, Diego Garzia and Alex Trechsel. The main point of the presentation was to discuss different ways of measuring closure and centralisation, which I'm still not sure about. Luckily most of our measures point in a similar direction, so I'm pretty sure there's an interesting result in there somewhere.

Point size legends in matplotlib and basemap plots

Python’s matplotlib and basemap can do a lot. I think I still prefer R’s ggplot, but I much prefer manipulating data in Python and sometimes it’s more natural to do the analysis in the same language.

Recently I have been combining the two packages to create maps of events happening around the world. The size and colour of the marker on the map can be varied, meaning it’s possible to fit quite a lot of information in one graphic.

[Figure: Meetup map]
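For anyone wanting to reproduce the basic setup, the plotting side looks roughly like this (a minimal sketch with made-up coordinates and values, not the code behind the map above):

# Minimal sketch with made-up data: events plotted as points whose size and
# colour carry extra information. Assumes matplotlib and basemap are installed.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

lons = np.array([-0.13, 2.35, -74.0, 151.2])  # example longitudes
lats = np.array([51.5, 48.86, 40.7, -33.87])  # example latitudes
sizes = np.array([10, 50, 100, 200])          # e.g. number of attendees
values = np.array([0.2, 0.5, 0.7, 0.9])       # e.g. some other measure

m = Basemap(projection='robin', lon_0=0)
m.drawcoastlines(linewidth=0.5)

x, y = m(lons, lats)  # project lon/lat to map coordinates
sc = m.scatter(x, y, s=sizes, c=values, edgecolors='none', zorder=5)
plt.show()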

However, one thing I really struggled with was the legend. If you use different colour points, matplotlib makes it easy to add a colour bar, with something like:

c = plt.colorbar(orientation='vertical', shrink=0.5)  # colour bar for the current plot
c.set_label("My Title")

The shrink argument gives you a quick way of adjusting the size of the bar relative to the graphic.

However, I couldn't find an equivalent command which would give me a legend for the size of the points (something that ggplot does easily). After fiddling with get_label() for ages, and trying to capture and do something useful with the results of plt.scatter(), I finally came across this useful post, which basically says that this feature doesn't really exist and that if you want such a legend you have to make it yourself. The trick to doing it is quite simple: draw three or four points on your plot with empty coordinates ([], []), so they won't actually show up, each one representing a certain size in your scale. These points can then be passed to plt.legend with some hand-written labels. Overall it looks something like this:

# Invisible "proxy" points, one for each size shown in the legend
l1 = plt.scatter([], [], s=10, edgecolors='none')
l2 = plt.scatter([], [], s=50, edgecolors='none')
l3 = plt.scatter([], [], s=100, edgecolors='none')
l4 = plt.scatter([], [], s=200, edgecolors='none')

labels = ["10", "50", "100", "200"]

# Pass the proxy points and hand-written labels to plt.legend (loc=8 is 'lower center')
leg = plt.legend([l1, l2, l3, l4], labels, ncol=4, frameon=True, fontsize=12,
                 handlelength=2, loc=8, borderpad=1.8,
                 handletextpad=1, title='My Title', scatterpoints=1)

The results:

[Figure: map with point-size legend]

Well, I still think that should be easier, but at least it works, and it also gives you a lot of flexibility over what goes in the legend.

Python and Social Media Data for the Social Sciences

In July I gave two short workshops at the OII's Summer Doctoral Programme and also at the Digital Humanities at Oxford Summer School. I had two great groups of bright PhD students and postdocs to teach. The sessions were only two hours long, and it's a big challenge to teach meaningful programming skills to complete beginners in such a period (in the end, I decided to walk them through a small example project of getting news articles from an RSS feed and checking how many times they had been shared on Facebook, providing most of the code myself). I also rely on lots of technology which I can't fully control, which is a risk (I want to teach people to connect to things like the Facebook API, which means I need to rely on getting Python working on their machines, on their machines connecting to the internet through the visitor wifi, and on the FB API being up and running during class). But the tech worked, mostly, and the overall experience was really positive.
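For the curious, the exercise looked roughly like the sketch below. This is a reconstruction rather than the actual workshop code, and the Facebook call is an assumption based on how the public Graph API endpoint worked at the time; it has since changed and now generally requires an access token.

# Rough reconstruction of the workshop exercise, not the original code.
# Assumes the feedparser and requests packages are installed. The Graph API
# call is an assumption based on the public endpoint of the time; the
# current API differs and needs an access token.
import feedparser
import requests

FEED_URL = "http://feeds.bbci.co.uk/news/rss.xml"  # example feed

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:10]:
    resp = requests.get("https://graph.facebook.com/",
                        params={"id": entry.link})
    data = resp.json()
    shares = data.get("shares", 0)  # field name is an assumption
    print(entry.title, shares)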


In the future, however, I strongly believe that social science needs a better way of integrating computer programming skills into undergraduate and postgraduate teaching, so that these doctoral-level workshops can be more about mastering skills and less about training beginners. So I suppose the hope is that in a few years I won't need to teach such courses any more, even if I do enjoy them.

August 1st, 2014 | Python, Social Science Computing, Social Web, Teaching

A “big data” approach to studying parliamentary scrutiny

I have a new article out in the British Journal of Politics and International Relations: In Search of the Politics of Security. In it, I take what could be called a big data approach to the study of parliamentary scrutiny, by scraping information on the passage of legislation from the UK parliament's website. The website's current incarnation is relatively recent, and there isn't that much legislation passed every year, so I was only able to scrape information on around 150 successfully passed bills. However, the information which does come out is quite rich – all recorded votes, the amount of time it took to pass the legislation, links to debates and committee hearings, etc. So I still think of it as a kind of big data approach.
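As a flavour of what the scraping involves (a minimal sketch only: the URL and the selector are assumptions for illustration, not the code actually used for the paper):

# Illustrative sketch only. The URL and the HTML structure are assumptions;
# the real parliament site will differ and should be inspected (and its
# robots.txt respected) before scraping. Assumes requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup

BILLS_INDEX = "https://bills.parliament.uk/"  # hypothetical entry point

html = requests.get(BILLS_INDEX).text
soup = BeautifulSoup(html, "html.parser")

# Collect links that look like individual bill pages (the pattern is a guess)
bill_links = [a["href"] for a in soup.find_all("a", href=True)
              if "/bills/" in a["href"]]

print(len(bill_links), "candidate bill pages found")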

My question was pretty simple: does the UK parliament offer less scrutiny of legislation which relates to crime and national security? This emerges from my interest in securitization theory and security politics, which I must admit I have recently been drifting away from slightly (as the war on terror has died down, I also think it is becoming slightly less relevant). The project started off as an attempt to measure the scale of this difference, based on what I perceive as a quite widespread assumption that legislators essentially roll over when the government wants to toughen up crime or security law. In the end, however, I found a relationship in the other direction – such legislation seems to get more attention and scrutiny. It's a smallish dataset and a limited time period, so the conclusions aren't hard and fast; nevertheless, I think it's a bit of a challenge to the way security politics is often conceptualised.

June 30th, 2014 | Research, Social Science Computing

Why do MOOC users meet face to face?

Last week Monica Bulger, Cristobal Cobo and I presented a paper at the ICA's pre-conference on higher education innovation. Monica and Cris are the experts in this area and did most of the heavy lifting, but I was pleased to take part, mainly out of a professional curiosity about how Massive Open Online Courses may or may not be changing the face of higher education. In the paper we looked in particular at patterns of offline meetups amongst the users of these online courses, using data from the Meetup API (my role being to facilitate data gathering and manipulation). Meetup has an open and generous stance on API data, and after a bit of coding I was able to extract information on several thousand face-to-face meetings of students taking part in Coursera courses in over 100 countries around the world.
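The data gathering looked roughly like the sketch below. The endpoint and parameters are assumptions based on the Meetup API as it was at the time (it has since been retired), so treat this as illustrative rather than the code actually used for the paper.

# Rough sketch of the data gathering, not the actual project code.
# The endpoint and parameters are assumptions based on the Meetup API v2
# of the time; that API has since been retired.
import requests

API_KEY = "YOUR_MEETUP_API_KEY"  # placeholder

resp = requests.get("https://api.meetup.com/2/open_events",
                    params={"text": "coursera", "key": API_KEY, "page": 200})
events = resp.json().get("results", [])

for event in events:
    venue = event.get("venue", {})
    # keep name, country and coordinates for mapping
    print(event.get("name"), venue.get("country"),
          venue.get("lat"), venue.get("lon"))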

[Figure: Meetup map]

A few more clicks on Wordle produced a word cloud of the titles of each meetup, which I can't resist including because it looks so nice, even if it probably isn't a good way of doing science.

[Figure: word cloud of meetup titles]

What does it all mean? Beyond showing the impressive worldwide reach of Coursera, and the fact that people like face-to-face interaction when they are learning, we are still deciding, to be honest with you. Suggestions welcome.

Computational Social Science: Social Contagion, Collective Behaviour, and Networks

I am part of the organising committee for this event, which is part of my growing interest in all things related to sociophysics. The call for abstracts follows:


Computational Social Science: Social Contagion, Collective Behaviour, and Networks
to be held in Lucca, Italy, 24-25 September 2014

Website: http://cssworkshop.oii.ox.ac.uk/

Important Dates:
Abstract submission deadline: 22 June 2014
Conference dates: 24-25 September 2014

Event Overview
Technology-mediated social collectives are taking an important role in the design of social structures. Yet our understanding of the complex mechanisms governing networks and collective behaviour is still deplorably shallow. Fundamental concepts of on- and off-line networks such as power, authority, leader-follower dynamics, consensus emergence, information sharing, conflict, and collaboration are still not well defined and investigated. These are all crucial to illuminate the advantages and pitfalls of collective decision-making, which can cancel out individual mistakes, but also spiral out of control.
In recent endeavours, data from Twitter, Facebook, Google+, Wikipedia, and weblogs have been shown to strongly correlate with, and even predict, elections, opinions, attitudes, movie revenues, and oscillations in the stock market, to cite a few examples. Similar data have provided insights into the mechanisms driving the formation of groups of interest, topical communities, and the evolution of social networks. They have also been used to study polarization phenomena in politics, diffusion of information, and the dynamics of collective attention. However, a deeper understanding of these phenomena is still very much in demand. In parallel, and even preceding the surge in interest towards social media, the area of agent-based modeling (ABM) has grown in scope, focus and capability to produce testable hypotheses, going beyond the original goal of explaining macroscopic behaviors from simple interaction rules among stylized agents.
The aim of this satellite is to address the question of ICT-mediated social phenomena emerging over multiple scales, ranging from the interactions of individuals to the emergence of self-organized global movements. We would like to gather researchers from different disciplines and methodological backgrounds to form a forum to discuss ideas, research questions, recent results, and future challenges in this emerging area of research and public interest.

May 23rd, 2014 | Social Science Computing, Sociophysics

Can social data be used to predict elections?

I've just started a new research blog with my colleague Taha Yasseri. It has two aims: we want to know if and when social data might be useful in election prediction, and we want to see if this knowledge teaches us anything about the political process. It's also interesting to experiment with the idea of blogging research rather than going the usual journal route (though I imagine a paper or two will result anyway). It's much quicker and rougher, but it definitely satisfies my urge to do things quickly. We hope it will make the finished output better as well.

[Figure: Wikipedia electoral information seeking, EU elections]

The above image is an excerpt from the first post, on electoral information seeking in 19 different countries. We find that, essentially, people look for information much more after an election has already finished than before it, probably in response to the election itself as a media event.
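If you want to play with this kind of data yourself, daily Wikipedia page view counts are now easy to pull down. The sketch below uses the current Wikimedia REST pageviews API, which postdates this post (its data only starts in mid-2015), so it is not what we used for the original analysis.

# Illustrative sketch: pull daily page views for an election-related article
# and compare interest before and after polling day. Uses the Wikimedia REST
# pageviews API, whose data starts in mid-2015, so this is not the source
# behind the original analysis.
import requests

ARTICLE = "United_States_presidential_election,_2016"  # article title as it was at the time
URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/{article}/daily/{start}/{end}")

resp = requests.get(URL.format(article=ARTICLE, start="20161025", end="20161122"),
                    headers={"User-Agent": "pageview-demo"})
items = resp.json()["items"]

election_day = "20161108"
before = sum(i["views"] for i in items if i["timestamp"][:8] < election_day)
after = sum(i["views"] for i in items if i["timestamp"][:8] >= election_day)
print("views before:", before, "views after:", after)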

I’ll be cross posting a bit more as the blog develops.

April 8th, 2014 | Research, Social Science Computing, Social Web

Teaching Inferential Statistics with Netlogo

Last term I decided to try using a NetLogo simulation in stats class to help explain some basic principles of inferential statistics. The advantage of having a simulation package is that students can see for themselves that things like the standard error really do "work" (i.e. they offer a good estimate of what they are supposed to, in this case the standard deviation of the sampling distribution). This is something you can't see if you're working with just one real-world sample, and the maths which allows us to derive these concepts is too complex for such a class.
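The same demonstration can be sketched outside NetLogo as well; the snippet below (a quick illustration in Python rather than the NetLogo model itself) repeatedly draws samples from a population and checks that the usual standard error formula tracks the spread of the sample means.

# Quick illustration in Python/numpy, not the NetLogo model itself:
# the standard error estimated from one sample should approximate the
# standard deviation of the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)

n = 30            # sample size
n_samples = 5000  # number of repeated samples

sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(n_samples)])

# Spread of the sampling distribution, observed directly via simulation
print("sd of sample means:", sample_means.std(ddof=1))

# Standard error estimated from a single sample, as done in practice
one_sample = rng.choice(population, size=n)
print("estimated standard error:", one_sample.std(ddof=1) / np.sqrt(n))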


I was really impressed with NetLogo in particular as the package of choice: it is easy to install and get running, handles packages smartly, and has a clean and simple programming language behind it. The tidy graphical interface is also a major plus. Overall I think the students found it useful, something to engage with and play around with. A few of them noticed straight away that the interface could be reprogrammed, which I think is also quite stimulating. I also made use of the stats module developed by Charles Staelin.

If you are interested in trying out my model you can download it here. No guarantees about accuracy – indeed I’m sure there’s a mistake in there somewhere! All feedback is appreciated.

 

February 12th, 2014 | Social Science Computing, Teaching