
Understanding news story chains using information retrieval and network clustering techniques

I have a new draft paper out with my colleague Tom Nicholls, entitled Understanding news story chains using information retrieval and network clustering techniques. In it we address what we perceive as an important technical challenge in news media research, which is how to group together articles that all address the same individual news event. This challenge is unmet by most current approaches in unsupervised machine learning as applied to the news, which tend to focus on the broader (also important!) problem of grouping articles into topic categories. It is, in general, a difficult problem, as we are looking for what are typically small “chains” of content on the same event (e.g. four or five different articles) amongst a corpus of tens of thousands of articles, most of which are unrelated to each other.

Our approach draws on algorithms and insights from the fields of both information retrieval (IR) and network clustering to develop a novel unsupervised method of news story chain detection. IR techniques (which are used to build things like search engines) in particular haven’t been much employed in the social sciences, where the focus has been more on machine learning. But these algorithms are much closer to our problem: connecting a small number of related news articles is quite similar to the task of searching a huge corpus of documents in response to a specific user query.
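
To give a flavour of the general idea, here is a minimal sketch (not the exact method in the paper; the TF-IDF features and the 0.5 similarity threshold are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

def story_chains(articles, threshold=0.5):
    # articles: list of article texts; returns candidate chains as index lists
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(articles)
    sims = cosine_similarity(tfidf)
    g = nx.Graph()
    g.add_nodes_from(range(len(articles)))
    # link pairs of articles that look sufficiently similar...
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if sims[i, j] > threshold:
                g.add_edge(i, j)
    # ...and read chains off as connected components of the graph
    return [sorted(c) for c in nx.connected_components(g) if len(c) > 1]

The actual pipeline in the paper is considerably more refined than this.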

The resulting algorithm works pretty well, though it is very difficult to validate properly because of the nature of the data! We use it to pull out a couple of interesting first-order descriptive statistics about news stories in the UK: for example, the graphic above shows the typical evolution of news stories after the publication of an initial article.

Just a draft at the moment so all feedback welcome!

January 31st, 2018 | News, Python, Research, Social Science Computing

Point size legends in matplotlib and basemap plots

Python’s matplotlib and basemap can do a lot. I think I still prefer R’s ggplot, but I much prefer manipulating data in Python and sometimes it’s more natural to do the analysis in the same language.

Recently I have been combining the two packages to create maps of events happening around the world. The size and colour of the marker on the map can be varied, meaning it’s possible to fit quite a lot of information in one graphic.
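
A minimal sketch of that kind of plot, with random points standing in for real event data (the projection choice and the scaling of sizes are just illustrative):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

m = Basemap(projection='robin', lon_0=0)
m.drawcoastlines()

# hypothetical event data: longitudes, latitudes, and a magnitude
lons = np.random.uniform(-180, 180, 50)
lats = np.random.uniform(-60, 70, 50)
mags = np.random.rand(50)

x, y = m(lons, lats)                        # project to map coordinates
m.scatter(x, y, s=mags * 200, c=mags, zorder=2)
plt.show()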

[Figure: world map of Meetup events, with marker size and colour varying]

However, one thing I really struggled with was the legend. If you use different coloured points, matplotlib makes it easy to add a colour bar, with something like:

c = plt.colorbar(orientation='vertical', shrink=0.5)
c.set_label("My Title")

The shrink argument gives you a quick way of adjusting the size of the bar relative to the graphic.
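
Note that plt.colorbar() needs an existing colour-mapped plot to attach to; a minimal self-contained version (with made-up random data) might look like this:

import numpy as np
import matplotlib.pyplot as plt

# hypothetical data: 50 points, coloured by a third value
x, y, vals = np.random.rand(3, 50)
plt.scatter(x, y, c=vals)

c = plt.colorbar(orientation='vertical', shrink=0.5)
c.set_label("My Title")
plt.show()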

However, I couldn’t find an equivalent command which would give me a legend for the size of the points (something that ggplot does easily). After fiddling with get_label() for ages, and trying to capture and do something useful with the results of plt.scatter(), I finally came across this useful post, which basically says that this feature doesn’t really exist: if you want such a legend you have to make it yourself. The trick to doing it is quite simple – draw three or four points on your plot with their locations set to empty lists ([], []), so they won’t actually show up, each one representing a certain size in your scale. These points can then be passed to plt.legend() with some hand-written labels. Overall it looks something like this:

# dummy points: empty data means nothing is actually drawn
l1 = plt.scatter([], [], s=10, edgecolors='none')
l2 = plt.scatter([], [], s=50, edgecolors='none')
l3 = plt.scatter([], [], s=100, edgecolors='none')
l4 = plt.scatter([], [], s=200, edgecolors='none')

labels = ["10", "50", "100", "200"]

# loc=8 is the 'lower center' position; scatterpoints=1 gives one marker per entry
leg = plt.legend([l1, l2, l3, l4], labels, ncol=4, frameon=True, fontsize=12,
                 handlelength=2, loc=8, borderpad=1.8,
                 handletextpad=1, title='My Title', scatterpoints=1)

The results:

[Figure: the map, now with a point size legend]

Well, I still think that should be easier, but at least it works, and it also gives you a lot of flexibility over what goes in the legend.

Python and Social Media Data for the Social Sciences

In July I gave two short workshops at the OII’s Summer Doctoral Programme and at the Digital Humanities at Oxford Summer School, to two great groups of bright PhD students and postdocs. The sessions were only two hours long, and it’s a big challenge to teach meaningful programming skills to complete beginners in such a period (in the end, I decided to walk them through a small example project: getting news articles from an RSS feed and checking how many times they had been shared on Facebook, providing most of the code myself). I also rely on lots of technology which I can’t fully control, which is a risk: I want to teach people to connect to things like the Facebook API, which means relying on getting Python working on their machines, on their machines connecting to the internet through the visitor wifi, and on the Facebook API being up and running during class. But the tech worked, mostly, and the overall experience was really positive.
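
The example project looked roughly like this (a sketch, not the exact class code; the feed URL is hypothetical, and it assumes the Graph API behaviour of the time, where a plain GET for a URL returned its share count):

import feedparser
import requests

# hypothetical RSS feed URL
feed = feedparser.parse('http://example.com/news/rss')

for entry in feed.entries[:10]:
    # at the time, the Graph API returned a 'shares' field for a URL
    resp = requests.get('https://graph.facebook.com/',
                        params={'id': entry.link})
    shares = resp.json().get('shares', 0)
    print(entry.title, shares)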


In the future, however, I strongly believe that social science needs a better way of integrating computer programming skills into undergraduate and postgraduate teaching, so that these doctoral-level workshops can be more about mastering skills and less about training beginners. So I suppose the hope is that in a few years I won’t need to teach such courses any more, even if I do enjoy them.

August 1st, 2014 | Python, Social Science Computing, Social Web, Teaching

Research Skills for the World of Big Data

Last week I finished teaching a five-day workshop entitled ‘Research Skills for the World of Big Data’ at the EUI, a kind of follow-up to a course I taught there in 2012. We had about 15 students along, interested in learning how Python programming skills can help with social research. I really enjoyed teaching the workshop, and the feedback was very positive (average evaluation 9.11/10). It is a big challenge to teach programming to beginners, especially social scientists, who I think are used to different methods of learning. Nevertheless the students were committed and really smart, which makes things easier. I learnt a lot doing the workshop and will improve a few things next time.

May 15th, 2013 | Programming, Python, Social Science Computing

Combining Python’s Basemap and NetworkX

Recently I have been involved with a project mapping relationships between countries in terms of a social network. There are a lot of social network analysis packages around; I prefer Python’s NetworkX largely because I’m already so used to Python.

The first thing everyone wants to see when doing SNA is the network graph – understandable, of course, as they look visually attractive and are a welcome respite in a field (political science) which is dominated by text. However, as we all know, SNA graphs can also be a bit misleading unless you are very good at reading them. The fact that node position doesn’t (necessarily) mean anything is a disadvantage, and once you have more than a few nodes, actually understanding link patterns is essentially impossible. Using a classic layout for my country relations SNA, for example, gives me this:

[Figure: country network in a classic force-directed layout, with degree as node size and colour]

Even with degree included as node size and colour, I still don’t find it very informative. One way of improving the situation is to give some meaning to the node positions, which is of course especially easy with countries. Displaying an SNA as a world map has two advantages in my opinion: everyone knows the names of a lot of the countries (which saves you having to label nodes), and you can also get a quick handle on any geographical patterns.

Python’s Basemap module can easily be combined with NetworkX. The key is to build your NetworkX graph’s pos dictionary as you build your overall node list, using Basemap to transform node coordinates. Of course you will need a list of country longitudes and latitudes, but there are plenty of those available.

Hence when I am building my overall graph, I do:

x, y = m(lon, lat)       # project longitude/latitude into map coordinates
G.add_node(country)      # add the country as a node
pos[country] = (x, y)    # record its projected position for drawing

where m is any Basemap projection and lon, lat are the coordinates of the country in question.
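
Putting it all together, a minimal self-contained sketch might look like this (the country coordinates and edges are made up for illustration):

import matplotlib.pyplot as plt
import networkx as nx
from mpl_toolkits.basemap import Basemap

# illustrative data only: a few (lon, lat) coordinates and links
coords = {'UK': (-2.0, 54.0), 'France': (2.0, 46.0), 'Germany': (10.0, 51.0)}
edges = [('UK', 'France'), ('France', 'Germany'), ('UK', 'Germany')]

m = Basemap(projection='robin', lon_0=0)
m.drawcoastlines()

G = nx.Graph()
pos = {}
for country, (lon, lat) in coords.items():
    x, y = m(lon, lat)        # project lon/lat into map coordinates
    G.add_node(country)
    pos[country] = (x, y)

G.add_edges_from(edges)
nx.draw_networkx(G, pos, node_size=100, font_size=8)
plt.show()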

[Figure: the same country network drawn over a world map]

Much nicer.

January 1st, 2013 | Basemap, NetworkX, Programming, Python

R-types

Types are a perennial headache in social science computing. One of the reasons I like perl is that it is so tolerant of variable types changing dynamically according to their context – it saves a lot of time when scripting and is also much easier to explain to students and beginners.

Yesterday, however, I ran into an even more annoying typing problem in R. I needed to export a large dataset (600,000 obs x 100 vars) which I only had in .RData format. Using write.table() on the whole thing quickly hits R’s upper memory buffer, so I set up a simple loop to divide up the file on the basis of area_id, a variable with around 40 unique values:
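
Something along these lines (a hypothetical reconstruction in R, with data as a stand-in name for the data frame – not the original code):

# hypothetical reconstruction, not the original snippet
for (country in unique(data$area_id)) {
  # subset the rows for this area and write them to their own file
  chunk <- subset(data, area_id == country)
  write.table(chunk, file = paste0("area_", country, ".txt"))
}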

What do I get? 40 files with nothing in them – subset() clearly isn’t working. So I took a closer look at the last value country took.

What’s happening? R doesn’t think country, or even country[1], is equivalent to 6. But when I assign country[1] to another variable (without making any explicit attempt to change types), then everything works. It’s not really clear to me why that should be. But this sort of typing difficulty is one of the things that puts beginners off, and I think that’s especially a shame in R, since the language should be oriented towards the needs of small-scale script writers.

November 20th, 2012 | Perl, Programming, Python, R, Social Science Computing