
TICTeC 2015

A couple of weeks ago I gave a presentation at TICTeC, mySociety’s inaugural research conference on the impact of civic technology. It was an inspiring event, with so many presentations from different organisations trying to make a difference in countries all over the world.


There were a few academics there and hopefully we added some value too. I gave a presentation on a current project we are running with ULB exploring the dynamics of the website lapetition.be.

[Figure: threshold scatter plot]

It was interesting, however, to see how differently academia and civic tech conceptualise research, with us academics coming in for some stick for taking years to produce research, which makes it difficult to integrate into the development of new tools. But there were also lots of good examples of researchers working with civic tech organisations to try out new ways of reaching people or to do research on impact – this sort of stuff is the future of political science, in my opinion.

Point size legends in matplotlib and basemap plots

Python’s matplotlib and basemap can do a lot. I think I still prefer R’s ggplot, but I much prefer manipulating data in Python and sometimes it’s more natural to do the analysis in the same language.

Recently I have been combining the two packages to create maps of events happening around the world. The size and colour of the marker on the map can be varied, meaning it’s possible to fit quite a lot of information in one graphic.
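To give a flavour of the setup (this is not the code from the project – the event coordinates, counts and scores below are invented placeholders), a minimal sketch might look like this:

import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

# Invented example data: event locations, with a count (marker size) and a score (marker colour)
lons = [-0.13, 2.35, 139.69, -74.01]
lats = [51.51, 48.86, 35.68, 40.71]
counts = [40, 120, 300, 80]
scores = [0.2, 0.5, 0.9, 0.4]

m = Basemap(projection='robin', lon_0=0)
m.drawcoastlines(linewidth=0.5)
m.fillcontinents(color='lightgray', zorder=0)

x, y = m(lons, lats)  # project lon/lat into map coordinates
sc = m.scatter(x, y, s=counts, c=scores, cmap='YlOrRd',
               alpha=0.8, edgecolors='none', zorder=5)
plt.show()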

[Figure: map of Meetup events]

However, one thing I really struggled with was the legend. If you use differently coloured points, matplotlib makes it easy to add a colour bar, with something like:

# Assumes a mappable has already been drawn, e.g. a scatter plot with c= set
c = plt.colorbar(orientation='vertical', shrink=0.5)
c.set_label("My Title")

The shrink argument gives you a quick way of adjusting the size of the bar relative to the graphic.

However, I couldn’t find an equivalent command which would give me a legend for the size of the points (something that ggplot does easily). After fiddling with get_label() for ages, and trying to capture and do something useful with the results of plt.scatter(), I finally came across this useful post, which basically says that this feature doesn’t really exist: if you want such a legend you have to make it yourself. The trick, however, is quite simple – draw three or four points on your plot with their location set to [], [] (so they won’t actually show up), each one representing a certain size in your scale. These points can then be passed to plt.legend() along with some hand-written labels. Overall it looks something like this:

# Invisible "dummy" points, one for each size in the scale
l1 = plt.scatter([], [], s=10, edgecolors='none')
l2 = plt.scatter([], [], s=50, edgecolors='none')
l3 = plt.scatter([], [], s=100, edgecolors='none')
l4 = plt.scatter([], [], s=200, edgecolors='none')

labels = ["10", "50", "100", "200"]

# Pass the dummy points and the hand-written labels to the legend
leg = plt.legend([l1, l2, l3, l4], labels, ncol=4, frameon=True, fontsize=12,
                 handlelength=2, loc=8, borderpad=1.8,
                 handletextpad=1, title='My Title', scatterpoints=1)

The results:

[Figure: map with point size legend]

Well, I still think this should be easier, but at least it works, and it gives you a lot of flexibility over what goes in the legend.

What future for ACA in political science?

About a year ago I was really convinced that automatic content analysis (ACA) was part of the future for political science. There is a lot of political text out there, and hand coding it all is difficult and time consuming. ACA seemed to offer a lot of potential for new insights about agenda setting, political communication, and so on.

Last week I attended a workshop on text analysis and migration politics organised by COMPAS. The team there had made a great effort to bring together people working on the technical aspects of ACA (from the field of, I believe, corpus linguistics) with people trying to apply it to interesting political science questions, especially in the field of migration. I presented a paper with my colleague Tom Nicholls.

Overall the workshop was really interesting; however, it did make me wonder whether ACA is going to play quite as key a role in the future of political science as I had thought. Several things struck me:

  • ACA in theory eliminates the need for hand coding. But in practice doing ACA properly requires a lot of hand coding to create a training and validation dataset.
  • Getting relatively good results with a naïve Bayes classifier on a simple problem (e.g. topic classification) isn’t too technically challenging (a minimal illustration follows this list). But getting very good results is much more complex, and the field of corpus linguistics is still very much experimenting to find the best techniques.
  • I’m not really sure how best to present or interpret the measures of accuracy (precision and recall) produced by ACA, nor how to feed them into a more typical statistical analysis.
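
To make the last two points a bit more concrete, here is a minimal, illustrative sketch using scikit-learn – the toy documents and topic labels are invented, and this is not the pipeline from our paper:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score

# Invented toy data: hand-coded documents with topic labels (1 = migration, 0 = other)
train_docs = ["asylum applications rose sharply", "the budget deficit widened",
              "new visa rules for migrant workers", "interest rates held steady"]
train_labels = [1, 0, 1, 0]
test_docs = ["migrant arrivals fell last month", "the central bank cut rates"]
test_labels = [1, 0]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_docs), train_labels)  # train on the hand-coded set

pred = clf.predict(vec.transform(test_docs))          # classify the validation set
print("precision:", precision_score(test_labels, pred))
print("recall:   ", recall_score(test_labels, pred))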

All in all I feel there’s a lot of potential here. But doing good ACA also requires a lot of hard work, and the unfamiliar statistics used to measure its accuracy mean that I’m not sure the results will be accepted at face value by many political scientists.


Research Skills for the World of Big Data

Last week I finished teaching a five-day workshop entitled ‘Research Skills for the World of Big Data’ at the EUI, which was something of a follow-up to a course I taught there in 2012. We had about 15 students along, interested in learning how Python programming skills can help with social research. I really enjoyed teaching the workshop and the feedback was very positive (average evaluation 9.11/10). It is a big challenge to teach programming to beginners, especially social scientists, who I think are used to different methods of learning. Nevertheless the students were committed and really smart, which makes things easier. I learnt a lot doing the workshop and will improve a few things next time.


Combining Python’s Basemap and NetworkX

Recently I have been involved with a project mapping relationships between countries in terms of a social network. There are a lot of social network analysis packages around; I prefer Python’s NetworkX largely because I’m already so used to Python.

The first thing everyone wants to see when doing SNA is the network graph – understandable, of course, as they look visually attractive and are a welcome respite in a field (political science) which is dominated by text. However, as we all know, SNA graphs can also be a bit misleading unless you are very good at reading them. The fact that node position doesn’t (necessarily) mean anything is a disadvantage, and once you have more than a few nodes, actually understanding link patterns is essentially impossible. Using a classic layout for my country relations SNA, for example, gives me this:

[Figure: country SNA with a classic network layout]

Even with degree included as node size and colour, I still don’t find it very informative. One way of improving the situation is to give some meaning to the node position, which is of course especially easy with countries. Displaying an SNA as a world map has two advantages in my opinion: everyone knows the names of a lot of the countries (which saves you having to label nodes), and you can also get a quick handle on any geographical patterns.

Python’s Basemap module can easily be combined with NetworkX. The key is to build your NetworkX graph’s ‘pos’ dictionary as you are building your overall node list, using Basemap to transform node coordinates. Of course you will need a list of country longitudes and latitudes, but there are plenty of those available.

Hence when I am building my overall graph, I do:

x, y = m(lon, lat)       # project the country's lon/lat into map coordinates
G.add_node(country)
pos[country] = (x, y)    # store the projected position for drawing later

where m is any Basemap projection and lon, lat are the coordinates of the country in question.
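
Putting this together, a minimal self-contained sketch might look something like the following – the three countries, their coordinates and the ties between them are made-up placeholders, not the project data:

import matplotlib.pyplot as plt
import networkx as nx
from mpl_toolkits.basemap import Basemap

# Made-up example: a few countries with (lon, lat) coordinates and some ties between them
coords = {"UK": (-1.5, 52.5), "Brazil": (-51.9, -14.2), "Japan": (138.3, 36.2)}
edges = [("UK", "Brazil"), ("UK", "Japan")]

m = Basemap(projection='robin', lon_0=0)
m.drawcoastlines(linewidth=0.5)
m.fillcontinents(color='lightgray', zorder=0)

G = nx.Graph()
pos = {}
for country, (lon, lat) in coords.items():
    x, y = m(lon, lat)      # project lon/lat into map coordinates
    G.add_node(country)
    pos[country] = (x, y)
G.add_edges_from(edges)

# Draw the network on top of the map using the projected node positions
nx.draw_networkx_nodes(G, pos, node_size=100, node_color='red')
nx.draw_networkx_edges(G, pos, alpha=0.5)
plt.show()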

[Figure: country SNA drawn on a world map]

Much nicer.


R-types

Types are a perennial headache in social science computing. One of the reasons I like Perl is that it is so tolerant of variable types changing dynamically according to their context – this saves a lot of time when scripting and is also much easier to explain to students. It’s pretty clear which one of these is easier to explain to a beginner:

Yesterday, however, I ran into an even more annoying typing problem in R. I needed to export a large dataset (600,000 obs x 100 vars) which I only had in .RData format. Using write.table() on the whole thing quickly hits R’s upper memory buffer. So I set up a simple loop to divide up the file on the basis of area_id, a variable with around 40 unique values:

What do I get? 40 files with nothing in them. subset() clearly isn’t working. A closer look at the last value country took turned this up:

What’s happening? R doesn’t think country, or even country[1], is equivalent to 6. But when I assign country[1] to another variable (without making any explicit attempt to change types), everything works. It’s not really clear to me why that should be. But this sort of typing difficulty is one of the things that puts beginners off, and I think it’s especially a shame in R, since this language should be oriented towards the needs of small script writers.
