data science, digital politics, smart cities...|jonathan.bright@oii.ox.ac.uk

What future ACA for political science?

About a year ago I was really convinced that automatic content analysis (ACA) was part of the future for political science. There is a lot of political text out there, and hand coding it all is difficult and time-consuming. ACA seemed to offer a lot of potential for new insights about agenda setting, political communication, etc.

Last week I attended a workshop on text analysis and migration politics organised by COMPAS. The team there had made a great effort to bring together people working on the technical aspects of ACA (from the field of, as I believe it is known, corpus linguistics) with people trying to apply it to interesting political science questions, especially in the field of migration. I presented a paper with my colleague Tom Nicholls.

Overall the workshop was really interesting. However, it did make me wonder whether ACA is going to play quite as key a role in the future of political science as I thought. Several things struck me:

  • ACA in theory eliminates the need for hand coding. But in practice doing ACA properly requires a lot of hand coding to create a training and validation dataset.
  • Getting relatively good results with a naive Bayes classifier that tackles a simple problem (e.g. topic classification) isn’t too technically challenging. But getting very good results is much more complex. Furthermore, the field of corpus linguistics is still very much experimenting to find the best techniques.
  • I’m not really sure how best to present or interpret the measures of accuracy (precision and recall) reported for ACA, nor how to feed them into a more typical statistical analysis.
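As a rough illustration of what those two accuracy measures capture, precision and recall for a single topic can be computed by comparing a classifier's output with a hand-coded validation set. The sketch below is in Python; the labels, the "migration" topic and the `precision_recall` helper are all invented for the example and are not data or code from the workshop.

```python
def precision_recall(hand_coded, predicted, topic):
    """Return (precision, recall) for one topic label.

    precision: of the texts the classifier labelled `topic`, the share
               that the hand coder also labelled `topic`.
    recall:    of the texts the hand coder labelled `topic`, the share
               that the classifier found.
    """
    pairs = list(zip(hand_coded, predicted))
    true_pos = sum(1 for h, p in pairs if h == topic and p == topic)
    false_pos = sum(1 for h, p in pairs if h != topic and p == topic)
    false_neg = sum(1 for h, p in pairs if h == topic and p != topic)

    labelled = true_pos + false_pos   # everything the classifier called `topic`
    relevant = true_pos + false_neg   # everything the hand coder called `topic`
    precision = true_pos / labelled if labelled else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical validation set: one hand-coded label and one predicted label per text.
hand_coded = ["migration", "migration", "economy", "migration",
              "economy", "health", "migration"]
predicted = ["migration", "economy", "economy", "migration",
             "migration", "health", "health"]

precision, recall = precision_recall(hand_coded, predicted, "migration")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Note that the calculation needs the hand-coded labels as its reference point, which is the first bullet's point about validation data in miniature.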

All in all I feel there’s a lot of potential here. But doing good ACA also requires a lot of hard work, and because its accuracy statistics are unfamiliar, I’m not sure the results will be accepted at face value by many political scientists.

November 12th, 2013 | Programming, Research, Social Science Computing

Research Skills for the World of Big Data

Last week I finished teaching a five-day workshop entitled ‘Research Skills for the World of Big Data’ at the EUI, which was a follow-up of sorts to a course I taught there in 2012. We had about 15 students, all interested in learning how Python programming skills can help with social research. I really enjoyed teaching the workshop and the feedback was very positive (average evaluation 9.11 / 10). It is a big challenge to teach programming to beginners, especially social scientists, who I think are used to different methods of learning. Nevertheless the students were committed and really smart, which makes things easier. I learnt a lot doing the workshop and will improve a few things next time.

R-types

Types are a perennial headache in social science computing. One of the reasons I like Perl is that it is so tolerant of variable types changing dynamically according to their context, which saves a lot of time when scripting and is also much easier to explain to students. It’s pretty clear which one of these is easier to explain to a beginner:

Yesterday, however, I ran into an even more annoying typing problem in R. I needed to export a large dataset (600,000 obs x 100 vars) which I only had in .RData. Using write.table() on the whole dataset quickly hits R’s memory limit. So I set up a simple loop to divide up the file on the basis of area_id, a variable with around 40 unique values:

What do I get? 40 files with nothing in them. Subset clearly isn’t working. A closer look at the last value country took turns this up:

What’s happening? R doesn’t think country or even country[1] are equivalent to 6. But when I assign country[1] to another variable (without making any explicit attempt to change types) then everything works. It’s not really clear to me why that should be. But this sort of typing difficulty is one of the things that puts beginners off, and I think it’s especially a shame in R, since the language should be oriented towards the needs of people writing small scripts.
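The R session itself is not reproduced here, so what follows is not a diagnosis of it. It is only a loosely analogous sketch, in Python, of the general trap: a filter that compares values of different types can silently match nothing, with no error to point at the cause. The sample data and the `rows_for_area` helper are invented for illustration.

```python
import csv
import io

# Invented sample data. Values parsed from CSV text arrive as strings.
raw = "area_id,value\n6,10\n6,12\n7,3\n"
rows = list(csv.DictReader(io.StringIO(raw)))


def rows_for_area(rows, area_id):
    """Return the rows whose area_id equals the given value, with no type coercion."""
    return [r for r in rows if r["area_id"] == area_id]


empty = rows_for_area(rows, 6)       # the int 6 never equals the string "6"
matched = rows_for_area(rows, "6")   # comparing like with like works

# One defensive option: convert explicitly before comparing.
converted = [r for r in rows if int(r["area_id"]) == 6]

print(len(empty), len(matched), len(converted))
```

Writing one file per group from a filter like the first one would produce files with headers and no rows, without any warning that the comparison was at fault.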

November 20th, 2012 | Perl, Programming, Python, R, Social Science Computing