30 May 2017

Using Open Government Data to predict sense of local community

Advocates hope that opening government data will increase government transparency, catalyse economic growth, address social and environmental challenges. Image by the UK’s Open Data Institute.

Community-based approaches are widely employed in programmes that monitor and promote socioeconomic development. And building the “capacity” of a community — i.e. the ability of people to act individually or collectively to benefit the community — is key to these approaches. The various definitions of community capacity all agree that it comprises a number of dimensions — including opportunities and skills development, resource mobilization, leadership, participatory decision making, etc. — all of which can be measured in order to understand and monitor the implementation of community-based policy. However, measuring these dimensions (typically using surveys) is time consuming and expensive, and the absence of such measurements is reflected in a greater focus in the literature on describing the process of community capacity building, rather than on describing how it’s actually measured.

A cheaper way to measure these dimensions, for example by applying predictive algorithms to existing secondary data like socioeconomic characteristics, socio-demographics, and condition of housing stock, would certainly help policy makers gain a better understanding of local communities. In their Policy & Internet article “Predicting Sense of Community and Participation by Applying Machine Learning to Open Government Data”, Alessandro Piscopo, Ronald Siebes, and Lynda Hardman employ a machine-learning technique (“Random Forests”) to evaluate an estimate of community capacity derived from open government data, and determine the most important predictive variables.

The resulting models were found to be more accurate than those based on traditional statistics, demonstrating the feasibility of the Random Forests technique for this purpose — being accurate, able to deal with small data sets and nonlinear data, and providing information about how each variable in the dataset contributes to predictive accuracy.

We caught up with the authors to discuss their findings:

Ed.: Just briefly: how did you do the study? Were you essentially trying to find which combinations of variables available in Open Government Data predicted “sense of community and participation” as already measured by surveys?

Authors: Our research stemmed from an observation of the measures of social characteristics available. These are generally obtained through expensive surveys, so we asked ourselves “how could we generate them in a more economic and efficient way?” In recent years, the UK government has openly released a wealth of datasets, which could be used to provide information for other purposes — in our case, providing measures of sense of community and participation — than those for which they had been created. We started our work by consulting papers from the social science domain, to understand which factors were associated to sense of community and participation. Afterwards, we matched the factors that were most commonly mentioned in the literature with “actual” variables found in UK Open Government Data sources.

Ed.: You say “the most determinant variables in our models were only partially in agreement with the most influential factors for sense of community and participation according to the social science literature” — which were they, and how do you account for the discrepancy?

Authors: We observed two types of discrepancy. The first was the case of variables that had roughly the same level of importance in our models and in others previously developed, but with a different rank. For instance, median age was by far the most determinant variable in our model for sense of community. This variable was not ranked among the top five variables in the literature, although it was listed among the significant variables.

The second type of discrepancy regarded variables which were highly important in our models and not influential in others, or vice versa. An example is the socioeconomic status of residents of a neighbourhood, which appeared to have no effect on participation in prior studies, but was the top-ranking variable in our participation model (operationalised as the number of people in intermediate occupation).

We believe that there are multiple explanations for these phenomena, all of which deserve further investigation. First, highly determinant predictors in conventional statistical models have been proven to have little or no importance in ensemble algorithms, such as the one we used [1]. Second, factors influencing sense of community and civic participation may vary according to the context (e.g. different countries; see [3] about sense of community in China for an example). Finally, different methods may measure different aspects related to a socially meaningful concept, leading to different partial explanations.

Ed.: What were the predictors for “lack of community” — i.e. what would a terrible community look like, according to your models?

Authors: Our work did not really focus on finding “good” and “bad” communities. However, we did notice some characteristics that were typical of communities with low sense of community or participation in our dataset. For example, sense of community had a strong negative correlation with work and stores accessibility, with ethnic fragmentation, and with the number of people living in the UK for less than 10 years. On the other hand, it was positively correlated with the age of residents. Participation, instead, was negatively correlated with household composition and occupation of its residents, whilst it had a positive relation with their level of education and the weekly worked hours. Of course, these data would require to be interpreted by a social scientist, in order to properly contextualise and understand them.

Ed.: Do you see these techniques as being more useful to highlight issues and encourage discussion, or actually being used in planning? For example, I can see it might raise issues if machine-learning models “proved” that presence of immigrant populations, or neighbourhoods of mixed economic or ethnic backgrounds, were less cohesive than homogeneous ones (not sure if they are?).

Authors: How machine learning algorithms work is not always clear, even to specialists, and this has led some people to describe them as “black boxes”. We believe that models like those we developed can be extremely useful to challenge existing perspectives based on past data available in the social science literature, e.g. they can be used to confirm or reject previous measures in the literature. Additionally, machine learning models can serve as indicators that can be more frequently consulted: they are cheaper to produce, we can use them more often, and see whether policies have actually worked.

Ed.: It’s great that existing data (in this case, Open Government Data) can be used, rather than collecting new data from scratch. In practice, how easy is it to repurpose this data and build models with it — including in countries where this data may be more difficult to access? And were there any variables you were interested in that you couldn’t access?

Authors: Identifying relevant datasets and getting hold of them was a lengthy process, even in the UK, where plenty of work has been done to make government data openly available. We had to retrieve many datasets from the pages of the government department that produced them, such as the Department for Work and Pensions or the Home Office, because we could not find them through the portal data.gov.uk. Next to this, the ONS website was another very useful resource, which we used to get census data.

The hurdles encountered in gathering the data led us to recommend the development of methods that would be able to more automatically retrieve datasets from a list of sources and select the ones that provide the best results for predictive models of social dimensions.

Ed.: The OII has done some similar work, estimating the local geography of Internet use across Britain, combining survey and national census data. The researchers said the small-area estimation technique wasn’t being used routinely in government, despite its power. What do you think of their work and discussion, in relation to your own?

Authors: One of the issues we were faced with in our research was the absence of nationwide data about sense of community and participation at a neighbourhood level. The small area estimation approach used by Blank et al., 2017 [2] could provide a suitable solution to the issue. However, the estimates produced by their approach understandably incorporate a certain amount of error. In order to use estimated values as training data for predictive models of community measures it would be key to understand how this error would be propagated to the predicted values.

[1] Berk, R. 2006. “ An Introduction to Ensemble Methods for Data Analysis.” Sociological Methods & Research 34 (3): 263–95.
[2] Blank, G., Graham, M., and Calvino, C. 2017. Local Geographies of Digital Inequality. Social Science Computer Review. DOI: 10.1177/0894439317693332.
[3] Xu, Q., Perkins, D.D. and Chow, J.C.C., 2010. Sense of community, neighboring, and social capital as predictors of local political participation in China. American journal of community psychology, 45(3-4), pp.259-271.

Read the full article: Piscopo, A., Siebes, R. and Hardman, L. (2017) Predicting Sense of Community and Participation by Applying Machine Learning to Open Government Data. Policy & Internet 9 (1) doi:10.1002/poi3.145.


Alessandro Piscopo, Ronald Siebes, and Lynda Hardman were talking to blog editor David Sutcliffe.


Note: This article gives the views of the authors, and not the position of the Policy and Internet Blog, nor of the Oxford Internet Institute.