Solon Barocas: Big data are made by (and not just a resource for) social science and policy-making | The Internet, Policy & Politics Conferences

Solon Barocas, New York University

Abstract

Contrary to what the etymology of the word might suggest, data are never simply given or there for the taking. They are artifacts of human intervention, not records imparted by nature itself. And they are the result of mechanisms of observation, inscription, and representation that serve specific ends. In the case of big data, those ends are very often commercial: transactional records motivated by billing or other administrative requirements; user input that conforms to the data dictionary developed for a firm’s internal operations; sensors and sensor logs dictated by the particularities of industrial processes; etc. This should not lead us to discount the value of big data for social scientific research and policy-making. But it should enjoin social scientists and policy-makers to treat big data circumspectly and to inquire into the conditions that produce them.

This paper argues that big data are the product of a kind of science practiced by commercial developers of information systems well before they become potential evidence for scientists who are eager to subject them to more traditional academic study. In particular, this paper will show how the information systems that manufacture these data are shot through with specific ideas of a social world and sociality that they are innocently meant to mediate, but that they quite clearly shape. It will explain how the interfaces and infrastructure that generate big data affect, often purposefully, the behaviors that are exhibited on those systems. And it will argue that social scientists and policy-makers have an obligation to attend to these effects and to the nature of these systems before they adopt big data as evidence for their own research purposes.

Specific Issues

To give some substance to this more general claim, this paper will draw attention to three specific issues:

1. Prior ontological and epistemological commitments inhere in big data: The paper first returns to Philip Agre’s seminal work on what he called a ‘capture ontology’ and ‘grammars of action’. Agre pointed out that the process of developing an information system necessarily involves setting in place a particular representational order (an ontology, in the language of information science) that maps the salient entities and relationships within a specific domain or for a specific task. The mediation performed by information systems entails a process of translation that puts developers in the awkward position of modeling the behavior or activity that their systems are simply meant to support. But, as Agre argued, this mediation, by relying on the formal ontologies of information systems, imposes certain ‘grammars of action’ by specifying in advance the range and form of acceptable input and activity within an information system. That is, these grammars fix a set of possible categories of observation in advance, ensuring that information systems generate records of only those actions that the system has already deemed salient. As a result, the data that are captured tend to reinforce existing epistemological categories, producing evidence that circularly conforms to the model upon which the capture ontology and grammars of action were developed. Even an information system that appears eminently agnostic must select for certain behaviors, and it must render these behaviors in representational forms that comport with the aims of the system’s developer. If these systems inscribe intuitive models of human behavior into the very mechanisms that capture behavior in the ‘real world’, social scientists and policy-makers should not take the resulting data as a happy byproduct of the digital mediation of otherwise naturally occurring activities. The data are, at least in part, evidence of the purposeful design of the system that ‘happens’ to generate them.

2. Big data are already the result of careful research: This is most clear in cases of so-called A/B testing, when developers experiment with multiple, competing design choices ‘in the wild’. Web-based companies are especially well positioned to engage in this kind of testing: they can expose limited sets of their customers to different feature sets and interface designs when those customers arrive at their website, and then assess the relative impact of these choices on the desired behavior. This could be a form of usability testing, examining how easily users navigate the functionality of the system given the particular design choices. In many instances, however, the analysis has other or additional motivations: to arrive at a design that is most likely to cultivate the kind of behavior that suits the companies’ interests. The results of A/B testing tend to be interesting scientific and social scientific findings in their own right. Common examples include significant changes in the rate at which users sign up for a service or complete a task when certain elements of a website change color or when messages use different wording. Treating data that result from systems which are themselves the product of extensive A/B testing or experiments as records of naturally occurring behavior would miss that much of this behavior had been shaped with these earlier findings clearly in mind. Worse, researchers would not be able to determine what really accounts for this behavior, as they would only have access to data that resulted from the designs that won out. This goes beyond the now well-accepted argument that artifacts have politics: it reveals not only the degree to which (latent) politics are embodied in certain designs, but also how companies purposefully experiment to best inscribe explicit politics.

3. Research that draws on big data will furnish findings that commercial actors will be much better positioned to exploit: The findings that result from analyses of big data will be most immediately useful for the commercial actors from which the data sprang. This will be especially so in the case of behavioral research and research into decision-making. The very same research findings that suggest new levers for behavior change will be available to the commercial actors who collect and maintain the relevant data, and who can therefore act on these findings. Indeed, commercial actors will be much better positioned to pull these levers than anyone else. And they may do so to effect changes that are at odds with the intentions and moral commitments of the social scientists and policy-makers who performed the supporting research.
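
The selection effect described under the second issue can be sketched as a small simulation. This is only an illustration under stated assumptions, not a model of any real company's practice: the two designs, their sign-up rates (0.05 and 0.20), the visitor counts, and every function name below are hypothetical. The point it tries to show is that once a firm deploys the design that won its A/B test, the log available to an outside researcher contains no variation in design, so the researcher cannot tell how much of the observed behavior is due to the design itself.

```python
import random


def run_ab_test(rate_a, rate_b, visitors_per_arm, rng):
    """Expose equal numbers of visitors to two hypothetical designs.

    Returns the number of sign-ups observed under each design.
    """
    signups_a = sum(rng.random() < rate_a for _ in range(visitors_per_arm))
    signups_b = sum(rng.random() < rate_b for _ in range(visitors_per_arm))
    return signups_a, signups_b


def pick_winner(signups_a, signups_b):
    """The firm keeps whichever design produced more sign-ups."""
    return "B" if signups_b > signups_a else "A"


def production_log(design, rate, visitors, rng):
    """Records generated after deployment: only the winning design appears."""
    return [
        {"design": design, "signed_up": rng.random() < rate}
        for _ in range(visitors)
    ]


def designs_observed(log):
    """The set of designs an outside researcher can see in the log."""
    return {row["design"] for row in log}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    rates = {"A": 0.05, "B": 0.20}  # assumed, purely illustrative rates

    # Internal experiment: visible to the firm, not to outside researchers.
    signups_a, signups_b = run_ab_test(rates["A"], rates["B"], 5000, rng)
    winner = pick_winner(signups_a, signups_b)

    # Deployed system: the only data that later circulate as "big data".
    log = production_log(winner, rates[winner], 10000, rng)
    observed_rate = sum(row["signed_up"] for row in log) / len(log)

    print("designs visible in the log:", designs_observed(log))
    print("observed sign-up rate:", round(observed_rate, 3))
```

In this sketch the firm retains the comparison between designs, while the researcher sees a single sign-up rate with nothing to compare it against, which is one way of reading the claim that the data reflect "the designs that won out."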