The Internet, Policy & Politics Conferences

Oxford Internet Institute, University of Oxford

Menchen-Trevino: Collecting Vertical Big Data: Big Possibilities and Big Challenges

This paper has been published as: Ericka Menchen-Trevino (2013) Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy and Internet 5 (3) 328-339.

Ericka Menchen-Trevino, Erasmus University, Rotterdam

Abstract

Datasets that consist of digital imprints are often a mile wide and an inch deep, providing broad but shallow evidence of particular patterns of behavior. Twitter data are an example of this type of horizontal big data. Tweets can tell researchers how users interact on Twitter, reveal the collective mood of users, and offer many other valuable insights. However, this is one small piece of how people use the web in their everyday lives (a similar point is made by boyd & Crawford, 2011). By refocusing from websites to people, researchers can collect big data from individuals. These vertical big data illuminate digital behavior in great depth and provide rich contextual information. Vertical data pose big challenges as well, but tackling them may enrich our understanding of the digital world far beyond what horizontal data alone can provide. This article will explore evidence from a vertical big data project that I completed as a way to think about the analytical possibilities of vertical data in general.

In 2010 I co-created Roxy, a web application that captures the browsing logs and full HTML text of research participants’ web browsing (see Menchen-Trevino & Karr). I used Roxy to collect the browsing behavior of 41 carefully selected participants during the final eight weeks of the U.S. midterm elections of 2010, resulting in a dataset with over 13 GB of text. I used the application myself to make sure it was working properly, and so that I could check the analysis procedures using familiar data. This dataset is focused on individuals, collected with fully informed consent, and contains information that can show contributions to horizontal datasets. Data that meet these three criteria fit my definition of vertical big data.

When appropriate sampling techniques are used, vertical data can provide context to extend horizontal datasets. Web ratings data companies such as Nielsen NetRatings, ComScore and HitWise have datasets that could be used for these purposes; however, they tend to allow researchers access only to aggregate data, they do not collect the full text of the web pages accessed, and their recruitment methods and data collection software are kept as trade secrets. One of the biggest challenges to collecting vertical big data is how to collect it in an ethical way. Roxy allows the participant to opt in and out of data collection at any time, and automatically offers the choice of a logged or a private session after 30 minutes of internet inactivity. Not only does this provide a robust informed consent process and respect the autonomy of participants, it offers them shades of grey in place of what is often a black and white choice: to use logging software or not. This could make participation possible for those who would agree to have most but not all of their browsing logged. In my study participants chose a logged rather than a private session in 91 percent of the cases when they were prompted to make this choice.
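
The session choice described above can be illustrated with a short sketch. This is not Roxy's actual implementation; the function names and the "logged"/"private" labels are hypothetical, and only the 30-minute inactivity threshold comes from the text.

```python
from datetime import datetime, timedelta

# The 30-minute threshold is taken from the text.
# Everything else in this sketch is a hypothetical illustration.
INACTIVITY_LIMIT = timedelta(minutes=30)


def needs_session_prompt(last_activity: datetime, now: datetime) -> bool:
    """Return True when the participant should be asked to choose a session type."""
    return now - last_activity >= INACTIVITY_LIMIT


def should_log(session_choice: str) -> bool:
    """Record a page request only when the participant chose a logged session."""
    return session_choice == "logged"
```

Under these assumptions, a participant who returns after a long break is prompted again, and nothing is recorded during a session they marked as private.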

Aligning horizontal and vertical big data is just one way that vertical data can advance online research methods. The goal of designing Roxy was not only to collect observational web data, but to ask participants questions about the observed behavior such as, “I see that you went to the campaign websites of both candidates for Senate on Election Day; can you tell me more about that?” In order to accomplish this, and to comply with the ethical principles of research with human subjects, I designed a fully informed consent process for Roxy. This allows for data collection not just across websites, but about offline behavior and attitudes as well. In my study I administered eight weekly surveys about news interest, compared reported interest with online browsing, and asked participants to tell me more about cases where interest was high but browsing was not logged.
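
The comparison of reported interest with logged browsing can be sketched as follows. This is an illustrative assumption rather than the procedure used in the study: the interest scale, the threshold, and the data layout are all hypothetical.

```python
def weeks_needing_follow_up(reported_interest, logged_page_counts, high_interest=4):
    """Return the weeks where survey interest was high but no browsing was logged.

    reported_interest: dict of week number -> interest score (hypothetical 1-5 scale)
    logged_page_counts: dict of week number -> count of logged news page views
    high_interest: hypothetical cutoff for treating a score as "high"
    """
    return sorted(
        week
        for week, score in reported_interest.items()
        # A week missing from the logs is treated as having no logged browsing.
        if score >= high_interest and logged_page_counts.get(week, 0) == 0
    )
```

The weeks returned by such a function are candidates for interview questions, since a survey and a browsing log alone cannot explain why the two disagree.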

An additional contribution of vertical big data is to help individuals understand their own online behavior. I collected my own web browsing behavior and my data showed patterns that I recognized, but had not given much conscious thought to. Also, when I shared information from their Roxy logs with participants during the interviews, they were often quite interested in gaining further insights into their own web use. This is a potential direct benefit to participants.

None of this is news to researchers in the private sector. They have long realized the value of combining horizontal and vertical data (with or without fully informed consent). This project demonstrates a path forward for social scientists to collect vertical data in accordance with explicit norms of research ethics. Other researchers are also developing tools to collect vertical big data, such as ContextLogger, which collects certain types of data from mobile phones (Hasu, 2012). A recent survey by the Pew Internet & American Life Project in cooperation with Facebook asked survey participants to provide their Facebook records to the project by furnishing their email address (Hampton, Goulet, Marlow, & Rainie, 2012), enabling a multi-method approach.

Vertical data are often difficult to collect compared to horizontal data, which are simply a byproduct of digital technology use. This type of data collection requires building relationships of mutual trust and respect with participants; however, the challenges involved are similar to those of surveys and lab-based studies. Advancing research technology for collecting vertical big data requires a willingness to deal with ethical challenges along with technological difficulties, but the potential contribution of these types of datasets is quite promising.

References

boyd, d., & Crawford, K. (2011). Six Provocations for Big Data. SSRN eLibrary. doi: 10.2139/ssrn.1926431

Hampton, K. N., Goulet, L. S., Marlow, C., & Rainie, L. (2012). Why most Facebook users get more than they give: The effect of Facebook 'power users' on everybody else. Washington D.C.: Pew Research Center's Internet & American Life Project.

Hasu, T. (2012). ContextLogger (Version 2). Retrieved from http://contextlogger.org/

Menchen-Trevino, E., & Karr, C. Researching real-world Web use with Roxy: Collecting observational Web data with informed consent. Journal of Information Technology & Politics. doi: 10.1080/19331681.2012.664966