surveys – The Policy and Internet Blog
Understanding public policy online
https://ensr.oii.ox.ac.uk

Did you consider Twitter’s (lack of) representativeness before doing that predictive study?
https://ensr.oii.ox.ac.uk/did-you-consider-twitters-lack-of-representativeness-before-doing-that-predictive-study/
Mon, 10 Apr 2017 06:12:36 +0000

Twitter data have many qualities that appeal to researchers. They are extraordinarily easy to collect. They are available in very large quantities. And with a simple 140-character text limit they are easy to analyze. As a result of these attractive qualities, over 1,400 papers have been published using Twitter data, including many attempts to predict disease outbreaks, election results, film box office gross, and stock market movements solely from the content of tweets.

Easy availability of Twitter data links nicely to a key goal of computational social science. If researchers can find ways to impute user characteristics from social media, then the capabilities of computational social science would be greatly extended. However, few papers consider the digital divide among Twitter users. Yet the question of who uses Twitter has major implications for research that uses the content of tweets to make inferences about population behaviour. Do Twitter users share identical characteristics with the population of interest? For what populations are Twitter data actually appropriate?

A new article by Grant Blank published in Social Science Computer Review provides a multivariate empirical analysis of the digital divide among Twitter users, comparing Twitter users and nonusers with respect to their characteristic patterns of Internet activity and to certain key attitudes. It thereby fills a gap in our knowledge about an important social media platform, and it joins a surprisingly small number of studies that describe the population that uses social media.

Comparing British (OxIS survey) and US (Pew) data, Grant finds that generally, British Twitter users are younger, wealthier, and better educated than other Internet users, who in turn are younger, wealthier, and better educated than the offline British population. American Twitter users are also younger and wealthier than the rest of the population, but they are not better educated. Twitter users are disproportionately members of elites in both countries. Twitter users also differ from other groups in their online activities and their attitudes.

Under these circumstances, any collection of tweets will be biased, and inferences based on analysis of such tweets will not match the population characteristics. A biased sample can’t be corrected by collecting more data; and these biases have important implications for research based on Twitter data, suggesting that Twitter data are not suitable for research where representativeness is important, such as forecasting elections or gaining insight into attitudes, sentiments, or activities of large populations.
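
The point that more data cannot repair a biased sample can be illustrated with a small simulation. This is a minimal sketch, not taken from the article: the age split, the support rates, and the platform skew are all invented numbers, and `simulate_bias` is a hypothetical helper name. It draws ever larger samples from a platform whose users skew young and compares the estimate with the true population rate.

```python
import random


def simulate_bias(sample_sizes, seed=0):
    """Estimate a population rate from a skewed platform sample.

    All numbers below are invented for illustration only.
    """
    rng = random.Random(seed)

    # Assumed population: 30% young, 70% older.
    p_young = 0.30
    # Assumed rates of holding some opinion in each group.
    rate_young, rate_old = 0.60, 0.40
    true_rate = p_young * rate_young + (1 - p_young) * rate_old

    # Assumed platform skew: 70% of platform users are young.
    p_young_on_platform = 0.70

    estimates = {}
    for n in sample_sizes:
        hits = 0
        for _ in range(n):
            # Draw one platform user, then draw that user's opinion.
            is_young = rng.random() < p_young_on_platform
            rate = rate_young if is_young else rate_old
            hits += rng.random() < rate
        estimates[n] = hits / n
    return true_rate, estimates


true_rate, estimates = simulate_bias([100, 10_000, 100_000])
print(f"true population rate: {true_rate:.3f}")
for n, est in estimates.items():
    print(f"n={n:>7}: platform estimate={est:.3f}, error={est - true_rate:+.3f}")
```

Under these assumptions the estimate settles near the platform's own rate rather than the population's: a larger sample reduces the random noise but leaves the gap caused by who is on the platform.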

Read the full article: Blank, G. (2016) The Digital Divide Among Twitter Users and Its Implications for Social Research. Social Science Computer Review. DOI: 10.1177/0894439316671698

We caught up with Grant to explore the implications of the findings:

Ed.: Despite your cautions about lack of representativeness, you mention that the bias in Twitter could actually make it useful to study (for example) elite behaviours: for example in political communication?

Grant: Yes. If you want to study elites and channels of elite influence then Twitter is a good candidate. Twitter data could be used as one channel of elite influence, along with other online channels like social media or blog posts, and offline channels like mass media or lobbying. There is an ecology of media and Twitter is one part.

Ed.: You also mention that Twitter is actually quite successful at forecasting certain offline, commercial behaviours (e.g. box office receipts).

Grant: Right. Some commercial products are disproportionately used by wealthier or younger people. That certainly would include certain forms of mass entertainment like cinema. It also probably includes a number of digital products like smartphones, especially more expensive phones, and wearable devices like a Fitbit. If a product is disproportionately bought by the same population groups that use Twitter then it may be possible to forecast sales using Twitter data. Conversely, products disproportionately used by poorer or older people are unlikely to be predictable using Twitter.

Ed.: Is there a general trend towards abandoning expensive, time-consuming, multi-year surveys and polling? And do you see any long-term danger in that? i.e. governments and media (and academics?) thinking “Oh, we can just get it off social media now”.

Grant: Yes and no. There are certainly people who are thinking about it and trying to make it work. The ease and low cost of social media is very seductive. However, that has to be balanced against major weaknesses. First, the population using Twitter (and other social media) is unclear, but it is not a random sample. It is just a population of Twitter users, which is not a population of interest to many.

Second, tweets are even less representative. As I point out in the article, over 40% of people with a Twitter account have never sent a tweet, and the top 15% of users account for 85% of tweets. So tweets are even less representative of any real-world population than Twitter users. What these issues mean is that you can’t calculate measures of error or confidence intervals from Twitter data. This is crippling for many academic and government uses.
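
The concentration Grant describes can be measured directly from per-account tweet counts. The sketch below is an illustration only: `silent_share` and `top_share` are hypothetical helper names, and the toy counts are invented so that they roughly mirror the proportions quoted above. They are not data from the article.

```python
def silent_share(counts):
    """Fraction of accounts that have never tweeted."""
    return sum(1 for c in counts if c == 0) / len(counts)


def top_share(counts, fraction):
    """Fraction of all tweets sent by the most active `fraction` of accounts."""
    ranked = sorted(counts, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # number of top accounts
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Invented toy data: tweet counts for 20 accounts.
toy_counts = [50, 25, 10, 3, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(f"silent accounts: {silent_share(toy_counts):.0%}")
print(f"tweets from top 15% of accounts: {top_share(toy_counts, 0.15):.0%}")
```

When a few accounts produce most of the tweets, a sample of tweets mostly reflects those few accounts, which is why tweets are further from any real-world population than Twitter users are.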

Third, Twitter’s limited message length and simple interface tend to give it advantages on devices with restricted input capability, like phones. It is well-suited for short, rapid messages. These characteristics tend to encourage Twitter use for political demonstrations, disasters, sports events, and other live events where reports from an on-the-spot observer are valuable. This suggests that Twitter usage is not like other social media or like email or blogs.

Fourth, researchers attempting to extract the meaning of words have 140 characters to analyze and they are littered with abbreviations, slang, non-standard English, misspellings and links to other documents. The measurement issues are immense. Measurement is hard enough in surveys when researchers have control over question wording and can do cognitive interviews to understand how people interpret words.

With Twitter (and other social media) researchers have no control over the process that generated the data, and no theory of the data generating process. Unlike surveys, social media analysis is not a general-purpose tool for research. Except in limited areas where these issues are less important, social media is not a promising tool.

Ed.: How would you respond to claims that for example Facebook actually had more accurate political polling than anyone else in the recent US Election? (just that no-one had access to its data, and Facebook didn’t say anything)?

Grant: That is an interesting possibility. The problem is matching Facebook data with other data, like voting records. Facebook doesn’t know where people live. Finding their location would not be an easy problem. It is simpler because Facebook would not need an actual address; it would only need to locate the correct voting district or the state (for the Electoral College in US Presidential elections). Still, there would be error of unknown magnitude, probably impossible to calculate. It would be a very interesting research project. Whether it would be more accurate than a poll is hard to say.

Ed.: Do you think social media (or maybe search data) scraping and analysis will ever successfully replace surveys?

Grant: Surveys are such versatile, general-purpose tools. They can be used to elicit many kinds of information on all kinds of subjects from almost any population. These are not characteristics of social media. There is no real danger that surveys will be replaced in general.

However, I can see certain specific areas where analysis of social media will be useful. Most of these are commercial areas, like consumer sentiments. If you want to know what people are saying about your product, then going to social media is a good, cheap source of information. This is especially true if you sell a mass market product that many people use and talk about; think: films, cars, fast food, breakfast cereal, etc.

These are important topics to some people, but they are a subset of things that surveys are used for. Too many things are not talked about, and some are very important. For example, there is the famous British reluctance to talk about money. Things like income, pensions, and real estate or financial assets are not likely to be common topics. If you are a government department or a researcher interested in poverty, the effect of government assistance, or the distribution of income and wealth, you have to depend on a survey.

There are a lot of other situations where surveys are indispensable. For example, if the OII wanted to know what kind of jobs OII alumni had found, it would probably have to survey them.

Ed.: Finally ... 1,400 Twitter articles in ... do we actually know enough now to say anything particularly useful or concrete about it? Are we creeping towards a Twitter revelation or consensus, or is it basically 1,400 articles saying “it’s all very complicated”?

Grant: Mostly researchers have accepted Twitter data at face value. Whatever people write in a tweet, it means whatever the researcher thinks it means. This is very easy and it avoids a whole collection of complex issues. All the hard work of understanding how meaning is constructed in Twitter and how it can be measured is yet to be done. We are a long way from understanding Twitter.



Grant Blank was talking to blog editor David Sutcliffe.

Estimating the Local Geographies of Digital Inequality in Britain: London and the South East Show Highest Internet Use, But Why?
https://ensr.oii.ox.ac.uk/estimating-the-local-geographies-of-digital-inequality-in-britain/
Wed, 01 Mar 2017 11:39:54 +0000

Despite the huge importance of the Internet in everyday life, we know surprisingly little about the geography of Internet use and participation at sub-national scales. A new article on Local Geographies of Digital Inequality by Grant Blank, Mark Graham, and Claudio Calvino published in Social Science Computer Review proposes a novel method to calculate the local geographies of Internet usage, employing Britain as an initial case study.

In the first attempt to estimate Internet use at any small-scale level, they combine data from a sample survey, the 2013 Oxford Internet Survey (OxIS), with the 2011 UK census, employing small area estimation to estimate Internet use in small geographies in Britain. (Read the paper for more on this method, and discussion of why there has been little work on the geography of digital inequality.)
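
The general idea of combining a rich survey with census counts can be sketched as follows. This is a deliberately simplified illustration, not the method used in the paper, which the authors describe in full there: it estimates an Internet use rate for each demographic cell from survey responses, then weights those rates by each area's census cell counts. The function names, the two age cells, and all numbers are invented for the example.

```python
from collections import defaultdict


def cell_rates(survey):
    """Internet use rate per demographic cell.

    `survey` is an iterable of (cell, uses_internet) pairs.
    """
    totals = defaultdict(int)
    users = defaultdict(int)
    for cell, uses_internet in survey:
        totals[cell] += 1
        users[cell] += int(uses_internet)
    return {cell: users[cell] / totals[cell] for cell in totals}


def area_estimate(rates, census_counts):
    """Weight the cell rates by an area's census counts per cell."""
    population = sum(census_counts.values())
    return sum(rates[cell] * n for cell, n in census_counts.items()) / population


# Invented survey: 9 of 10 younger and 5 of 10 older respondents are online.
survey = (
    [("younger", True)] * 9 + [("younger", False)] * 1
    + [("older", True)] * 5 + [("older", False)] * 5
)
rates = cell_rates(survey)

# Invented census counts for two hypothetical areas.
areas = {
    "Area A": {"younger": 800, "older": 200},
    "Area B": {"younger": 300, "older": 700},
}
for name, counts in areas.items():
    print(f"{name}: estimated Internet use {area_estimate(rates, counts):.2f}")
```

In this toy version the two areas differ only because their demographic mix differs, since the same cell rates are applied everywhere. Published small area estimation methods are usually model-based and considerably more careful about uncertainty.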

There are two major reasons to suspect that geographic differences in Internet use may be important: apparent regional differences and the urban-rural divide. The authors do indeed find a regional difference: the area with least Internet use is in the North East, followed by central Wales; the highest is in London and the South East. But interestingly, geographic differences become non-significant after controlling for demographic variables (age, education, income etc.). That is, demographics matter more than simply where you live, in terms of the likelihood that you’re an Internet user.

Britain has one of the largest Internet economies in the developed world, and the Internet contributes an estimated 8.3 percent to Britain’s GDP. By reducing a range of geographic frictions and allowing access to new customers, markets and ideas it strongly supports domestic job and income growth. There are also personal benefits to Internet use. However, these advantages are denied to people who are not online, leading to a stream of research on the so-called digital divide.

We caught up with Grant Blank to discuss the policy implications of this marked disparity in (estimated) Internet use across Britain.

Ed.: The small-area estimation method you use combines the extreme breadth but shallowness of the national census, with the relative lack of breadth (2000 respondents) but extreme richness (550 variables) of the OxIS survey. Doing this allows you to estimate things like Internet use in fine-grained detail across all of Britain. Is this technique in standard use in government, to understand things like local demand for health services etc.? It seems pretty clever.

Grant: It is used by the government, but not extensively. It is complex and time-consuming to use well, and it requires considerable statistical skills. These have hampered its spread. It probably could be used more than it is; your example of local demand for health services is a good idea.

Ed.: You say this method works for Britain because OxIS collects information based on geographic area (rather than e.g. randomly by phone number) — so we can estimate things geographically for Britain that can’t be done for other countries in the World Internet Project (including the US, Canada, Sweden, Australia). What else will you be doing with the data, based on this happy fact?

Grant: We have used a straightforward measure of Internet use versus non-use as our dependent variable. Similar techniques could predict and map a variety of other variables. For example, we could take a more nuanced view of how people use the Internet. The patterns of mobile use versus fixed-line use may differ geographically and could be mapped. We could separate work-only users, teenagers using social media, or other subsets. Major Internet activities could be mapped, including such things as entertainment use, information gathering, commerce, and content production. In addition, the amount of use and the variety of uses could be mapped. All these are major issues and their geographic distribution has never been tracked.

Ed.: And what might you be able to do by integrating into this model another layer of geocoded (but perhaps not demographically rich or transparent) data, e.g. geolocated social media / Wikipedia activity (etc.)?

Grant: The strength of the data we have is that it is representative of the UK population. The other examples you mention, like Wikipedia activity or geolocated social media, are all done by smaller, self-selected groups of people, who are not at all representative. One possibility would be to show how and in what ways they are unrepresentative.

Ed.: If you say that Internet use actually correlates to the “usual” demographics, i.e. education, age, income — is there anything policy makers can realistically do with this information? i.e. other than hope that people go to school, never age, and get good jobs? What can policy-makers do with these findings?

Grant: The demographic characteristics are things that don’t change quickly. These results point to the limits of the government’s ability to move people online. They say that 100% of the UK population will never be online. This raises the question, what are realistic expectations for online activity? I don’t know the answer to that but it is an important question that is not easily addressed.

Ed.: You say that “The first law of the Internet is that everything is related to age”. When are we likely to have enough longitudinal data to understand whether this is simply because older people never had the chance to embed the Internet in their lives when they were younger, or whether it is indeed the case that older people inherently drop out? Will this age-effect eventually diminish or disappear?

Grant: You ask an important but unresolved question. In the language of the social sciences: is the decline in Internet use with age an age-effect or a cohort-effect? An age-effect means that the Internet becomes less valuable as people age and so the decline in use with age is just a reflection of the declining value of the Internet. If this explanation is true then the age-effect will persist into the indefinite future. A cohort-effect implies that the reason older people tend to use the Internet less is that fewer of them learned to use the Internet in school or work. They will eventually be replaced by active Internet-using people and Internet use will no longer be associated with age. The decline with age will eventually disappear. We can address this question using data from the Oxford Internet Survey, but it is not a small area estimation problem.

Read the full article: Blank, G., Graham, M., and Calvino, C. 2017. Local Geographies of Digital Inequality. Social Science Computer Review. DOI: 10.1177/0894439317693332.

This work was supported by the Economic and Social Research Council [grant ES/K00283X/1]. The data have been deposited in the UK Data Archive under the name “Geography of Digital Inequality”.


Grant Blank was speaking to blog editor David Sutcliffe.
