18 February 2014

How easy is it to research the Chinese web?

han-teng_liao By 2015, the proportion of Chinese language Internet users is expected to exceed the proportion of English language users. But how easy is it for researchers to find or request content and traffic data from the major Chinese websites for research? Han-Teng Liao discusses the issues of government control and intervention, public opinion monitoring, language and data tools (and the challenges of processing Chinese language texts), narratives and propaganda, and how to identify geographic information from Chinese Web data. With some robots thrown in too...
Chinese Internet Cafe

Access to data from the Chinese Web, like other Web data, depends on platform policies, the level of data openness, and the availability of data intermediary and tools. Image of a Chinese Internet cafe by Hal Dick.

Ed: How easy is it to request or scrape data from the “Chinese Web”? And how much of it is under some form of government control?

Han-Teng: Access to data from the Chinese Web, like other Web data, depends on the policies of platforms, the level of data openness, and the availability of data intermediary and tools. All these factors have direct impacts on the quality and usability of data. Since there are many forms of government control and intentions, increasingly not just the websites inside mainland China under Chinese jurisdiction, but also the Chinese “soft power” institutions and individuals telling the “Chinese story” or “Chinese dream” (as opposed to “American dreams”), it requires case-by-case research to determine the extent and level of government control and interventions. Based on my own research on Chinese user-generated encyclopaedias and Chinese-language twitter and Weibo, the research expectations seem to be that control and intervention by Beijing will be most likely on political and cultural topics, not likely on economic or entertainment ones.

This observation is linked to how various forms of government control and interventions are executed, which often requires massive data and human operations to filter, categorise and produce content that are often based on keywords. It is particularly true for Chinese websites in mainland China (behind the Great Firewall, excluding Hong Kong and Macao), where private website companies execute these day-to-day operations under the directives and memos of various Chinese party and government agencies.

Of course there is some extra layer of challenges if researchers try to request content and traffic data from the major Chinese websites for research, especially regarding censorship. Nonetheless, since most Web content data is open, researchers such as Professor Fu in Hong Kong University manage to scrape data sample from Weibo, helping researchers like me to access the data more easily. These openly collected data can then be used to measure potential government control, as has been done for previous research on search engines (Jiang and Akhtar 2011; Zhu et al. 2011) and social media (Bamman et al. 2012; Fu et al. 2013; Fu and Chau 2013; King et al. 2012; Zhu et al. 2012).

It follows that the availability of data intermediary and tools will become important for both academic and corporate research. Many new “public opinion monitoring” companies compete to provide better tools and datasets as data intermediaries, including the Online Public Opinion Monitoring and Measuring Unit (人民网舆情监测室) of the People’s Net (a Party press organ) with annual revenue near 200 million RMB. Hence, in addition to the on-going considerations on big data and Web data research, we need to factor in how these private and public Web data intermediaries shape the Chinese Web data environment (Liao et al. 2013).

Given the fact that the government’s control of information on the Chinese Web involves not only the marginalization (as opposed to the traditional censorship) of “unwanted” messages and information, but also the prioritisation of propaganda or pro-government messages (including those made by paid commentators and “robots”), I would add that the new challenges for researchers include the detection of paid (and sometimes robot-generated) comments. Although these challenges are not exactly the same as data access, researchers need to consider them for data collection.

Ed: How much of the content and traffic is identifiable or geolocatable by region (eg mainland vs Hong Kong, Taiwan, abroad)?

Han-Teng: Identifying geographic information from Chinese Web data, like other Web data, can be largely done by geo-IP (a straightforward IP to geographic location mapping service), domain names (.cn for China; .hk for Hong Kong; .tw for Taiwan), and language preferences (simplified Chinese used by mainland Chinese users; traditional Chinese used by Hong Kong and Taiwan). Again, like the question of data access, the availability and quality of such geographic and linguistic information depends on the policies, openness, and the availability of data intermediary and tools.

Nonetheless, there exist research efforts on using geographic and/or linguistic information of Chinese Web data to assess the level and extent of convergence and separation of Chinese information and users around the world (Etling et al. 2009; Liao 2008; Taneja and Wu 2013). Etling and colleagues (2009) concluded their mapping of Chinese blogsphere research with the interpretation of five “attentive spaces” roughly corresponding to five clusters or zones in the network map: on one side, two clusters of “Pro-state” and “Business” bloggers, and on the other, two clusters of “Overseas” bloggers (including Hong Kong and Taiwan) and “Culture”. Situated between the three clusters of “Pro-state”, “Overseas” and “Culture” (and thus at the centre of the network map) is the remaining cluster they call the “critical discourse” cluster, which is at the intersection of the two sides (albeit more on the “blocked” side of the Great Firewall).

I myself found distinct geographic focus and linguistic preferences between the online citations in Baidu Baike and Chinese Wikipedia (Liao 2008). Other research based on a sample of traffic data shows the existence of a “Chinese” cluster as an instance of a “culturally defined market”, regardless of their geographic and linguistic differences (Taneja and Wu 2013). Although I found their argument that the Great Firewall has very limited impacts on such a single “Chinese” cluster, they demonstrate the possibility of extracting geographic and linguistic information on Chinese Web data for better understanding the dynamics of Chinese online interactions; which are by no means limited within China or behind the Great Firewall.

Ed: In terms of online monitoring of public opinion, is it possible to identify robots / “50 cent party” — that is, what proportion of the “opinion” actually has a government source?

Han-Teng: There exist research efforts in identifying robot comments by analysing the patterns and content of comments, and their profile relationship with other accounts. It is more difficult to prove the direct footprint of government sources. Nonetheless, if researchers take another approach such as narrative analysis for well-defined propaganda research (such as the pro- and anti-Falun opinions), it might be easier to categorise and visualise the dynamics and then trace back to the origins of dominant keywords and narratives to identify the sources of loud messages. I personally think such research and analytical efforts require deep knowledge on both technical and cultural-political understanding of Chinese Web data, preferably with an integrated mixed method research design that incorporates both the quantitative and qualitative methods required for the data question at hand.

Ed: In terms of censorship, ISPs operate within explicit governmental guidelines; do the public (who contribute content) also have explicit rules about what topics and content are ‘acceptable’, or do they have to work it out by seeing what gets deleted?

Han-Teng: As a general rule, online censorship works better when individual contributors are isolated. Most of the time, contributors experience technical difficulties when using Beijing’s unwanted keywords or undesired websites, triggering self-censorship behaviours to avoid such difficulties. I personally believe such tacit learning serves as the most relevant psychological and behaviour mechanism (rather than explicit rules). In a sense, the power of censorship and political discipline is the fact that the real rules of engagement are never explicit to users, thereby giving more power to technocrats to exercise power in a more arbitrary fashion. I would describe the general situation as follows. Directives are given to both ISPs and ICPs about certain “hot terms”, some dynamic and some constant. Users “learn” them through encountering various forms of “technical difficulties”. Thus, while ISPs and ICPs may not enforce the same directives in the same fashion (some overshoot while others undershoot), the general tacit knowledge about the “red line” is thus delivered.

Nevertheless, there are some efforts where users do share their experiences with one another, so that they have a social understanding of what information and which category of users is being disciplined. There are also constant efforts outside mainland China, especially institutions in Hong Kong and Berkeley to monitor what is being deleted. However, given the fact that data is abundant for Chinese users, I have become more worried about the phenomenon of “marginalization of information and/or narratives”. It should be noted that censorship or deletion is just one of the tools of propaganda technocrats and that the Chinese Communist Party has had its share of historical lessons (and also victories) against its past opponents, such as the Chinese Nationalist Party and the United States during the Chinese Civil War and the Cold War. I strongly believe that as researchers we need better concepts and tools to assess the dynamics of information marginalization and prioritisation, treating censorship and data deletion as one mechanism of information marginalization in the age of data abundance and limited attention.

Ed: Has anyone tried to produce a map of censorship: ie mapping absence of discussion? For a researcher wanting to do this, how would they get hold of the deleted content?

Han-Teng: Mapping censorship has been done through experiment (MacKinnon 2008; Zhu et al. 2011) and by contrasting datasets (Fu et al. 2013; Liao 2013; Zhu et al. 2012). Here the availability of data intermediaries such as the WeiboScope in Hong Kong University, and unblocked alternative such as Chinese Wikipedia, serve as direct and indirect points of comparison to see what is being or most likely to be deleted. As I am more interested in mapping information marginalization (as opposed to prioritisation), I would say that we need more analytical and visualisation tools to map out the different levels and extent of information censorship and marginalization. The research challenges then shift to the questions of how and why certain content has been deleted inside mainland China, and thus kept or leaked outside China. As we begin to realise that the censorship regime can still achieve its desired political effects by voicing down the undesired messages and voicing up the desired ones, researchers do not necessarily have to get hold of the deleted content from the websites inside mainland China. They can simply reuse plenty of Chinese Web data available outside the censorship and filtering regime to undertake experiments or comparative study.

Ed: What other questions are people trying to explore or answer with data from the “Chinese Web”? And what are the difficulties? For instance, are there enough tools available for academics wanting to process Chinese text?

Han-Teng: As Chinese societies (including mainland China, Hong Kong, Taiwan and other overseas diaspora communities) go digital and networked, it’s only a matter of time before Chinese Web data becomes the equivalent of English Web data. However, there are challenges in processing Chinese language texts, although several of the major challenges become manageable as digital and network tools go multilingual. In fact, Chinese-language users and technologies have been the major goal and actors for a multi-lingual Internet (Liao 2009a,b). While there is technical progress in basic tools, we as Chinese Internet researchers still lack data and tool intermediaries that are designed to process Chinese texts smoothly. For instance, many analytical software and tools depend on or require the use of space characters as word boundaries, a condition that does not apply to Chinese texts.

In addition, since there exist some technical and interpretative challenges in analysing Chinese text datasets with mixed scripts (e.g. simplified and traditional Chinese) or with other foreign languages. Mandarin Chinese language is not the only language inside China; there are indications that the Cantonese and Shanghainese languages have a significant presence. Minority languages such as Tibetan, Mongolian, Uyghur, etc. are also still used by official Chinese websites to demonstrate the cultural inclusiveness of the Chinese authorities. Chinese official and semi-official diplomatic organs have also tried to tell “Chinese stories” in various of the world’s major languages, sometimes in direct competition with its political opponents such as Falun Gong.

These areas of the “Chinese Web” data remain unexplored territory for systematic research, which will require more tools and methods that are similar to the toolkits of multi-lingual Internet researchers. Hence I would say the basic data and tool challenges are not particular to the “Chinese Web”, but are rather a general challenge to the “Web” that is becoming increasingly multilingual by the day. We Chinese Internet researchers do need more collaboration when it comes to sharing data and tools, and I am hopeful that we will have more trustworthy and independent data intermediaries, such as Weiboscope and others, for a better future of the Chinese Web data ecology.


Bamman, D., O’Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3-5).

Etling, B., Kelly, J., & Faris, R. (2009). Mapping Chinese Blogosphere. In 7th Annual Chinese Internet Research Conference (CIRC 2009). Annenberg School for Communication, University of Pennsylvania, Philadelphia, US.

Fu, K., Chan, C., & Chau, M. (2013). Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the “Real Name Registration” Policy. IEEE Internet Computing, 17(3), 42–50.

Fu, K., & Chau, M. (2013). Reality Check for the Chinese Microblog Space: a random sampling approach. PLOS ONE, 8(3), e58356.

Jiang, M., & Akhtar, A. (2011). Peer into the Black Box of Chinese Search Engines: A Comparative Study of Baidu, Google, and Goso. Presented at the The 9th Chinese Internet Research Conference (CIRC 2011), Washington, D.C.: Institute for the Study of Diplomacy. Georgetown University.

King, G., Pan, J., & Roberts, M. (2012). How censorship in China allows government criticism but silences collective expression. In APSA 2012 Annual Meeting Paper.

Liao, H.-T. (2008). A webometric comparison of Chinese Wikipedia and Baidu Baike and its implications for understanding the Chinese-speaking Internet. In 9th annual Internet Research Conference: Rethinking Community, Rethinking Place. Copenhagen.

Liao, H.-T. (2009a). Are Chinese characters not modern enough? An essay on their role online. GLIMPSE: the art + science of seeing, 2(1), 16–24.

Liao, H.-T. (2009b). Conflict and Consensus in the Chinese version of Wikipedia. IEEE Technology and Society Magazine, 28(2), 49–56. doi:10.1109/MTS.2009.932799

Liao, H.-T. (2013, August 5). How do Baidu Baike and Chinese Wikipedia filter contribution? A case study of network gatekeeping. To be presented at the Wikisym 2013: The Joint International Symposium on Open Collaboration, Hong Kong.

Liao, H.-T., Fu, K., Jiang, M., & Wang, N. (2013, June 15). Chinese Web Data: Definition, Uses, and Scholarship. (Accepted). To be presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

MacKinnon, R. (2008). Flatter world and thicker walls? Blogs, censorship and civic discourse in China. Public Choice, 134(1), 31–46. doi:10.1007/s11127-007-9199-0

Taneja, H., & Wu, A. X. (2013). How Does the Great Firewall of China Affect Online User Behavior? Isolated “Internets” as Culturally Defined Markets on the WWW. Presented at the 11th Annual Chinese Internet Research Conference (CIRC 2013), Oxford, UK.

Zhu, T., Bronk, C., & Wallach, D. S. (2011). An Analysis of Chinese Search Engine Filtering. arXiv:1107.3794.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2012). Tracking and Quantifying Censorship on a Chinese Microblogging Site. arXiv:1211.6166.

Han-Teng was talking to blog editor David Sutcliffe.

Han-Teng Liao is an OII DPhil student whose research aims to reconsider the role of keywords (as in understanding “keyword advertising” using knowledge from sociolinguistics and information science) and hyperlinks (webometrics) in shaping the sense of “fellow users” in digital networked environments. Specifically, his DPhil project is a comparative study of two major user-contributed Chinese encyclopedias, Chinese Wikipedia and Baidu Baike.

Note: This article gives the views of the authors, and not the position of the Policy and Internet Blog, nor of the Oxford Internet Institute.