Markus Krause, UC Berkeley, Berkeley, California
Margeret Hall, University of Nebraska Omaha, School of Interdisciplinary Informatics, Omaha, Nebraska
Simon Caton, National College of Ireland, Dublin
Human computation is intrinsic to the transition towards, and necessary for the success of, digital platforms as a service at scale. Going beyond 'the wisdom of the crowd', human computation is the engine that powers now-ubiquitous platforms and services such as Duolingo and Wikipedia. Despite increasing research and popular interest, several issues around large-scale human computation projects remain open and in debate, and at the center of that debate is invariably a discussion of quality. This is due in part to the fact that a suitable and scalable mechanism for the ex-ante detection of consistently underperforming contributors has yet to be introduced. Commonly used tactics include qualification tests; pre-set qualifications; trust(worthiness) models that estimate the probability of diligent work; hidden gold standard questions; and metrics such as solution acceptance rates. Against this backdrop, we investigated the effect of different quality control methods on the response quality of contributors for tasks of varying complexity.
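As an illustration of one of these tactics, the sketch below shows how hidden gold standard questions might be used to flag consistently underperforming contributors. The 40% threshold echoes the definition of consistent underperformance used in this paper; the data structures, answer set, and function names are illustrative assumptions rather than the instrumentation used in our study.

```python
from collections import defaultdict

# Hypothetical records: (contributor_id, item_id, response); gold answers exist for a
# hidden subset of items. All names and structures here are illustrative assumptions.
GOLD_ANSWERS = {"item_17": "cat", "item_42": "dog", "item_99": "bird"}
ACCEPTANCE_THRESHOLD = 0.40  # mirrors the <40% "consistent underperformer" cutoff

def flag_underperformers(responses):
    """Return contributors whose accuracy on hidden gold items falls below the threshold."""
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for contributor, item, answer in responses:
        if item in GOLD_ANSWERS:                  # only hidden gold items are scored
            attempted[contributor] += 1
            correct[contributor] += int(answer == GOLD_ANSWERS[item])
    return {
        c: correct[c] / attempted[c]
        for c in attempted
        if correct[c] / attempted[c] < ACCEPTANCE_THRESHOLD
    }

if __name__ == "__main__":
    demo = [("w1", "item_17", "cat"), ("w1", "item_42", "dog"),
            ("w2", "item_17", "fish"), ("w2", "item_99", "rock")]
    print(flag_underperformers(demo))  # w2 is flagged with 0.0 accuracy on gold items
```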
We present a study illustrating that consistently underperforming contributors will not take on tasks that feature quality control mechanisms. Our research employs a 3 x 5 factorial experimental design, crossing three task types of varying complexity with five quality control methods, to measure the impact of quality control and task complexity on output quality. As expected, the tasks differ in complexity, confirming the hypothesized order: semantic similarity (least complex), question answering (more complex), and text translation (most complex). Most contributors were diligent. Consistently underperforming contributors (by our definition, contributors with less than 40% acceptable responses) were almost entirely absent from every condition in which a quality control method was either announced or in place, yet made up a substantial share (almost 45%) of contributors in the control conditions (none) without any quality control. Merely mentioning a required introductory test, without actually administering it (the fake level of the control factor), was sufficient to achieve the same response quality as the other quality control methods; even immediate, human-generated feedback did not raise response quality above the level of this fake introductory test. As hypothesized, response quality does not differ significantly across the quality control methods; it differs only between the none conditions (M = 0.63, SD = 0.03) and the conditions with quality control (M = 0.79, SD = 0.05), an increase of more than 25% ((0.79 - 0.63) / 0.63 ≈ 25.4%).
We therefore conclude that consistently underperforming contributors are aware that their contributions might fall short of the required quality standards when performing a task, and that very basic quality control methods (or the mere threat thereof) are sufficient to promote diligent work. We also demonstrate that extremely simple machine learning methods with task-independent features, as proposed by <blinded for review>, can predict response quality on the fly (sketched below); such methods may provide quality control for tasks similar to the ones explored in this paper. Based on our findings, we argue that expansive quality control support and applications are unwarranted. Instead, we propose that more resources be dedicated to adequately training contributors in order to raise the overall quality, and by extension competence, of crowdsourced contributions.
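To make the on-the-fly prediction idea concrete, the following is a minimal sketch of the kind of simple classifier we have in mind, assuming task-independent behavioural features such as time on task, response length, and copy-and-paste use. The specific features, labels, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the exact model of <blinded for review>.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative task-independent features per response (assumed, not the exact
# feature set of the referenced work): seconds on task, response length in
# characters, and whether the contributor used copy-and-paste.
X_train = np.array([
    [45.0, 120, 0],
    [12.0,  15, 1],
    [60.0, 200, 0],
    [ 8.0,  10, 1],
    [50.0, 150, 0],
    [10.0,  12, 1],
])
# 1 = acceptable response, 0 = unacceptable (labels from, e.g., expert review)
y_train = np.array([1, 0, 1, 0, 1, 0])

# An intentionally simple model: logistic regression over the raw features.
model = LogisticRegression().fit(X_train, y_train)

def predict_quality(seconds_on_task, response_length, used_paste):
    """Estimate the probability that an incoming response is acceptable."""
    features = np.array([[seconds_on_task, response_length, used_paste]])
    return model.predict_proba(features)[0, 1]

# Responses scoring below an acceptance threshold could be routed to review
# or rejected on the fly.
print(round(predict_quality(40.0, 110, 0), 2))
print(round(predict_quality(9.0, 11, 1), 2))
```

Because all of these features can be captured from interaction logs alone, such a predictor can score each response as it arrives, without gold questions or task-specific annotations.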
Limitations and Future Work:
While a 3 x 5 factorial model is sizable, future work should cover more of the space of quality control mechanisms to ensure the transferability of these results. Furthermore, it remains to be seen whether tasks in domains other than natural language processing yield similar results. We recognize that our minimal control mechanism (fake), which is never enforced, is not sustainable: contributors can and will quickly realize that no quality control has in fact been applied. A sustainable and low-cost mechanism to elevate the performance of diligent but underperforming contributors must be developed and tested to complete the scope of this research. As shown in this work, even a basic control for response quality increases per-task performance considerably. A worthy area of future research is support systems for those who work diligently but still underperform; this concerns both the requestor's side (e.g., writing better task descriptions) and the contributor's side (e.g., educational materials). Particularly worthwhile would be an investigation of monetary incentives for contributors' education. Monetized education-based tasks could allow contributors to learn to complete increasingly complex tasks while gaining skills and funding applicable to their offline lives. An envisioned mechanism for this could be (micro) Massive Open Online Courses, in which contributors register to learn increasingly complex skills and are financially rewarded for successful task mastery. Realized in its full depth and scope, this progressive step could contribute to the comprehensive enhancement of crowdwork from a quality perspective and to the overall, real-life skill set of contributors.