Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing
DOI: https://doi.org/10.1609/hcomp.v5i1.13306
Keywords: crowdsourcing, inter-rater agreement, reliability
Abstract
In the context of micro-task crowdsourcing, each task is usually performed by several workers. This allows researchers to leverage measures of agreement among workers on the same task to estimate the reliability of the collected data and to better understand the answering behavior of participants. While many measures of agreement between annotators have been proposed, they are known to suffer from numerous problems and anomalies. In this paper, we identify the main limitations of existing agreement measures in the crowdsourcing context, both through toy examples and with real-world crowdsourcing data, and we propose a novel agreement measure based on probabilistic parameter estimation that overcomes these limitations. We validate the new agreement measure and demonstrate its flexibility compared to existing agreement measures.
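To make the notion of inter-annotator agreement concrete, the sketch below computes two classical baselines, raw percent agreement and Cohen's kappa, for two hypothetical workers labeling the same items. This is standard background only, not the probabilistic measure proposed in the paper; the worker names and labels are illustrative.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two workers give the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for the chance agreement
    implied by each worker's own label distribution."""
    n = len(labels_a)
    p_obs = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)

# Toy example: two workers judging five items as relevant / non-relevant.
worker_1 = ["rel", "rel", "non", "rel", "non"]
worker_2 = ["rel", "non", "non", "rel", "non"]
print(percent_agreement(worker_1, worker_2))  # 0.8
print(cohen_kappa(worker_1, worker_2))        # ~0.62
```

Measures in this family are exactly the kind the paper critiques: chance correction based solely on marginal label frequencies can behave counterintuitively on skewed crowdsourcing data.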