Volume 49, Issue 1 p. 1-10

Evaluating rater quality and rating difficulty in online annotation activities

Peter Organisciak, Miles Efron, Katrina Fenlon, and Megan Senseney

University of Illinois at Urbana-Champaign, 501 E. Daniel Street, MC-492
First published: 24 January 2013

Abstract

Gathering annotations from non-expert online raters is an attractive method for quickly completing large-scale annotation tasks, but the increased possibility of unreliable annotators and diminished work quality remain a cause for concern. In the context of information retrieval, where human-encoded relevance judgments underlie the evaluation of new systems and methods, the ability to quickly and reliably collect trustworthy annotations allows for quicker development and iteration of research.

In the context of paid online workers, this study evaluates indicators of non-expert performance along three lines: temporality, experience, and agreement. It is found that a rater's past performance is a key indicator of their future performance. Additionally, the time spent by raters familiarizing themselves with a new set of tasks is important for rater quality, as is long-term familiarity with a topic being rated.

These findings may inform large-scale digital collections' use of non-expert raters for performing more purposive and affordable online annotation activities.

INTRODUCTION

Work supported through online volunteer contributions or micropayment-based labor presents a strikingly different model of annotating information items. The accessibility of large groups of contributors online – in the form of either interested volunteers or paid workers – allows for large-scale annotation tasks to be completed quickly. However, this approach also introduces new problems of reliability by problematizing assumptions about expertise and work quality. The actual raters in these tasks are generally self-selected and unvetted, making it difficult to ascertain the reliability of the ratings.

This study approaches a crucial problem: disambiguating the influence of unreliable annotators from inherent uncertainty in multi-rater aggregation. In pursuing this goal, a key assumption was made: that of a negotiated “ground truth” over an objective one. Under this assumption, rater disagreement is not in itself a sign of bad raters, but should be considered in light of the agreement among raters.

In this paper, we make the following contributions:
  • We describe the problem of reconciling annotation contributions from non-expert, semi-anonymous raters.

  • We evaluate a number of approaches for separating rater quality from rating difficulty, including dwell time, rater experience, task difficulty, and agreement with other raters.

  • As the task difficulty is not directly observable, we describe an iterative algorithm that allows us to evaluate it alongside rater reliability and separate the effect of the two.

The vehicle for this study was relevance assessment for information retrieval related to the IMLS DCC cultural heritage aggregation. Relevance assessments are a vital part of information retrieval evaluation and help in addressing the unique search challenges faced by large aggregators of cultural heritage content.

PROBLEM

Online annotation generally takes the form of short, fragmented tasks. To capitalize on the scale and ephemerality of online users, services such as Amazon's Mechanical Turk (AMT)1 have emerged, which encourage the short task model as a form of on-demand human work. Amazon describes the Mechanical Turk service as “artificial artificial intelligence,” referring to the facade of computation with a human core. The premise is that Mechanical Turk makes it easy to undertake tasks that require human perception – tasks such as encoding, rating, and annotating.

AMT has shown itself useful in information retrieval, where many individual tasks benefit from simple parallelization. Despite the fact that workers are non-expert raters, this is not inherently a problem: past studies have found that only a few parallel annotations are required to reach expert quality (Snow et al. 2008) and that increasing the amount of parallel labor per item offers diminishing returns (Novotney and Callison-Burch 2010). Training is possible on AMT, but the large workforce and transience of individual workers means that training conflicts with the cost and speed benefits of micropayment-based labor.

As AMT has grown, however, the appeal of cheating has also grown. The workforce, originally a mix of those looking to pass time and those looking to earn money, has been shifting primarily to the latter (Eickhoff and Vries 2012). Since reimbursement is done per task rather than per hour, contributors have a monetary incentive to complete tasks as quickly as possible. The site's continued growth may attract more cheaters in the future, making it more important to be able to properly identify them within classification data.

Among non-malicious workers, there is still the potential problem of varying expertise. Workers begin a task with no prior experience and grow more experienced over time. When there may be hundreds or thousands of workers, each one potentially following a learning curve, the effect of inexperience should be taken more seriously than in traditional settings with only one or a few workers. Making decisions from majority voting is quite robust in many cases. However, to safeguard against the presence of cheaters and their strengthened influence in low-consensus tasks, a less naive decision-making process would be valuable.

The problem of reconciling ground truth votes from unvetted and potentially unreliable raters is not limited to the use of Mechanical Turk. Digital libraries now have the ability to interact with their users in ways that crowdsource the creation of new content or metadata. Volunteer contributions may provide entirely new content – such as suggested labels or corrections – or feedback on existing content – such as rating the quality of an item's metadata. While unpaid engagement does not have the same financial motivation for malicious raters, contributions that are open to the public are still susceptible to low-quality results: whether through recklessness, misunderstanding, mischief, or simply spam. Furthermore, even when the ratings or annotations from unvetted semi-anonymous online raters are of a high quality, there is nonetheless a need to justify the quality of those ratings.

MOTIVATION

The impetus for this research was a desire to improve the effectiveness of an information retrieval system for the Institute of Museum and Library Services Digital Collections and Content project (IMLS DCC). The IMLS DCC is a large aggregation of digital cultural heritage materials from museums, libraries, and archives across the country.

Originally launched in 2002 as a point of access to digital collections supported by the IMLS through National Leadership Grants and LSTA funding, it has since expanded its scope to provide more inclusive coverage of American history collections, regardless of funding source. As a result of its position among the largest cultural heritage aggregations in the U.S., research through the IMLS DCC looks at the problems associated with reconciling content from thousands of different providers, including metadata interoperability, collection-item relationships, and access to materials.

One of the difficulties that IMLS DCC must address is information retrieval when the metadata records in its aggregation are of inconsistent length, style, and informativeness. Overcoming these problems in order to improve subject access to the breadth of materials is being actively researched (Efron et al. 2011, Efron et al. 2012). In doing so, human relevance ratings are an invaluable resource for evaluating document relevance in a given query.

RELATED WORK

As non-expert classification has become more common, there have been a number of studies into the quality of its raters. Generally, such studies have found that, while a single rater does not match the quality of an expert, aggregating votes from multiple earnest raters can match the quality of an expert.

Snow et al. (2008) found that for natural language processing tasks, only a few redundant classifications are necessary to emulate expert quality – their task required an average of four labels. Similarly, Novotney and Callison-Burch (2010), looking at online transcription, found that the increase in quality from adding redundant annotations was small, and recommended allocating resources to collecting new data. Additionally, they noted that disagreement measures are more effective for identifying and correcting for bad workers than they are for finding good workers, due to false positives among highly ranked raters.

In understanding the role of non-expert raters, a number of studies have taken differing approaches to ranking rater reliability and dealing with noise. Some have attempted to model rater noise against gold standard labels (Hsueh et al. 2009, Eickhoff and Vries 2012). However, more commonly, researchers look at ways to understand rater quality without the presence of ground truth data.

One common approach to separate the latent variable of rater quality from task difficulty enlists the Expectation Maximization (EM) algorithm, weighing rater judgments based on past performance (Whitehill et al. 2009, Welinder and Perona 2010, Wang et al. 2011). The approach that we take is similar in principle to the EM algorithm.

Donmez et al. (2010) take a different approach, using a variant of Sequential Bayesian Estimation to estimate incremental changes in raters by building posterior probability density functions.

Raters have been seen as a binary between good and bad, where the nature of the problem is to identify the latter for removal (Dekel and Shamir 2009). Others have seen reliability not as an issue of replacing users, but rather of identifying low quality ratings to reinforce with additional ratings (Sheng et al. 2008).

One notably unique concept of user quality was the assumption by Donmez et al. (2010) that the quality of raters changes over time. In other words, rater quality was considered a distribution over time, rather than an overall score. Notable about this approach is that there are no assumptions about the direction of quality change by raters, allowing them to account not only for inexperience but also for occasional patches of low quality ratings by a rater.

Alongside prior work in representing non-expert annotators, research has also considered using the information for deciding on future actions. This has been considered both as an act of choosing the next tasks for a rater (Wallace et al. 2011, Welinder and Perona 2010) and, alternately, as an act of choosing the next raters for a task (Donmez et al. 2010).

In 2011, the Text Retrieval Conference (TREC) held a Crowdsourcing track for the first time, which dealt directly with the evaluation of search engines by non-expert raters hired through micropayment services. Teams looked at one or both of two tasks. The first task was to effectively collect high-quality relevance judgments. The second task, in line with the goals of this study, was to “compute consensus (aka. ‘label aggregation’) over a set of individual worker labels” (Lease and Kazai 2011).

There were two evaluation sets used with the second task of the TREC Crowdsourcing Track: one of consensus labels from among all the participating teams and one of ground truth gold labels done by professional assessors. Accuracy rates – the number of properly labeled ratings divided by all ratings – spanned from 0.35 to 0.94 with a median of 0.835 against the crowdsourced consensus labels, while the runs against the gold labels spanned from 0.44 to 0.70 with a median of 0.66. In achieving these results, the ten teams used a variety of approaches, including the EM algorithm, rules-based learning models, and topic-conditional naïve Bayes modeling (ibid).

Figure 1. Screenshot of the rating interface.

When measured by accuracy, the EM algorithm was among the most prominent. The best performing team against each evaluation set – BUPT-WILDCAT and uc3m, respectively – both had an EM implementation in their submission, though the latter was paired with a number of additional rules. However, uc3m's second, non-official run slightly outperformed the accuracy of their official submission with a support vector machine (SVM) approach (Urbano et al. 2011).

APPROACH

The purpose of this study is to determine important factors in making truth value decisions from repeated ratings by non-expert raters. In the context of paid microtask labor, we look to simultaneously interrogate worker quality and task difficulty, allowing our estimates of one to inform our estimates of the other.

The factors that we consider are in the following three areas:
  • Experience. Do raters grow more reliable over time? Can you account for the rating distribution given the rater's experience? In this study, tasks are grouped topically, by “queries.” Raters were asked ‘is this metadata record relevant to Query X’ or ‘what is the tone of Query X?’ As such, we also looked at whether a rater's experience with a query affects their performance.

  • Temporality. Does the length of time that a rater spends on a question reflect the quality of their rating?

  • Agreement. Does a rater's agreement or disagreement with other raters reflect their overall quality as a rater? We also investigated ways to quantify disagreement and to correct for it.

Most of our evaluation was quantified through accuracy, the percentage of correct classifications that are made:

$$\text{accuracy} = \frac{\text{number of correct classifications}}{\text{total number of classifications}}$$

There are two comparison sets of data against which ‘correct’ classifications were taken. The first was a set of consensus labels, generated simply by taking the majority vote for a given task. Since these are derived from the actual dataset, they are not completely reliable. However, for comparative purposes, they offer us a metric by which to see trends in the data.
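As an illustration, consensus labels and the accuracy measure can be computed as in the minimal sketch below. It assumes a pandas DataFrame named `ratings` with hypothetical columns `task_id`, `rater_id`, and `label`; none of these names come from the original system.

```python
import pandas as pd

def consensus_labels(ratings: pd.DataFrame) -> pd.Series:
    """Majority-vote consensus label per task (ties broken arbitrarily)."""
    return ratings.groupby("task_id")["label"].agg(lambda s: s.mode().iloc[0])

def accuracy(ratings: pd.DataFrame, reference: pd.Series) -> float:
    """Share of individual ratings matching the reference label for their task."""
    return (ratings["label"] == ratings["task_id"].map(reference)).mean()

# Accuracy against majority-vote consensus labels:
# acc = accuracy(ratings, consensus_labels(ratings))
```

The same `accuracy` function can be reused against the oracle judgments by passing a Series of oracle labels as the reference.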

The cleaner set of ground truth data is a set of oracle ratings done by the authors. Since the authors are of known reliability and have a close understanding of both the rating task and the data being rated, the oracle judgments serve as an effective measure for evaluating the accuracy of the majority votes themselves.

DATA

We looked at two datasets in this study.

In the primary dataset, raters judged the relevance of cultural heritage documents to a given query. This data was rated with three-rater redundancy, meaning that three raters completed the rating task for each document. There were three label options available to raters: relevant, non-relevant, and I don't know. The unknown option was treated as a skipped task, and those ratings were removed from the final dataset.

Figure 2. Distribution of raters' agreement with other raters (x) and agreement with oracle ratings (y). Since agreement with oracle ratings tends to be higher than agreement with other raters, judging the quality of a rater by the latter measure is a conservative approach. Since oracle ratings are available for only a subset of the data, only raters with a minimum of ten ratings overlapping with the oracle set are shown here.

Figure 3. Number of ratings contributed per rater, roughly following a power-law distribution.

Annotations were collected through Amazon's Mechanical Turk service, using a custom rating interface. When a rater accepted a judgment task, they were shown a page with a query, a description of the task, a description of the coding manual (i.e. what types of documents should be rated as relevant), and up to ten ribbons of documents to rate (see Figure 1). The structured form of digital item records lends itself well to such tasks, which we represented through the title, description, and related image thumbnail. To aid the task of scrolling through ratings and decrease the time spent on tasks, our interface automatically scrolled to the next task once the previous one was rated.

The rating tasks involved very brief collection metadata documents, which were rated according to their relevance to a given query. There were two reasons that raters annotated ten items at a time. First, this allowed for less time loading and adjusting to new tasks. If there was a learning curve for each query – as this study later found to be present – it seemed sensible to allow raters some time to rate once they grew comfortable with a task. The second reason was to create a minimum usable profile of a rater's performance, which would have been difficult with fewer tasks. Note that not all sets of tasks had ten items, as our system would track tasks that were completed or in progress, serving fewer than ten when ten were not available.

Originally, around 17,700 data points were collected, though this was later increased to slightly fewer than 23,000. The average amount of time spent on each individual item was 4.8 seconds, with half of all ratings being done in less than 1.8 seconds and full rating sets being completed in an average time of 37.3 seconds.

Among the raters contributing to the data, there were 157 unique raters, rating an average of 141.9 tasks. The most dedicated rater completed a total of 1404 ratings. The distribution for contribution count resembles an inverse power law, a distribution commonly seen among contributions from online users (see Figure 3).

For diversity, we also tested against a different dataset, in which raters classified the tone of a number of political tweets (Organisciak 2012). This Twitter sentiment dataset included more classification options: raters rated each tweet as having a positive, negative, or neutral tone, or flagged it as incoherent or spam.

For both the primary and secondary datasets, there was an accompanying set of ground truth oracle judgments done by the authors. These were used for evaluation.

FACTORS

Temporality

Among the statistics retained with our primary relevance judgment dataset was the dwell time spent on each rating (this data was not available for the Twitter dataset, which was reused from an earlier study).

Our hypothesis was that dwell time was not significant when understood independently, but might indicate the quality of raters when taking into account the order in which tasks were completed. Since tasks were done in sets of ten, the order referred to where in this set they occurred. Order served as a useful grouping factor because the time spent on the first rating is confounded with the time spent reading the rating instructions, which is to say that the two are inseparable.

Figure 4 shows the distribution of rater performance by dwell time alone. As expected, while there is a slight upward shift in time spent on ratings which turn out to be correct, we do not have enough evidence to reject the null hypothesis of equal distributions, and conclude that dwell time is insignificant to performance when taken independently (Wilcoxon rank sum p = 0.064; p = 0.154 when excluding extreme outliers).

When considering the order of task completion alongside dwell time, pairwise Wilcoxon Rank Sum tests show that the first rating in a set was significantly different from all other ratings (p < 0.001, with Bonferroni adjustment), as were all pairwise comparisons with the second rating in a set (p = 0.02 vs order 3, p < 0.001 vs all others; Bonferroni adjustment). Notably, however, we fail to reject the null hypothesis for all other ratings in a set. In other words, there is extremely little difference between a rater's third and tenth ratings, as well as all comparisons in between. This is more abrupt than the gradual decline we expected, and suggests that the learning curve for a rater to rate comfortably is only the first two ratings.

Comparing the accuracy of ratings by dwell time, the time spent on the first rating of a set is significantly higher for ratings that are correct than for those that are incorrect (Wilcoxon Rank Sum one-sided p=0.01). This stands in contrast to every rating after the first one, none of which show a significant difference in time spent between ratings that are correct and ones that are incorrect.
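A sketch of how these comparisons might be run is below, assuming the same hypothetical `ratings` DataFrame with added columns `dwell_time`, `order_in_set` (position 1 to 10 within a set), and `correct` (agreement with the consensus label). The Wilcoxon rank-sum test is available in SciPy as the Mann-Whitney U test.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu  # Wilcoxon rank-sum / Mann-Whitney U test

def dwell_time_by_correctness(ratings):
    """One-sided test: do correct ratings receive more dwell time than incorrect ones?"""
    correct = ratings.loc[ratings["correct"], "dwell_time"]
    incorrect = ratings.loc[~ratings["correct"], "dwell_time"]
    return mannwhitneyu(correct, incorrect, alternative="greater")

def pairwise_dwell_time_by_order(ratings):
    """Pairwise rank-sum tests across set positions, with Bonferroni adjustment."""
    positions = sorted(ratings["order_in_set"].unique())
    pairs = list(combinations(positions, 2))
    adjusted = {}
    for a, b in pairs:
        x = ratings.loc[ratings["order_in_set"] == a, "dwell_time"]
        y = ratings.loc[ratings["order_in_set"] == b, "dwell_time"]
        p = mannwhitneyu(x, y, alternative="two-sided").pvalue
        adjusted[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni correction
    return adjusted
```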

Figure 4. Frequency distribution of the average amount of time that users spent on the tasks that they rated incorrectly (solid line) and those they rated correctly (dotted line).

Figure 5. Average accuracy of raters' nth overall rating. Note that there is no visible correlation between lifetime experience and performance. Only points representing 20 raters or more are shown.

Figure 6. Density of percentage of accurate ratings after the first in a set, dependent on whether the first rating was correct. Higher density toward the right side of the graph indicates better performance. When raters started a set incorrectly, their performance skewed much lower for the rest of the set.

The fact that a rater who spends more time on the first rating is more likely to be correct suggests that there is a latent effect in how closely people read the description, which is confounded with the time of the first rating. If this is in fact what accounts for the significant difference, it should be an effect that lingers across the full set of data. Figure 6 shows this to be the case: raters who make a correct rating on the first item are much more reliable in the rest of the rating set.

As part of the rating instructions, raters were presented with a description of what types of results are relevant to the given query (see screenshot in Figure 1). If a rater does not read this section carefully, their ratings would be more interpretive, possibly resulting in inconsistencies with raters that followed the instructions more carefully.

EXPERIENCE

Does rater quality increase with practice? As an extension of the order grouping factor that we examined alongside dwell time, we considered the long-term experience of a rater. Experience was examined in two forms: lifetime experience and query experience.

Lifetime experience is the overall number of tasks that a rater has completed. Is a rater's 100th task more likely to be correct than their first task? Our hypothesis was that over time raters would grow more reliable. However, this hypothesis proved to be incorrect.

Figure 5 shows the distribution of accuracy ratings across lifetime experience. Each point represents the percentage of the nth ratings that were correctly rated. If a point at position x shows an accuracy of 0.80, this means that 80% of tasks which were raters' xth rating agreed with the majority, our estimated value for the correct label.
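A sketch of how accuracy by lifetime position might be computed is below, continuing with the hypothetical `ratings` DataFrame and assuming a `timestamp` column for ordering each rater's work; the 20-rater cutoff mirrors what is shown in Figure 5.

```python
def accuracy_by_lifetime_position(ratings, min_raters=20):
    """Mean accuracy of raters' nth overall rating (cf. Figure 5)."""
    ordered = ratings.sort_values("timestamp").copy()
    ordered["nth"] = ordered.groupby("rater_id").cumcount() + 1
    stats = ordered.groupby("nth")["correct"].agg(accuracy="mean", n_raters="count")
    # Keep only positions reached by at least `min_raters` raters.
    return stats[stats["n_raters"] >= min_raters]
```

The same grouping, keyed on a per-query rather than lifetime position, would produce the query-experience curve discussed next.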

The second measure of experience, query experience, refers to the number of tasks that a rater has completed within a single topical domain. In information retrieval relevance judgments, raters are asked to judge whether a document is relevant to a given query. Similarly, in our secondary dataset of Twitter sentiment ratings, raters were asked to annotate the opinion of the tweet regarding a given topic.

Figure 7. Average accuracy of raters' nth rating with a query. An accuracy value of 0.8 at position 25 means that, for all the ratings that were a rater's 25th task with the given query, 80% of those ratings were correct (according to majority consensus). Only points representing 20 raters or more are shown. Note that accuracy does not improve among the first thirty tasks that raters complete on a topic, but increases sharply after that.

Query experience proved to be an indicator of rater quality among the most experienced users. For approximately the first thirty tasks which raters completed with a single query, they did not demonstrate any meaningful difference in quality. As demonstrated in Figure 7, however, ratings beyond that point showed a sharp increase in quality.

RATER AGREEMENT AND TASK DIFFICULTY

In addition to worker experience and time spent per task, we looked at the ability of rater agreement and task difficulty to discern the accuracy of ratings. These were considered together because they are invariably confounded: a task has as few as three ratings informing any estimate of its difficulty, and those ratings are each biased by the quality of the specific raters involved.

We looked at two approaches: identifying and replacing low quality workers, and an iterative algorithm for weighing workers and tasks.

Replacing Problem Workers

One of the immediate problems with our primary data was low rater agreement (Fleiss' kappa = 0.264). In our first attempt to improve the agreement between raters, we identified low-quality raters and replaced their contributions. First, a confusion matrix was calculated for all raters and an accuracy rate was taken as a measure of a rater's reliability. Raters below a certain threshold were removed and new raters replaced their ratings. The threshold chosen was 0.67, meaning raters whose ratings agreed with their co-raters on a task less than two-thirds of the time were removed.
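The removal step can be sketched as follows, again using the hypothetical `ratings` schema. A rater's agreement rate is taken here as the share of their ratings that match the per-task majority vote; as a simplification, the majority in this sketch includes the rater's own vote.

```python
def rater_agreement_rates(ratings):
    """Share of each rater's labels that agree with the per-task majority vote."""
    consensus = ratings.groupby("task_id")["label"].agg(lambda s: s.mode().iloc[0])
    agrees = ratings["label"] == ratings["task_id"].map(consensus)
    return agrees.groupby(ratings["rater_id"]).mean()

def low_agreement_raters(ratings, threshold=0.67):
    """Raters at or below the removal threshold applied in this study."""
    rates = rater_agreement_rates(ratings)
    return rates[rates <= threshold].index.tolist()
```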

The threshold for removing workers was supported by an exercise in which an undiscerning rater was emulated, replacing randomly selected classifications in the data with its own random ratings. While a rater in an environment completely populated by random raters would be in the majority three-quarters of the time, inserting random raters alongside the real raters in the data gave a more realistic estimate. Across 100 runs, the mean accuracy rate of the random rater was 0.680, with a median of 0.677 and a standard deviation of 0.080. In other words, the raters whose data was removed were less likely to be in the majority opinion on a rating than a randomized bot. This accuracy rate also puts our data in perspective, falling somewhere between the 0.75 agreement that would be expected of a random rater in a completely random triple-redundancy labeling system and the 0.50 agreement expected of a random rater in an ideal human setting with all other raters agreeing on ratings.
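The chance-agreement exercise might look like the sketch below, which replaces one randomly chosen rater's labels with uniformly random labels and records how often those random labels land in the majority. The exact substitution procedure used in the study is not fully specified, so this is one plausible reading.

```python
import numpy as np

def simulate_random_rater(ratings, labels=("relevant", "non-relevant"),
                          n_runs=100, seed=0):
    """Estimate how often an undiscerning rater would agree with the majority."""
    rng = np.random.default_rng(seed)
    rater_ids = ratings["rater_id"].unique()
    rates = []
    for _ in range(n_runs):
        fake = ratings.copy()
        victim = rng.choice(rater_ids)                      # rater to overwrite
        mask = fake["rater_id"] == victim
        fake.loc[mask, "label"] = rng.choice(labels, size=int(mask.sum()))
        consensus = fake.groupby("task_id")["label"].agg(lambda s: s.mode().iloc[0])
        agrees = fake.loc[mask, "label"] == fake.loc[mask, "task_id"].map(consensus)
        rates.append(agrees.mean())
    return np.mean(rates), np.median(rates), np.std(rates)
```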

There were 23 raters at or below the threshold that were removed, accounting for 2,377 ratings (17.7% of the data). Notably, 10 raters with a total of 1,069 ratings had accuracy rates right at the threshold, meaning that nearly half of the removed ratings would not have been taken out with a slightly lower threshold.

After removing problem workers, there was an increase in kappa score from 0.264 to 0.358. The increase in intercoder agreement is expected, given that our metric for problematic raters is how much they agreed with other raters. However, since these raters were by definition already in the minority much of the time, their influence on actual votes was not high. Thus, the assumption to test is whether, when low-agreement raters do end up in the majority, they cause damage by influencing votes in the wrong direction.

In fact, the negative quality impact of problem raters proved to be very small: the accuracy rate of final votes increased from 0.856 to 0.862 after replacing them. An alternative to selective replacement of problem raters is selective redundancy: rather than removing data, one can add more labels, as encouraged by Sheng et al. (2008). This approach resulted in an increase to only 0.859, smaller than the gain from removing problem workers. In other words, majority rating proved fairly efficient at smoothing over individual bad raters, limiting their influence.

Figure 8. Distribution of rater reliability scores based on accuracy rate, shown in descending order.

To further increase rater agreement, one could presumably run the replacement process again. However, when non-expert labels are being paid for, removing problematic raters can grow quite costly – especially given the low payoff in accuracy. A cheating or sloppy rater can also complete a large number of ratings quickly, increasing the potential waste. Still, the removal and blocking of low-agreement raters can be automated fairly easily, making it possible to incorporate in real time within a rating interface.

Why were some workers correct – or at least in the majority opinion on what a correct rating is – less often than chance? One possibility is sincere raters misunderstanding the task. Wang et al. (2011) refer to such situations as recoverable error and offer a method for identifying consistently incorrect raters and correcting their votes. In the case of binary data such as our relevance judgments, this would simply mean inverting relevant votes to non-relevant, and vice versa. However, none of the raters in our data would improve with such an approach, and it seems unlikely that a rater would make such a drastic mistake systematically. It is possible, though, that less drastic misinterpretations can lead to problems with difficult tasks, due to misunderstanding the delineation between categories. As we found in our tests on dwell time, raters who appear to spend less time on instructions tend to make more errors.

Iterative Optimization Algorithm

While removing raters based on their error rate has a positive effect on the data, it does not take into account the difficulty of the tasks being completed by the rater. If a rater has the misfortune of being assigned a particularly difficult or debatable set of results to rate, their error rate may prove to be quite high. More unfortunate still, a rater may be rating alongside numerous low quality raters. If two low quality raters disagree with one higher quality rater, the dissenting rater's reliability should reflect the circumstances. There may be latent variables not accounted for by our system which adversely affect the data.

To account for unknown variables and separate out signal from noise, we evaluated a set of iterative algorithms for weighing rater votes and difficulty of the task. In line with the purpose of this study, this approach allows one to not only evaluate raters, but to separate out the effect of the task itself.

Our algorithm iterates over two steps. In the first, an expected vote for each document is calculated, given the information that is available about that document, the possible labels for that document, and the raters evaluating that document. Early on, that information is limited: while it is known how often each option was chosen for each document rating and how often each rater used each option, there is no information about the quality of the ratings or the raters making them.

In the second stage of our algorithm, the assigned labels of the expected votes are used to update the parameters that are used in step one. This involves assigning values of confidence for the results and determining rater quality based on that confidence value. After this stage, the algorithm was iterated again, returning to the first stage for voting on the expected true value.

This algorithm converges or approaches a convergence limit after a number of iterations. The number of iterations that are required before the data converges varies, but only a few are generally needed for relatively simple data such as information retrieval relevance judgments.
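A minimal sketch of this two-step loop is given below, under the same hypothetical `ratings` schema. The specific choices here (confidence as the reliability-weighted share of votes for each label, and reliability as the mean confidence of a rater's own labels) are assumptions standing in for the concrete weighting schemes compared later, not the exact published formulation.

```python
import pandas as pd

def iterate_votes(ratings: pd.DataFrame, n_iter: int = 5):
    """Alternate between estimating task labels and updating rater reliability."""
    reliability = {u: 1.0 for u in ratings["rater_id"].unique()}  # start equal
    votes = {}
    for _ in range(n_iter):
        # Step 1: expected vote per task from reliability-weighted label confidence.
        confidence = {}  # (task_id, label) -> confidence in that label
        for task_id, group in ratings.groupby("task_id"):
            weights = group["rater_id"].map(reliability)
            scores = weights.groupby(group["label"]).sum()
            scores = scores / scores.sum()
            for label, score in scores.items():
                confidence[(task_id, label)] = score
            votes[task_id] = scores.idxmax()  # soft label; may flip next iteration
        # Step 2: update each rater's reliability from the confidence of their labels.
        rating_conf = pd.Series(
            [confidence[(t, l)] for t, l in zip(ratings["task_id"], ratings["label"])],
            index=ratings.index,
        )
        reliability = rating_conf.groupby(ratings["rater_id"]).mean().to_dict()
    return votes, reliability
```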

The benefit to this approach is that rater quality is less dependent on circumstance. Consider the following scenarios:
  • A rater makes a dissenting rating on a difficult task. To form an opinion only on whether they agreed or disagreed with other raters would be unfair to this rater and possibly remove authority from a good rater. For example, in an instance with five raters, there is a difference between whether a rater is the lone dissenter against a majority of four or one of two dissenters against a majority of three. In the latter case, there is more uncertainty about what the truth value really is. Unfortunately, this distinction is limited in instances with only two categories and three raters, such as a large portion of our primary dataset.

  • A cheating rater is correct by chance. As our earlier simulation found, a random voting rater will be correct 67% of the time in our primary dataset. By weighing this rater's vote according to their overall reliability, their votes, even if correct, will hold less sway. By setting their reliability score based on the confidence in their ratings, their influence will be even lower in subsequent iterations.

For confidence scores $C_i = \{C_{i1}, C_{i2}, \ldots, C_{i|L|}\}$, where $L$ is the set of all possible labels – $L = \{0, 1\}$ for our cultural heritage relevance judgments and $L = \{0, 1, 2, 3, 4\}$ for our Twitter sentiment ratings – the truth value vote is always chosen as the highest-confidence label:

$$v_i = \arg\max_{j \in L} C_{ij}$$

As the vote can change in subsequent iterations, it is a soft label.

Since voting is always done on the highest confidence label, a number of methods were evaluated for assigning a confidence value to a rating.

For calculating vote confidence, we looked at the following approaches:
  • Probability of rater agreement for label j of rating task i. This approach represents simple majority voting and was used for comparison. It counts the number of category-$j$ labels on task $i$, $|l_{ij}|$, and divides it by the total number of labels received by the task, $|l_i|$:

$$C_{ij} = \frac{|l_{ij}|}{|l_i|}$$

Due to the lack of rater influence in the expression, this does not require iteration, as it will not change.

  • Probability of rater agreement for task i given a rater of quality U. This approach, taken before in Sheng et al. (2008), weighs confidence C according to the mean reliability scores $r_u$ of the raters choosing each label (this and the previous option are sketched in code after this list):

$$C_{ij} = \frac{1}{|U_{ij}|} \sum_{u \in U_{ij}} r_u$$

where $U_{ij}$ is the set of raters who assigned label j to task i.

  • A weighted ranking function previously described in Organisciak (2012). This approach accounts for higher numbers of redundant raters, while also offering diminishing returns on each rater added.
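Minimal sketches of the first two options are below, assuming `task_ratings` is the slice of the hypothetical `ratings` DataFrame for a single task and `reliability` is a dict of rater reliability scores; both return a confidence value per candidate label.

```python
def majority_confidence(task_ratings):
    """Option 1: share of the task's labels that chose each category."""
    counts = task_ratings["label"].value_counts()
    return counts / counts.sum()

def mean_reliability_confidence(task_ratings, reliability):
    """Option 2: mean reliability score of the raters choosing each label."""
    weights = task_ratings["rater_id"].map(reliability)
    return weights.groupby(task_ratings["label"]).mean()
```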

In addition to task confidence, we considered a number of approaches to weigh rater scores. The basic approach is to use the mean confidence for every single rating that a rater has made before. However, there are two problems with doing so. First, since task confidence is bounded between zero and one, setting raters' scores based on confidence alone will result in continuously declining rater reliability scores between iterations, without any sort of convergence. This decline would also be unevenly distributed, algorithmically punishing raters with more completed tasks. Second, since a random rater has an average accuracy of 0.67 in our dataset, the range between good and bad raters is small and skewed upward, making it ineffective for weighing votes. Ideally, on a task where two theoretical cheaters disagree with a near-perfect rater, an early iteration should flip the vote in favor of the better voter.

Rater quality was weighed in the following ways:
  • Exponential decay. Reliability scores are calculated as the mean confidence of a rater's tasks and then raised to the power of two or three, depending on how aggressive the algorithm's confidence weighting should be. A decay function can disrupt an algorithm's convergence and requires lower boundaries. (Each of these weighting schemes is sketched in code after this list.)

  • Reliability score normalization. The mean of all reliability scores is normalized to a value of 1.0 by dividing each score by the mean of all scores (the sum of all reliability scores divided by the number of raters):

$$r'_u = \frac{r_u}{\frac{1}{|U|} \sum_{v \in U} r_v}$$

  • Relative scoring. Reliability scores are calculated on the confidence of their ratings relative to the highest rating in each set.
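The three weighting schemes might be sketched as below. `mean_conf` is assumed to be a pandas Series of each rater's mean rating confidence, and `conf` a DataFrame with one row per rating (columns `rater_id`, `set_id`, `confidence`); the exponent and the interpretation of "set" in relative scoring are assumptions.

```python
import pandas as pd

def exponential_decay(mean_conf: pd.Series, power: int = 2) -> pd.Series:
    """Raise mean confidence to a power to widen the gap between raters."""
    return mean_conf ** power

def normalize_to_unit_mean(mean_conf: pd.Series) -> pd.Series:
    """Divide by the mean score so that the average reliability is 1.0."""
    return mean_conf / (mean_conf.sum() / len(mean_conf))

def relative_scoring(conf: pd.DataFrame) -> pd.Series:
    """Score each rating relative to the highest-confidence rating in its set."""
    rel = conf["confidence"] / conf.groupby("set_id")["confidence"].transform("max")
    return rel.groupby(conf["rater_id"]).mean()
```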

For comparison, we also ran a rater reliability scoring function as described in Wang et al. (2011), which is based on the accuracy rate of the raters (i.e., how many they rated correctly compared to incorrectly) without any weight given to the confidence in the tasks that they completed.

The various techniques for calculating confidence and setting rater reliability scores were combined in sensible ways and evaluated. Accuracy rates were recorded for the number of correct labels applied at the fifth iteration of each algorithm.

Robustness was also tested, by emulating malicious raters. Bots replaced random raters' ratings with their own undiscerning ratings. The false ratings consisted of 5% of the data and were used to see whether there were differences in how the algorithms handled clear cheaters.

The algorithm combinations were as follows:
  • Majority: The baseline vote based on majority labels.

  • Basic Algorithm: Described in Wang et al. (2011). Confidence is weighed by rater reliability, and rater reliability is dependent on the basic accuracy rate.

  • Basic with Reliability Decay: A modification of the basic algorithm, with exponential rater reliability score decay.

  • Regular with Reliability Decay / Normalized / Relative Scoring: Confidence is weighed by rater reliability, and rater reliability is weighed in one of the ways introduced above.

  • Alternate Algorithm: Confidence is calculated using the approach previously described in Organisciak (2012).

Table 1 displays the accuracy ratings for the various runs. This can inform a number of observations.

Once again, majority voting appears to be quite effective. Compare the baseline majority accuracy of 0.8573 with the similar relevance judgment task in the TREC Crowdsourcing Track, where the best algorithms peaked at 0.70 accuracy against the gold label set, and it becomes clear that our dataset is fairly clean from the start (Lease and Kazai 2011). The effectiveness of the baseline majority vote for the primary data is also accentuated by the relatively small accuracy gains achieved by the algorithm combinations.

Table 1. Accuracy rates of iterative algorithms on different datasets. All iterated data shown at 5th iteration.

                                          IMLS DCC Rel Ratings   Twitter Sentiment   IMLS DCC Rel Ratings (w/ cheater)
Majority Vote (baseline, no iteration)           0.8573               0.5618                  0.8479
Basic Algorithm (Sheng et al. 2008)              0.8590               0.5876                  0.8494
Basic w/ Reliability Decay                       0.8669               0.6082                  0.8605
Regular w/ Reliability Decay                     0.8590               0.5979                  0.8557
Regular w/ Reliability Normalization             0.8590               0.5876                  0.8494
Regular w/ Relative Reliability                  0.8621               0.5825                  0.8479
Alt. Algorithm (Organisciak 2012)                0.8637               0.5928                  0.8510

In contrast, the Twitter sentiment dataset has a much lower baseline. The spread of possible votes with this data is considerably wider: where with binary categories the worst case scenario for three raters is agreement between only two, the five-category Twitter data can result in nobody agreeing. With the Twitter data, raters also showed an aversion to administrative categories: when the oracle rater would rate a message as “spam” or “incoherent,” the online raters avoided doing so. In our IMLS DCC data, this rater coyness was seen with the “I don't know” ratings, but those were treated as missing data and removed.

Given its lower starting accuracy, the Twitter data showed greater improvements with the iterative algorithms than the relevance ratings did. Similarly, the iterative algorithms proved more robust against the cheaters that were artificially inserted into the data. This seems to point to their usefulness with particularly problematic data.

The iterative algorithms did not all perform equally, however. Notably, the basic algorithm with an exponential decay performed better than expected. This algorithm weighs voting according to rater reliability scores, but rather than weighing rater reliability by the confidence in the ratings that the rater makes, it simply uses the rater's accuracy rate. By applying an exponential decay to the rater reliability scores, it gives the generally conservative algorithm more power to pull down low quality raters. Still, one possibility for this surprising result is that it is not as aggressive in separating out raters as the other versions. A future direction worth exploring would be a deeper look into the individual votes that flip or do not flip with these algorithms, and how often good votes are accidentally overturned.

Investigating an iterative algorithm for optimizing rater quality and task difficulty, we found that it held limited usefulness for three-annotator, two-category annotation. This is likely due to the limited amount of variance allowed by the structure. There are only two states of majority opinion – three-rater consensus or two agree/one disagrees – meaning that when a rater disagrees there is little information on whether it is because they are a bad rater or because the task is inherently difficult to agree upon. More information becomes available by including more categories or increasing the number of raters. However, including more raters also has a positive effect on quality by itself. Thus, the experience of this study is that for binary labels, majority rating is generally robust enough.

CONCLUSIONS

This study looked at the growth of online annotation microtasks and the problems posed by non-expert raters, examining indicators of their performance.

Most significantly, we found that raters who spend more time on the first rating of a task set are significantly better performers on that set. This points to a latent variable in how closely raters read the task instructions. Indeed, the effect of extra time on the first rating seems to persist throughout a set, and raters who are correct on the first task are more likely to be correct on subsequent tasks in that set.

We also looked at the effect of experience on a rater. Generally, the amount of overall rating experience a rater had at the point of a rating did not reflect on their quality. However, a rater's query experience does result in better performance, though after some time.

Finally, this study looked at agreement as an indicator of rater quality. For simple tasks, there is a notable robustness in the basic agreement measure of whether a rater is in the majority opinion of a multi-rater annotation. For more complex tasks or noisier data, an iterative algorithm can offer slight improvements on majority opinion.

Disagreement alone does not mean that the data is problematic, however. We found that high disagreement among non-expert raters is not necessarily indicative of problematic results. Low inter-rater agreement may indicate a difficult task or individual rogue raters. While inter-rater agreement can be increased significantly by replacing the work of low quality raters, the corresponding improvement in accuracy is less pronounced.

We hope that these findings may inform future activities in digital libraries, whether by offering an understanding of the nature of online raters or as a stepping stone toward further work.

Acknowledgements

This work was supported by the Institute of Museum and Library Services (LG-06-07-0020). Expressed opinions are those of the authors, not the funding agency. The project is hosted by the Center for Informatics Research in Science and Scholarship (CIRSS). Other contributing project members include Timothy W. Cole, Thomas Habing, Jacob Jett, Carole L. Palmer, and Sarah L. Shreeves.