
In Search of Quality Control with Crowd-Based Sentiment Judgments

March 4, 2011

In a previous post, I described our evolving approach for developing a question that can be addressed on our Pulse platform. We’ve also described previously why we think crowdsourcing is a smart way to get lots of judgments about sentiment expressed in social media. But what about quality control? How can we maintain an acceptable level of quality while relying on the crowd to make thousands and thousands of judgments?

Quality through known answers and feedback to workers. We were drawn to CrowdFlower because of their approach to quality control using what they call “gold”. In a typical “assignment” set up on the CrowdFlower platform, a worker makes judgments for a group, or assignment, of “units” (a unit in our case is an individual tweet). Within every assignment, CrowdFlower includes a gold unit for which we have indicated the correct answer. Setting an assignment to include 15 tweets means that a worker is presented with a gold unit within each new batch of 15 tweets.
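
To make the mechanics concrete, here is a minimal sketch of how such assignments might be assembled, with one gold unit hidden in every batch of 15 tweets. The function name and data structures are hypothetical; CrowdFlower handles this behind the scenes, so this is only an illustration of the idea.

```python
import random

def build_assignments(tweets, gold_units, batch_size=15):
    """Sketch: split regular tweets into batches and slip one gold unit
    (a tweet whose correct answer we already know) into each batch
    at a random position."""
    tweets = list(tweets)
    random.shuffle(tweets)
    step = batch_size - 1  # 14 regular tweets, leaving room for 1 gold unit
    assignments = []
    for i in range(0, len(tweets), step):
        batch = tweets[i:i + step]
        batch.insert(random.randrange(len(batch) + 1), random.choice(gold_units))
        assignments.append(batch)
    return assignments
```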

Where it gets interesting is what happens when a worker records an answer for a gold unit that we deem to be incorrect. After submitting their assignment of, say, 15 tweets, they are presented with a screen that shows them the text of the gold unit and the correct answer. For each gold unit, we prepare a short statement of feedback to the worker who made an incorrect judgment. Here’s an example of a judgment that differed from what we believe the correct answer to be (note that we’ve removed user names, as well as mentions of other Twitter users, in case these might bias a judgment about expressed emotions):

[Image: example gold unit, a tweet about very hot weather after school, with the correct answer marked as negative]

Our feedback to any workers who recorded an answer other than negative for this tweet was “Person appears not to like it when it is very hot after school.” What if the worker doesn’t agree with our answer for a gold unit? CrowdFlower allows a worker to contest a judgment, and we’re able to forgive a worker if we messed up on assigning a gold answer, or for any other reason.

Every time a worker gets a gold unit wrong, they lose some credibility within the overall job. If they miss too many gold units, they are removed from the job altogether. Importantly, CrowdFlower uses these credibility scores to compute an overall confidence score for the ensemble of judgments on a particular unit. So, if we require five trusted judgments per tweet, we will receive (and pay for) judgments from five workers who have answered a sufficient number of gold units correctly. We can access the individual judgments as well as the most common judgment for that tweet. Say the most common emotion for a tweet was “negative,” and that judgment was made by three out of five workers; we’d be starting with a confidence score of 0.6. However, if one of the workers who submitted an answer other than negative had missed some golds, that would be taken into account when estimating the overall confidence: all other things being equal, the confidence score would increase beyond 0.6. In our estimation, this confidence score turns out to be quite useful (more on that below).
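
To illustrate why the score moves, here is a minimal sketch of trust-weighted voting. We haven’t seen CrowdFlower’s actual formula; the sketch simply assumes each worker’s vote is weighted by a trust score (say, their accuracy on gold) and shows how a low-trust dissenter pushes the confidence in the majority answer above 0.6.

```python
from collections import defaultdict

def weighted_confidence(judgments):
    """Combine per-worker trust into a confidence score for the top answer.

    judgments: list of (answer, trust) pairs, where trust is assumed to be
    the fraction of gold units that worker has answered correctly (0..1).
    """
    totals = defaultdict(float)
    for answer, trust in judgments:
        totals[answer] += trust
    top_answer = max(totals, key=totals.get)
    confidence = totals[top_answer] / sum(totals.values())
    return top_answer, confidence

# Five workers judge one tweet; three say "negative".
# With equal trust, the confidence works out to 3/5 = 0.6 ...
print(weighted_confidence([("negative", 0.9), ("negative", 0.9), ("negative", 0.9),
                           ("positive", 0.9), ("neutral", 0.9)]))
# ... but if a dissenting worker has missed several golds (lower trust),
# the confidence in "negative" rises above 0.6 (here to roughly 0.675).
print(weighted_confidence([("negative", 0.9), ("negative", 0.9), ("negative", 0.9),
                           ("positive", 0.4), ("neutral", 0.9)]))
```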

Gold also provides a great feedback mechanism when setting up a new job in CrowdFlower. If lots of workers are getting the gold wrong, then chances are that there are problems with the gold. Perhaps the units identified as gold have debatable answers, or they are otherwise confusing. This happened in my first attempt to get the crowd to judge emotions about weather conditions. When picking gold initially, I assigned the majority answer from our research team of seven. This meant that members of our research team might have gotten some of the gold units incorrect. It made more sense to pick only units for which our team had unanimously agreed on an answer. This seemed to work considerably better.

Interestingly, CrowdFlower can also “front-load” gold in a job, so that workers see a string of gold units before they are given non-gold units. This has the potential to serve as a good training mechanism, assuming the workers remain engaged through the training session, during which I assume they are not being paid. We may explore this in the near future.

Running small batches and comparing data to our research team’s results. When first starting out with a new survey design (i.e., the questions we ask of the CrowdFlower workers), it is ideal to use a data set for which our research team has already provided answers. I did this with a batch of about 180 tweets. In the first run of the data, the agreement with our research team was quite good: the most common CrowdFlower answer agreed with our research team’s answer about 70% of the time. By comparing only units with a CrowdFlower confidence score of at least 60%, the agreement with our team increased to at least 80% and as high as 95%. I ran the same data set through several CrowdFlower jobs to get a measure of repeatability. The overlap in workers between the repeated jobs was low (only a single worker in common out of 20 or more for two jobs run only a few hours apart), and the results were very similar across three runs with identical data.
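
The comparison itself is easy to reproduce. The sketch below assumes two hypothetical dictionaries, one mapping tweet IDs to the crowd’s top answer and confidence score and one mapping them to our research team’s answer, and computes the agreement rate at a chosen confidence threshold.

```python
def agreement_rate(crowd, team, min_confidence=0.0):
    """Fraction of tweets where the crowd's top answer matches the team's,
    restricted to tweets at or above a confidence threshold.

    crowd: dict tweet_id -> (top_answer, confidence)
    team:  dict tweet_id -> answer
    """
    kept = [tid for tid, (_, conf) in crowd.items()
            if conf >= min_confidence and tid in team]
    if not kept:
        return None
    matches = sum(1 for tid in kept if crowd[tid][0] == team[tid])
    return matches / len(kept)

# Agreement over all ~180 tweets, then only over high-confidence ones:
# agreement_rate(crowd, team)       -> about 0.70 in the first run
# agreement_rate(crowd, team, 0.6)  -> rose to roughly 0.80-0.95
```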

Building up a database of units with known answers (gold). CrowdFlower recommends having something like 5-10% of the total units in a job be gold. For a job with tens of thousands of tweets, that is a daunting, if not infeasible, goal. They recognize this and have strategies for dealing with it. The concern with a very low proportion of gold is that a single worker will see the same gold units over and over. One strategy for preventing this is to limit the total number of units a particular worker can do. This solves the gold problem, but it can cause jobs to take a long time to finish because more workers are needed.

I was preparing to launch a job of 12,500 tweets but had only about 40 gold units (the tweets for which our internal team had 100% agreement). I used CrowdFlower’s strategy called “digging for gold” to build our database of gold. I ran the first 1,000 tweets of the larger 12,500-tweet job and then used CrowdFlower’s interface to identify potential gold units. They’ve thought this out very well: you are presented with a unit and the range of judgments from their workforce. The example below looks like an ideal gold unit because the CrowdFlower workers all agreed. After a gut-check about whether our research team would have answered the same way, I entered a response, through a field in the CrowdFlower interface, for workers who answer the new gold unit incorrectly. Using this method, I was able to rapidly grow our gold set to 100 units.

[Image: CrowdFlower’s “digging for gold” interface showing a candidate unit on which all workers agreed]
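
Conceptually, this screening step amounts to flagging units on which every trusted worker gave the same answer. The sketch below mirrors that idea using a hypothetical results structure; the gut-check and the feedback text for future workers still happen by hand in the CrowdFlower interface.

```python
def find_gold_candidates(results, min_judgments=5):
    """Sketch: return units where all trusted workers agreed, as candidates
    for promotion to gold (pending a human gut-check).

    results: dict unit_id -> list of answers from trusted workers
    """
    return {unit_id: answers[0]
            for unit_id, answers in results.items()
            if len(answers) >= min_judgments and len(set(answers)) == 1}
```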

Implementing ongoing quality control. The CrowdFlower team has created a powerful platform for achieving quality judgments for things like people’s emotions about the weather. Depending on one’s use for the data, I think it would be safe to stop here. However, we are giving a lot of thought to another layer of quality control, especially as we move to topics that might be more difficult than weather mood. I believe a rather straightforward approach would be to have our research team continually make judgments on a random sample of units. We are building a database and the capacity to move units back and forth to CrowdFlower jobs via their API. We are also transitioning to using CrowdFlower’s platform for our research team. So, it is easy to imagine setting up jobs for our research team that include a random sample of tweets that have already run through CrowdFlower’s worker channels, most likely limited to those with confidence scores of at least 60%. We could then compare the majority judgment from our team to that of the CrowdFlower workers on an ongoing basis, giving us a solid independent measure of quality.
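
A minimal sketch of that ongoing check, assuming we already hold the crowd’s results keyed by tweet ID: draw a random sample of high-confidence tweets for the team to re-judge, then compare the team’s majority answer to the crowd’s top answer. The function names and data structures are hypothetical.

```python
import random
from collections import Counter

def sample_for_review(crowd_results, min_confidence=0.6, k=100, seed=0):
    """Sketch: pick k random high-confidence units for internal re-judging.

    crowd_results: dict tweet_id -> (top_answer, confidence)
    """
    eligible = [tid for tid, (_, conf) in crowd_results.items()
                if conf >= min_confidence]
    return random.Random(seed).sample(eligible, min(k, len(eligible)))

def ongoing_agreement(crowd_results, team_judgments):
    """Sketch: fraction of re-judged tweets where the team's majority answer
    matches the crowd's top answer.

    team_judgments: dict tweet_id -> list of answers from our research team
    """
    sampled = [tid for tid in team_judgments if tid in crowd_results]
    if not sampled:
        return None
    matches = sum(1 for tid in sampled
                  if Counter(team_judgments[tid]).most_common(1)[0][0]
                  == crowd_results[tid][0])
    return matches / len(sampled)
```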

On a final note, we would be well-positioned to create new gold using this approach. For example, we could set up a CrowdFlower job for our research team that included only those tweets for which the team had 100% agreement, and possibly those that also had very high CrowdFlower confidence scores. The only task for our research team would then be to offer up a response for workers who miss the gold question. This streamlined process might enable us to create a lot of new gold on an ongoing basis. Now, wouldn’t it be interesting if we could use CrowdFlower workers to create responses for the gold units? The answer is probably yes, but it might take some creative thinking…


2 Responses

  1. Mar 14, 2011

    Some crowdsourcing websites have turned a blind eye to ensuring that participants contributing online have the necessary skills and drive to give their best. This often results in information that is useless and needs to be filtered before it can be considered useful for particular projects. Implementing quality control reins makes the whole process of crowdsourcing a lot easier and also adds credibility to the concept behind it.

    • Kent Cavender-Bares
      Mar 14, 2011

      This is a great point, and I couldn’t agree more. There is certainly tons of potential power when tapping the crowd, but there needs to be some filtering mechanism when doing things at any sort of scale. Thanks for commenting!
