Grabbing A Random Sample from the Twitter River

April 13, 2011

As we prepare to report on weather mood and emotions about gas prices on an ongoing basis, we face an issue of scale. For example, we have been collecting weather-related tweets continuously for the past two weeks, using the keyword list described here, and we now have about 600,000 tweets with sufficient location information to be of interest to us. There are undoubtedly some duplicates in this batch, but even so, it represents a huge volume of tweets from which we need to extract a sense of the authors’ emotions, and it would simply be too costly to send all of them to our distributed workforce via CrowdFlower.
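As a rough sketch of the kind of filtering this collection involves (the field names and keyword check here are illustrative assumptions, not our actual pipeline), keeping only locatable keyword matches and screening out exact duplicates might look like this:

```python
def filter_tweets(raw_tweets, keywords):
    """Keep keyword-matching tweets that carry location info,
    dropping exact-duplicate texts (e.g. retweets)."""
    seen_texts = set()
    kept = []
    for t in raw_tweets:
        text = t.get("text", "").lower()
        if not any(kw in text for kw in keywords):
            continue
        if not t.get("location"):      # need location to map sentiment
            continue
        if text in seen_texts:         # crude duplicate screen
            continue
        seen_texts.add(text)
        kept.append(t)
    return kept
```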

Clearly, we need to develop a sampling strategy in which humans make judgments on a random, yet affordable, sample that is ultimately representative of the much larger volume. Our approach has been to analyze the large batch of tweets that we had coded for our first weather pilot.

Looking at several states and metropolitan areas with many hundreds of tweets in that earlier study, I extracted a series of successively smaller random samples. This allowed me to see how the resulting estimates of sentiment changed moving from larger to smaller random samples. Here are two examples, the first from the state of California, the second from New York City. The vertical axis shows the percent of all tweets in the sample for the emotion category indicated on the horizontal axis. Note that these emotion categories include one for people who are simply sharing information with no detectable emotion attached, and another for tweets whose emotion the crowd-sourced workers could not determine. The differently colored bars represent the different sample sizes, ranging from 25 to 1,349 for California.

[Figure: emotion-category percentages by sample size for California (left) and New York City (right)]

For California, the quality of the results dropped visibly going from a sample of 50 to one of 25. For New York City, the break was less clear-cut, but samples of 75 or fewer appear distinctly different from the larger samples.
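For concreteness, here is a minimal sketch of the subsampling experiment, assuming each coded tweet is a record with an "emotion" label; the function and field names are illustrative, not our actual code:

```python
import random
from collections import Counter

def emotion_percentages(coded_tweets, sample_sizes, seed=42):
    """Draw a random sample at each size and compute the percent of
    tweets falling in each emotion category."""
    rng = random.Random(seed)
    results = {}
    for n in sorted(sample_sizes, reverse=True):
        sample = rng.sample(coded_tweets, min(n, len(coded_tweets)))
        counts = Counter(t["emotion"] for t in sample)
        results[n] = {emo: 100.0 * c / len(sample)
                      for emo, c in counts.items()}
    return results

# e.g. emotion_percentages(ca_tweets, [1349, 500, 200, 100, 50, 25])
```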

The big difference between California and NYC is the relative proportion of positive and negative tweets. This means that, for a sample of the same size from both locations, there would be comparatively fewer positive tweets in the NYC sample. Looking at the data with this in mind, the critical factor in whether a random sample is representative appears to be how well the minor emotion is sampled. I’m calling the minor emotion the one that is expressed less often: “positive” for NYC, and probably also “positive” for California, though there the difference may not be significant.
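In code, picking out the minor emotion from a batch of coded counts is a one-liner, assuming the sentiment categories are named “positive” and “negative”:

```python
def minor_emotion(counts):
    """Return the less frequently expressed sentiment category,
    ignoring the 'information' and 'can't tell' codes."""
    return min(("positive", "negative"), key=lambda k: counts.get(k, 0))

# minor_emotion({"positive": 14, "negative": 61, "information": 25})
# -> "positive"
```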

My takeaway is that we need at least 10 tweets for the minor emotion to have a decent estimate of the emotions of the entire set of tweets within a given geographic area. Presumably, larger geographic areas may require somewhat larger minimums, especially if they contain a heterogeneous mix of emotions. Implementing this is a bit tricky, given that we need to deliver to CrowdFlower the most efficient number of tweets for coding: too many and the project costs more than necessary, too few and we will not be able to report sentiment values reliably. Our plan is to send 50 tweets per geographic unit for human coding, keeping an eye on the actual numbers of tweets that come back coded positive or negative. Over time, we should be able to fine-tune this sampling strategy into a robust, yet cost-effective, approach for ongoing estimates of sentiment.
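A sketch of that plan, with the 50-tweet batch size and the 10-tweet minor-emotion floor as configurable constants (the helper names are hypothetical, not an actual API):

```python
import random

SAMPLE_SIZE = 50    # tweets sent to CrowdFlower per geographic unit
MIN_MINOR = 10      # minimum coded tweets for the minor emotion

def draw_samples(tweets_by_geo, seed=0):
    """Randomly sample a fixed number of tweets from each geographic unit."""
    rng = random.Random(seed)
    return {geo: rng.sample(tweets, min(SAMPLE_SIZE, len(tweets)))
            for geo, tweets in tweets_by_geo.items()}

def undersampled(coded_counts):
    """Flag a unit whose minor emotion came back with too few tweets,
    signaling that its sample size should be increased next round."""
    minor = min(coded_counts.get("positive", 0),
                coded_counts.get("negative", 0))
    return minor < MIN_MINOR
```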

