Doing Important Research on Amazon’s Mechanical Turk?
There seem to be many important questions that need research, from the mundane (say, which of four slogans for 80,000 Hours people like best) to the interesting (say, how to convince people to donate more than they otherwise would). Unfortunately, it’s difficult to collect data in a quick, reliable, and affordable way. We generally lack access to easily surveyable populations, and much research has high barriers to entry (such as needing to enroll in graduate school).
However, since the 2005 creation of Amazon’s Mechanical Turk, some of this has changed. Mechanical Turk is a website where anyone can create tasks for people to complete at a certain wage. These tasks can be anything, from identifying pictures to transcribing interviews to social science research.
Best of all, this is quick and cheap. For example, you could offer $0.25 per completion of a short survey, put $75 in a pot, and get 300 responses within a day or two, which should be quicker and cheaper than any other data-collection option available to you.
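To make the budgeting arithmetic concrete, here is a minimal sketch. The fee rate is an assumption (Amazon charges a commission on top of the worker reward, and the exact percentage has varied over time), so check current pricing before relying on it.

```python
# Rough budgeting sketch for an MTurk survey. The fee rate below is an
# assumption, not a quote of Amazon's actual fee schedule.
reward_per_response = 0.25   # what each worker is paid, in dollars
target_responses = 300
mturk_fee_rate = 0.10        # hypothetical platform commission

worker_cost = reward_per_response * target_responses   # $75.00
total_cost = worker_cost * (1 + mturk_fee_rate)         # $82.50 under this assumption

print(f"Worker payments: ${worker_cost:.2f}")
print(f"Total including assumed {mturk_fee_rate:.0%} fee: ${total_cost:.2f}")
```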
But could Mechanical Turk actually be usable for answering important questions? Could running studies on Mechanical Turk be a competitive use of altruistic funds?
What Questions Would We Be Interested in Asking?
There are a variety of questions we might be interested in that would be appropriate to ask via Mechanical Turk. I don’t believe you could run a longitudinal study, so testing the effects of vegetarian ads on diets in a truly useful way wouldn’t be possible. But less conclusive diet studies could still be run in this area.
Additionally, we could test how people respond to various marketing materials in the EA space. We could explore how people think about charity, see what would make them more likely to donate, and see how changes in marketing affect willingness to donate. We could find out which arguments are most compelling. We could even test various memes against each other and see what people think of them.
Is Mechanical Turk a Reliable Source of Data?
MTurk is only worth using if the data you can get from it is useful. But is it?
Diversity of the Sample
The first question we might ask is whether MTurk produces a sample that is sufficiently diverse and representative of the United States. Unfortunately, this isn’t always the case. In “Problems With Mechanical Turk Study Samples”, Dan Kahan noted that women can be overrepresented (as high as 62%), African Americans are underrepresented (5% on MTurk compared to 12% in the US), and conservatives are heavily underrepresented (53% liberal / 25% conservative on MTurk vs. 20% liberal / 40% conservative in the general population).
MTurkers are also more likely to vote, and more likely to have voted for Obama. More concerning, Kahan found that respondents lie about their prior exposure to study measures and even about whether they’re US citizens. Additionally, repeated exposure to standard survey questions can bias responses.
But is this really a problem? First, MTurk samples are still more diverse and representative than college student samples or other surveys conducted over the internet. Second, many important questions concern things we wouldn’t expect to be influenced by demographics. So it’s quite possible that MTurk is still the best available source, and by a large enough margin to make it worth using.
Wage Sensitivity
Do you have to pay more for higher-quality data? Possibly not. Buhrmester, Kwang, and Gosling, as well as another analysis, found that varying pay between 2 cents and 50 cents didn’t affect the quality of data received in psychological studies, though higher pay does buy more participants and a higher rate of participation (i.e., you get more participants faster).
Is Mechanical Turk a Competitive Use of Altruistic Funds?
It depends on the question being asked, how reliable the findings are, and how they’d be put into use.
Even though I don’t think MTurk could be used for veg flyers very well, it’s the best example I can think of right now: imagine that the current flyer converts 1% of the people who read it to consider vegetarianism, but a different flyer would convert 1.05%. Every donation to veg ads would then have approximately a 1.05x multiplier attached to it, because we could use the better flyer. If the MTurk study (or studies) to find this cost $1K, we would break even after distributing 100K flyers at 20 cents a flyer. I don’t know how many flyers are given out a year, so that may or may not be impressive, but the numbers are made up anyway.
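To spell out that break-even arithmetic (using the same made-up numbers as above):

```python
# Back-of-the-envelope break-even for the (made-up) flyer example above.
baseline_rate = 0.010    # current flyer: 1% of readers consider vegetarianism
improved_rate = 0.0105   # hypothetical better flyer found via MTurk testing
cost_per_flyer = 0.20    # dollars
study_cost = 1_000       # dollars spent on the MTurk study

# At the baseline rate, producing one extra convert by printing more flyers
# costs $0.20 / 0.01 = $20.
cost_per_conversion = cost_per_flyer / baseline_rate
extra_converts_per_flyer = improved_rate - baseline_rate   # 0.0005

# Break even when the extra conversions, valued at the baseline cost of
# producing a conversion, add up to the study cost.
breakeven_flyers = study_cost / (extra_converts_per_flyer * cost_per_conversion)
print(f"Break-even after ~{breakeven_flyers:,.0f} flyers")  # ~100,000
```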
The bottom line is that studies may have strong compounding effects, which will almost always beat out the relatively linear increase in impact from a donation to something like AMF. But the chances that MTurk will produce something useful might be small. Likewise, there may be better settings for running these tests (like split testing current materials as they are being distributed, doing longer-range and more reliable tracking of impact, etc.). But MTurk could be an interesting way to supplement existing research quickly and cheaply.
I think it’s worth thinking about further, even if I wouldn’t act on it yet.
-
(Also cross-posted on my blog.)
Just a small quibble: in order to avoid misleading connotations and biasing intuitions, you should use more plausible numbers which are an order of magnitude (or more) smaller. You can barely get people to click on a banner ad 1% of the time, never mind rearrange their entire lives & forever give up a major source of pleasure & nutrition.
Funny you should say that, because others have raised the opposite quibble: that the conversion number I provide is actually too small, and that 2% or 3% would be better.
The current (admittedly terrible) studies suggest 2%. Is this wildly optimistic? Very probably, which is why future (less terrible) studies are being done. But it does give slightly more credence to my pick than a wildly lower pick.
But this is irrelevant to the greater point of this essay. Instead, it would be worth trying MTurk enough to get actual numbers on its impact.
I’d disagree; the 2% only comes from an absurd overreading:
So in the context of winning a contest with clear demand expectations and going only on cheap talk, without any measure of persistency over time, you only get 2% by counting anyone who claimed to be affected however ‘slightly’? I think the more honest appraisal of that little experiment would be ‘0%’.
I think debating the merits of this particular percentage is not relevant enough to my topic to discuss further here. If you think that I am in error (or, worse, actively trying to manipulate the data to make my case look good), we could continue this conversation via PM or on a more relevant thread.
For example, “Reading a book can change your mind, but only some changes last for a year: food attitude changes in readers of The Omnivore’s Dilemma” (Hormes et al. 2013) suggests pretty minimal attitude change from reading an entire (pretty good) book, which implies even less effect on actions.
How is this a good thing? If it were that easy to indoctrinate large numbers of people it would be scary.
Well, you could at least indoctrinate people to better behavior from a utilitarian standpoint.
Whether we can indoctrinate large numbers of people is a fact about the world, and we should believe what is correct. After we discover and accept the truth, we can then figure out how to work with it.
You’re the one who used the word “optimistic”.
One class of questions you didn’t bring up has to do with perceptions of risk. There was a poster at this year’s USENIX Security about a Mechanical Turk experiment that purported to be a Starbucks study evaluating the usability of a new method of accessing the wi-fi at Starbucks locations: click here to install this new root certificate! (Nearly 3⁄4 of participants did so.) I can’t find the poster online, but this short paper accompanied it at SOUPS.
Risks, biases, and heuristics would be good to look into. One would have to be careful to avoid using measures that participants have prior exposure to, though (a problem that is unusually common on MTurk), so surveys would probably have to create unique measures.
Most social-science studies are designed to elicit answers in such a way that the participant doesn’t realize what question is actually being asked. For example, when William Labov studied the distribution of rhoticity in spoken English, he asked people innocuous questions whose answers contained the sound /r/ in (phonetic) environments where rhoticity can occur. He’d go into a multi-story department store, look at the map, and ask an employee something along the lines of “Where can I find towels?” so that the person would answer “Those are on the fourth floor.” Similarly, the wi-fi study wasn’t looking at usability any more than Labov was interested in towels; they were really eliciting “willingness to do something dangerous” as a proxy for (lack of) risk awareness. As long as the measure is wearing unique clothing, participants shouldn’t be able to recognize it.
A sizable percentage of MTurk labor is done by Indians, who are presumably vegetarian already, and thus a bad source for this sort of information. You might be able to restrict the nationality of your workers, but those sorts of worries seem potentially significant.
Why presume that Indians are vegetarian? That most Indians are vegetarian seems implausible to me (although, yes, I bet they are much more likely to be vegetarian than Americans). Care to give a citation?
India has the lowest rate of meat consumption in the world. The plurality will eat meat, but some unknown percentage of that group will not prepare meat for consumption in their home, and only eat it when eating out.
(I did overestimate the prevalence of vegetarianism in India- I thought it was over half- but by only about 2:1.)
Perhaps the best thing to do at first would just be a standard demographics survey to see what is there.
I’ll try to dig up the paper later if I can, but if I remember correctly there’s a disturbingly large number of Turkers who use it as their primary source of income. Those people are not going to be anywhere close to your intended audience for EA stuff—people who are living well below the poverty line are going to think completely differently about helping others than people who are living comfortably and/or have disposable income.
I would expect the survey fillers to just turk out random checkmarks as fast as they can without reading the questions.
I’d expect this too. But my political science professor says this is surprisingly less of a problem than you would think. (I know this is an argument from authority, but I haven’t bothered to ask him for citations yet, and I do trust him.)
Moreover, this is something that can be controlled for.
How would you practically go about controlling for it?
A fourth way: include a reading passage and then, on a separate page, a question that tests whether they read the passage.
Another thing you can do is put a timer in the survey that keeps track of how much time they spend on each question.
Here’s one example:
Q12: Thinking about the candidate that you read about, how relevant do you think the following considerations are to their judgment of right and wrong? (Pick a number on a 1–7 scale, from 1 = Not At All Relevant to 7 = Extremely Relevant.)
(a) Whether or not someone suffered emotionally.
(b) Whether or not someone acted unfairly.
(c) Whether or not someone’s action showed love for his or her country.
(d) Whether or not someone did something disgusting.
(e) Whether or not someone enjoyed apple juice.
(f) Whether or not someone showed a lack of respect for authority.
Looking at this specific example and imagining myself doing this for $1.50/hour or so (with the implication that my IQ isn’t anywhere close to three digits) -- I can’t possibly give true answers because the question is far too complicated and I can’t afford to spend ten minutes to figure it out. Even if I honestly want to not “cheat”.
Well, there are two reasons why that would be the case:
1.) This question refers to a specific story that you would have read previously in the study.
2.) The formatting here is jumbled text. The format of the actual survey includes radio buttons and is much nicer.
Ah, no, let me clarify. It requires intellectual effort to untangle Q12 and understand what it is actually asking you. This is a function of the way it is formulated and has nothing to do with knowing the context or the lack of radio buttons.
It is easy for high-IQ people to untangle such questions in their heads so they don’t pay much attention to this—it’s “easy”. It is hard for low-IQ people to do this, so unless there is incentive for them to actually take the time, spend the effort, and understand the question they are not going to do it.
It’s definitely a good idea to keep the questions simple and I’d plan on paying attention to that. But this question actually was used in an MTurk sample and it went ok.
Regardless, even if the question itself is bad, the general point is that this is one way you can control for whether people are clicking randomly. Another way is to have an item and its inverse (“I consider myself an optimistic person” and, later, “I consider myself a pessimistic person”), and a third way is to run a timer in the questionnaire.
What does “went ok” mean and how do you know it?
Let’s be more precise: this is one way you can estimate whether people (or scripts) are clicking randomly. This estimate should come with its own uncertainty (=error bars, more or less) which should be folded into the overall uncertainty of survey results.
Well, the results were consistent with the hypothesis, the distribution of responses didn’t look random, not too many people failed the “apple juice” question, and the timer data looked reasonable.
~
That’s generally what I meant by “control”. But at that point, we might just be nitpicking about words.
Possibly, though I have in mind a difference in meaning or, perhaps, attitude. “Can control” implies to me that you think you can reduce this issue to irrelevance, it will not affect the results. “Will estimate” implies that this is another source of uncertainty, you’ll try to get a handle on it but still it will add to the total uncertainty of the final outcome.
Well, the most obvious misinterpretations of the question will also result in people not failing the “apple juice” question.
What cutoff criteria would you use with those questions to avoid cherry-picking the data?
You check to make sure that “Whether or not someone enjoyed apple juice” is put at 1 or 2 or you throw out the participant. Otherwise, you keep the response.
There are a few other tactics. Another one is to have a question like “I consider myself optimistic” and then later have a question “I consider myself pessimistic” and you check to see if the answers are in an inverse relationship.
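As a rough illustration of how this screening might be automated, here is a minimal sketch assuming the responses have been exported to a CSV. The column names, file name, and thresholds are hypothetical, not from any actual survey.

```python
# Minimal response-screening sketch. All column names and cutoffs below are
# illustrative assumptions, not a real survey's schema.
import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical export of the survey data

# 1. Catch item: "Whether or not someone enjoyed apple juice" should be 1 or 2.
passed_catch = df["apple_juice_relevance"] <= 2

# 2. Inverse items: "optimistic" and "pessimistic" (both on a 1-7 scale) should
#    be roughly inverse, i.e. sum to about 8; allow a tolerance of 2.
passed_inverse = (df["optimistic"] + df["pessimistic"] - 8).abs() <= 2

# 3. Timer: drop respondents who finished implausibly fast (threshold is a guess).
passed_timer = df["total_seconds"] >= 60

clean = df[passed_catch & passed_inverse & passed_timer]
print(f"Kept {len(clean)} of {len(df)} responses")
```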
And if they are, you mark the person as bipolar :-D
If the controlling is effective, having to discard some of the answers still drives up the cost.
Yes. But by too much to make it no longer worth doing? I don’t know.
That.
Plus, of course, there is huge selection bias. How many people with regular jobs, for example, do you think spend their evenings doing MTurk jobs?
But yes, the real issue is that you’ll have a great deal of noise in your responses and you will have serious issues trying to filter it out.
I discuss this in the “Diversity of the Sample” subsection of the “Is Mechanical Turk a Reliable Source of Data?” section.
The question is not “is MTurk representative?” but rather “Is MTurk representative enough to be useful in answering the kinds of questions we want to answer and quicker / cheaper than our alternative sample sources?”.
The first question is “Can you trust the data coming out of MTurk surveys?”
The paper which your link references is behind a paywall, but it seems likely to me that they gathered the data on the representativeness of MTurk workers through a survey of MTurk workers. Is there a reason to trust those numbers?
Which one? I can make it publicly available.
You can compare the answers to other samples.
Unless, of course, your concern is that the subjects are lying about their demographics, which is certainly possible. But then, it would be pretty amazing that this mix of lies and truths creates a believable sample. And what would be the motivation to lie about demographics? Would this motivation be any higher than other surveys? Do you doubt the demographics in non-MTurk samples?
I actually do agree this is a risk, so we’d have to (1) maybe run a study first to gauge how often MTurkers lie, perhaps using the Marlowe-Crowne Social Desirability Inventory, and/or (2) look through MTurker forums to see if people talk about lying on demographics. (One demographic that is known to be fabricated fairly often is nationality, because many MTurk tasks are restricted to Americans.)
Instead of dismissing MTurk based on expectations that it would be useless for research, I think it would be important to test it. After all, published social science has made use of MTurk samples, so we have some basis for expecting it to be at least worth testing to see if it’s legitimate.
The paper I mean is this one: http://cpx.sagepub.com/content/early/2013/01/31/2167702612469015.abstract
Yes. Or, rather, the subjects submit noise as data.
Consider, e.g. a Vietnamese teenager who knows some English and has declared himself as an American to MTurk. He’ll fill out a survey because he’ll get paid for it, but there is zero incentive for him to give true answers (and some questions like “Did you vote for Obama?” are meaningless for him). The rational thing for him to do is to put checkmarks into boxes as quickly as he can without being obvious about his answers being random.
I’ll rephrase this as “it would be useful and necessary to test it before we use MTurk samples for research”.
Here you go.
~
This is a good point. You still would be able to match the resulting demographics to known trends and see how reliable your sample is, however. Random answers should show up, either overtly on checks or subtly through aggregate statistics.
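One way to do that aggregate check is a goodness-of-fit test of the sample’s demographic counts against reference population proportions. A minimal sketch, with placeholder numbers rather than real census figures:

```python
# Sketch of an aggregate check: compare the sample's demographic split to
# reference population proportions with a chi-square goodness-of-fit test.
# The counts and proportions below are placeholders, not authoritative figures.
from scipy.stats import chisquare

observed_counts = [186, 15, 99]        # e.g. white, black, other in the sample
population_props = [0.72, 0.12, 0.16]  # assumed reference proportions
n = sum(observed_counts)
expected_counts = [p * n for p in population_props]

stat, p_value = chisquare(observed_counts, f_exp=expected_counts)
print(f"chi2={stat:.1f}, p={p_value:.3f}")  # a small p suggests the sample diverges
```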
~
Definitely.
A survey designer could ask the same question in different ways, or ask questions with mutually exclusive answers, and then throw away responses with contradictory answers. (This isn’t a perfect cure but it can give an idea of which survey responses are just random marks and which aren’t.)
The psychology lab I’m working in currently collects data using Mechanical Turk.
If you’re planning on using this for published work, I’ve heard secondhand that IRB protocols can sometimes throw weird stuff at you. The system doesn’t always have well-oiled procedures for this sort of thing.
If you find MTurk’s format constraining, you can use this to direct Turkers to any website. I’ve found it useful.
I did part of the design of a mechanical turk study on sound perception during my summer research job. I don’t know if we have any data yet, but I don’t recall any IRB weirdness.
My senior research project in Political Psychology is going to use MTurk.
This completely discounts the value of convincing people on Mechanical Turk to switch to vegetarianism.
If a flyer can convert 1% of the people who read it, maybe a well-designed survey that gets the participant to interact with it can convert more people.
If it costs $0.25 to get a survey completed and $0.20 to get a flyer delivered, I would expect the survey to be better value for the money.
I had considered that, but it would be really difficult to tell whether you’d actually convinced the MTurkers.
You’d probably have to either (1) do the identical study in a different realm where you can make it a surprise longitudinal study and attempt to re-contact people in a few months or (2) figure out a way to surprise re-contact the MTurkers.
But yeah, perhaps that ought to factor into the value consideration.
Edit: Apparently it is possible, though difficult, to run a longitudinal study on MTurk.
If you don’t think you can know whether you convinced the actual MTurkers, how are you going to use the data from them to know which flyers are effective?
Further tests of the changes in the field with longitudinal surveys.