Doing Important Research on Amazon’s Mechanical Turk?

Peter Wildeford 25 Sep 2013 17:04 UTC

12 points

There seem to be many important questions that need research, from the mundane (say, which of four slogans for 80,000 Hours people like best) to the interesting (say, how to convince people to donate more than they otherwise would). Unfortunately, it’s difficult to collect data in a quick, reliable, and affordable way. We generally lack access to easily surveyable populations, and a lot of research has high barriers to entry (such as needing to enroll in graduate school).

However, since the 2005 creation of Amazon’s Mechanical Turk, some of this has changed. Mechanical Turk is a website where anyone can create tasks for people to complete at a certain wage. These tasks can be anything, from identifying pictures to transcribing interviews to social science research.

Best of all, this is quick and cheap—for example, you could offer $0.25 to complete a short survey, put $75 in a pot, and get 300 responses within a day or two, and this should be quicker and cheaper than any other option available to you for collecting data.
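As a rough sketch of that arithmetic, here is a small Python example. It ignores any platform fees (the post does not mention them), and the function names and the 10% discard rate are illustrative assumptions rather than anything measured.

```python
def responses_for_budget(budget_cents: int, reward_cents: int) -> int:
    """Number of completed surveys a budget buys at a fixed reward, ignoring fees."""
    return budget_cents // reward_cents


def cost_per_usable_response(reward_cents: float, discard_rate: float) -> float:
    """Effective cost (in cents) per response kept after discarding a fraction."""
    if not 0 <= discard_rate < 1:
        raise ValueError("discard_rate must be in [0, 1)")
    return reward_cents / (1 - discard_rate)


# Figures from the post: a $75 pot at $0.25 per survey.
n_responses = responses_for_budget(7500, 25)
print(n_responses)  # 300

# Hypothetical: if 10% of responses had to be thrown out as low quality.
effective_cents = cost_per_usable_response(25, 0.10)
print(f"{effective_cents:.2f} cents per usable response")
```

The second function only matters if some responses end up being discarded; with a discard rate of zero it returns the reward unchanged.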

But could Mechanical Turk actually be useable for answering important questions? Could running studies on Mechanical Turk be a competitive use of altruistic funds?

What Questions Would We Be Interested in Asking?

There are a variety of questions we might be interested in that would be appropriate to ask via Mechanical Turk. I don’t believe you could run a longitudinal study there, so usefully testing the effects of vegetarian ads on diets wouldn’t be possible. But less conclusive diet studies could be run in this area.

Additionally, we could test to see how people respond to various marketing materials in EA space. We could explore how people think about charity and see what would make them more likely to donate and how changes in the marketing affect a willingness to donate. We could find out which arguments are more compelling. We could even test various memes against each other and see what people think of them.

Is Mechanical Turk a Reliable Source of Data?

It would only be good to use MTurk if the data you could get is useful. But is it?

Diversity of the Sample

The first question we might ask is whether MTurk produces a sample that is sufficiently diverse and representative of the United States. Unfortunately, this isn’t always the case for MTurk. In “Problems With Mechanical Turk Study Samples”, Dan Kahan noted that female populations can be overrepresented (as high as 62%), African Americans are underrepresented (5% in MTurk compared to 12% in the US), and conservatives are very underrepresented (53% liberal / 25% conservative in MTurk vs. 20% liberal and 40% conservative in real life).
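To make those gaps easier to compare, here is a minimal Python sketch that tabulates them. It uses only the percentages quoted above; the female share is left out because the post gives no US comparison figure for it, and the names `SHARES` and `representation_gaps` are made up for illustration.

```python
# Shares (in percent) quoted in the post from Kahan's write-up: (MTurk, US).
SHARES = {
    "African American": (5, 12),
    "liberal": (53, 20),
    "conservative": (25, 40),
}


def representation_gaps(shares):
    """Return {group: (percentage-point gap, MTurk share / US share)}."""
    return {
        group: (mturk - us, mturk / us)
        for group, (mturk, us) in shares.items()
    }


for group, (gap, ratio) in representation_gaps(SHARES).items():
    print(f"{group}: {gap:+d} points, {ratio:.2f}x the US share")
```

A ratio above 1 means the group is overrepresented on MTurk relative to the US figure, and a ratio below 1 means it is underrepresented.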

MTurkers are more likely to vote, and to vote for Obama. More concerning, Kahan also found respondents lie about their prior exposure to measures and even about whether they’re US citizens. Additionally, repeated exposure to standard survey questions can bias responses.

But is this really a problem? First, MTurk samples are still more diverse and representative than college student samples or other surveys conducted over the internet. Second, many important questions are about items that we wouldn’t expect to be influenced by demographics. So it’s quite possible that MTurk might be the best of all possible sources by enough to make it worth it.

Wage Sensitivity

Do you have to pay more for higher quality data? Possibly not. Buhrmester, Kwang, and Gosling and another analysis both found that varying pay between 2 cents and 50 cents didn’t affect the quality of data received in psychological studies, though higher pay does attract more participants and gets them faster.

Is Mechanical Turk a Competitive Use of Altruistic Funds?

It depends on the question being asked, how reliable the findings are, and how they’d be put into use.

Even though I don’t think MTurk could be used for veg flyers very well, it’s the best example I can think of right now: imagine that the current flyer converts 1% of people who read it to consider vegetarianism, but a different flyer might convert 1.05% of people. This means that every donation to veg ads now has approximately a 1.05x multiplier attached to it, because we can use the better flyer. If the MTurk study/studies to find this cost $1K, we would break even on this after distributing 100K flyers at 20 cents a flyer. I don’t know how many flyers are given out a year, so that may or may not be impressive, but the numbers are made up anyway.
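The break-even figure can be checked with a short Python sketch. It uses the same made-up numbers as the paragraph above; the function and parameter names are hypothetical, and it assumes (as the paragraph does) that the value of a flyer scales linearly with its conversion rate.

```python
def break_even_flyers(study_cost, cost_per_flyer, old_rate, new_rate):
    """Flyers to distribute before the study pays for itself.

    The better flyer makes each dollar worth (new_rate / old_rate) old-flyer
    dollars, so the gain per dollar spent is that ratio minus one.
    """
    if new_rate <= old_rate:
        raise ValueError("the new flyer must convert better than the old one")
    gain_per_dollar = new_rate / old_rate - 1
    break_even_spend = study_cost / gain_per_dollar
    return break_even_spend / cost_per_flyer


flyers = break_even_flyers(study_cost=1000, cost_per_flyer=0.20,
                           old_rate=0.01, new_rate=0.0105)
print(round(flyers))  # 100000
```

With a 5% relative improvement, every dollar spent on flyers gains about 5 cents of value, so a $1K study is recouped after roughly $20K of flyer spending, which is 100K flyers at 20 cents each.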

The bottom line is that studies may have strong compounding effects, which will almost always beat out the relatively linear increase in impact from a donation to something like AMF. But chances might be small that MTurk will produce something useful. Likewise, it’s possible that there are yet better settings for running these tests (like split testing current materials as they are being distributed, doing longer range and more reliable tracking of impact, etc.). But MTurk could be an interesting way to supplement existing research in a quick and cheap manner.

I think it’s worth thinking about further, even if I wouldn’t act on it yet.

(Also cross-posted on my blog.)


gwern 25 Sep 2013 18:20 UTC
18 points

Even though I don’t think MTurk could be used for veg flyers very well, it’s the best example I can think of right now: imagine that the current flyer converts 1% of people who read it to consider vegetarianism, but a different flyer might convert 1.05% of people.

Just a small quibble: in order to avoid misleading connotations and biasing intuitions, you should use more plausible numbers which are an order of magnitude (or more) smaller. You can barely get people to click on a banner ad 1% of the time, never mind rearrange their entire lives & forever give up a major source of pleasure & nutrition.
- Peter Wildeford 25 Sep 2013 20:50 UTC
  1 point
  Parent
  Funny you should say that because others have given me a minor quibble that the conversion number I provide is actually too small and prefer 2% or 3%.
  
  The current (admittedly terrible) studies suggest 2%. Is this wildly optimistic? Very probably, which is why future (less terrible) studies are being done. But it does give slightly more credence to my pick than a wildly lower pick.
  
  But this is irrelevant to the greater point of this essay. Instead, it would be worth trying MTurk enough to get actual numbers on its impact.
  - gwern 25 Sep 2013 21:19 UTC
    9 points
    Parent
    I’d disagree; the 2% only come from an absurd overreading:
    
    and 45 people reported, for example, that their chicken consumption decreased “slightly” or “significantly”.
    
    So in the context of winning a contest with clear demand expectations and going only on cheap talk, without any measure of persistency over time, you only get 2% by counting anyone who claimed to be affected however ‘slightly’? I think the more honest appraisal of that little experiment would be ‘0%’.
    - Peter Wildeford 26 Sep 2013 1:52 UTC
      2 points
      Parent
      I think debating the merits of this particular percentage is not relevant enough to my topic to discuss further here. If you think that I am in error (or, worse, actively trying to manipulate the data to make my case look good), we could continue this conversation via PM or on a more relevant thread.
  - gwern 17 Nov 2013 2:14 UTC
    0 points
    Parent
    For example, “Reading a book can change your mind, but only some changes last for a year: food attitude changes in readers of The Omnivore’s Dilemma” (Hormes et al. 2013) suggests pretty minimal attitude change from reading an entire (pretty good) book, which implies even less effect on actions.
  - Eugine_Nier 26 Sep 2013 19:32 UTC
    0 points
    Parent
    
    The current (admittedly terrible) studies suggest 2%. Is this wildly optimistic?
    
    How is this a good thing? If it were that easy to indoctrinate large numbers of people it would be scary.
    - Peter Wildeford 26 Sep 2013 20:07 UTC
      0 points
      Parent
      Well, you could at least indoctrinate people to better behavior from a utilitarian standpoint.
      
      Whether we can indoctrinate large numbers of people is a fact about the world, and we should believe what is correct. After we discover and accept the truth, we then can figure out how to work with it.
      - Eugine_Nier 26 Sep 2013 20:16 UTC
        1 point
        Parent
        
        Whether we can indoctrinate large numbers of people is a fact about the world, and we should believe what is correct.
        
        You’re the one who used the word “optimistic”.
LM7805 25 Sep 2013 17:47 UTC
8 points
One class of questions you didn’t bring up has to do with perceptions of risk. There was a poster at this year’s USENIX Security about a Mechanical Turk experiment that purported to be a Starbucks study evaluating the usability of a new method of accessing the wi-fi at Starbucks locations: click here to install this new root certificate! (Nearly ³⁄₄ of participants did so.) I can’t find the poster online, but this short paper accompanied it at SOUPS.
- Peter Wildeford 25 Sep 2013 20:44 UTC
  1 point
  Parent
  Risks, biases, and heuristics would be good to look into. One would have to be careful to avoid using measures that participants have prior exposure to, though (a problem that is unusually common in MTurk), so surveys probably would have to create unique measures.
  - LM7805 25 Sep 2013 21:55 UTC
    2 points
    Parent
    Most social-science studies are designed to elicit answers in such a way that the participant doesn’t realize what question is actually being asked. For example, when William Labov studied the distribution of rhoticity in spoken English, he asked people innocuous questions whose answers contained the sound /r/ in (phonetic) environments where rhoticity can occur. He’d go into a multi-story department store, look at the map, and ask an employee something along the lines of “Where can I find towels?” so that the person would answer “Those are on the fourth floor.” Similarly, the wi-fi study wasn’t looking at usability any more than Labov was interested in towels; they were really eliciting “willingness to do something dangerous” as a proxy for (lack of) risk awareness. As long as the measure is wearing unique clothing, participants shouldn’t be able to recognize it.
Vaniver 26 Sep 2013 1:05 UTC
5 points

Even though I don’t think MTurk could be used for veg flyers very well, it’s the best example I can think of right now: imagine that the current flyer converts 1% of people who read it to consider vegetarianism, but a different flyer might convert 1.05% of people.

A sizable percentage of MTurk labor is done by Indians, who are presumably vegetarian already, and thus a bad source for this sort of information. You might be able to restrict the nationality of your workers, but those sorts of worries seem potentially significant.
- Jayson_Virissimo 26 Sep 2013 2:40 UTC
  2 points
  Parent
  Why presume that Indians are vegetarian? That most Indians are vegetarian seems implausible to me (although, yes, I bet they are much more likely to be vegetarian than Americans). Care to give a citation?
  - Vaniver 26 Sep 2013 3:07 UTC
    7 points
    Parent
    
    Why presume that Indians are vegetarian?
    
    India has the lowest rate of meat consumption in the world. The plurality will eat meat, but some unknown percentage of that group will not prepare meat for consumption in their home, and only eat it when eating out.
    
    (I did overestimate the prevalence of vegetarianism in India- I thought it was over half- but by only about 2:1.)
- Peter Wildeford 26 Sep 2013 1:49 UTC
  1 point
  Parent
  Perhaps the best thing to do at first would just be a standard demographics survey to see what is there.
erratio 25 Sep 2013 19:04 UTC
5 points
I’ll try to dig up the paper later if I can, but if I remember correctly there’s a disturbingly large number of Turkers who use it as their primary source of income. Those people are not going to be anywhere close to your intended audience for EA stuff: people who are living well below the poverty line are going to think completely differently about helping others than people who are living comfortably and/or have disposable income.
Shmi 25 Sep 2013 18:03 UTC
5 points
I would expect the survey fillers to just turk out random checkmarks as fast as they can without reading the questions.
- Peter Wildeford 25 Sep 2013 20:45 UTC
  8 points
  Parent
  I’d expect this too. But my political science professor says this is surprisingly not the case to the extent you would think (I know this is an argument from authority, but I haven’t bothered to ask him for citations yet, and I do trust him).
  
  Moreover, this is something that can be controlled for.
  - ChristianKl 27 Sep 2013 11:02 UTC
    1 point
    Parent
    
    Moreover, this is something that can be controlled for.
    
    How would you practically go about controlling for it?
    - Peter Wildeford 29 Sep 2013 3:24 UTC
      2 points
      Parent
      A fourth way: include a reading passage and then, on a separate page, a question to test to see if they read the passage.
    - Peter Wildeford 27 Sep 2013 19:30 UTC
      2 points
      Parent
      Another thing you can do is put a timer in the survey that keeps track of how much time they spend on each question.
    - Peter Wildeford 27 Sep 2013 12:41 UTC
      1 point
      Parent
      Here’s one example:
      
      Q12: Thinking about the candidate that you read about, how relevant do you think the following considerations are to their judgment of right and wrong? (Pick a number on the 1-7 scale.)
      
      (a) Whether or not someone suffered emotionally. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      
      (b) Whether or not someone acted unfairly. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      
      (c) Whether or not someone’s action showed love for his or her country. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      
      (d) Whether or not someone did something disgusting. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      
      (e) Whether or not someone enjoyed apple juice. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      
      (f) Whether or not someone showed a lack of respect for authority. Not At All Relevant 1 2 3 4 5 6 7 Extremely Relevant
      - Lumifer 27 Sep 2013 16:26 UTC
        0 points
        Parent
        Looking at this specific example and imagining myself doing this for $1.50/hour or so (with the implication that my IQ isn’t anywhere close to three digits) -- I can’t possibly give true answers because the question is far too complicated and I can’t afford to spend ten minutes to figure it out. Even if I honestly want to not “cheat”.
        Peter Wildeford 27 Sep 2013 17:54 UTC
        2 points
        Parent
        Well, there are two reasons why that would be the case:
        
        1.) This question refers to a specific story that you would have read previously in the study.
        
        2.) The formatting here is jumbled text. The format of the actual survey includes radio buttons and is much nicer.
        Lumifer 27 Sep 2013 18:57 UTC
        2 points
        Parent
        Ah, no, let me clarify. It requires intellectual effort to untangle Q12 and understand what it is actually asking you. This is a function of the way it is formulated and has nothing to do with knowing the context or the lack of radio buttons.
        
        It is easy for high-IQ people to untangle such questions in their heads so they don’t pay much attention to this—it’s “easy”. It is hard for low-IQ people to do this, so unless there is incentive for them to actually take the time, spend the effort, and understand the question they are not going to do it.
        Peter Wildeford 27 Sep 2013 19:33 UTC
        2 points
        Parent
        It’s definitely a good idea to keep the questions simple and I’d plan on paying attention to that. But this question actually was used in an MTurk sample and it went ok.
        
        Regardless, even if the question itself is bad, the general point is that this is one way you can control for whether people are clicking randomly. Another way is to have an item and its inverse (“I consider myself an optimistic person” and later “I consider myself a pessimistic person”) and a third way is to run a timer in the questionnaire.
        Lumifer 27 Sep 2013 19:55 UTC
        0 points
        Parent
        
        and it went ok
        
        What does “went ok” mean and how do you know it?
        
        this is one way you can control for whether people are clicking randomly
        
        Let’s be more precise: this is one way you can estimate whether people (or scripts) are clicking randomly. This estimate should come with its own uncertainty (=error bars, more or less) which should be folded into the overall uncertainty of survey results.
        Peter Wildeford 27 Sep 2013 23:47 UTC
        2 points
        Parent
        
        What does “went ok” mean and how do you know it?
        
        Well, the results were consistent with the hypothesis, the distribution of responses didn’t look random, not too many people failed the “apple juice” question, and the timer data looked reasonable.
        
        ~
        
        this is one way you can estimate whether people (or scripts) are clicking randomly.
        
        That’s generally what I meant by “control”. But at that point, we might just be nitpicking about words.
        Lumifer 30 Sep 2013 16:00 UTC
        0 points
        Parent
        
        we might just be nitpicking about words
        
        Possibly, though I have in mind a difference in meaning or, perhaps, attitude. “Can control” implies to me that you think you can reduce this issue to irrelevance, it will not affect the results. “Will estimate” implies that this is another source of uncertainty, you’ll try to get a handle on it but still it will add to the total uncertainty of the final outcome.
        Eugine_Nier 28 Sep 2013 14:13 UTC
        −2 points
        Parent
        
        Well, the results were consistent with the hypothesis, the distribution of responses didn’t look random, not too many people failed the “apple juice” question, and the timer data looked reasonable.
        
        Well, the most obvious misinterpretations of the question will also result in people not failing the “apple juice” question.
      - ChristianKl 27 Sep 2013 13:13 UTC
        0 points
        Parent
        What cutoff criteria would you use with those questions to avoid cherry picking of data?
        Peter Wildeford 27 Sep 2013 17:56 UTC
        2 points
        Parent
        You check to make sure that “Whether or not someone enjoyed apple juice” is put at 1 or 2 or you throw out the participant. Otherwise, you keep the response.
        
        There are a few other tactics. Another one is to have a question like “I consider myself optimistic” and then later have a question “I consider myself pessimistic” and you check to see if the answers are in an inverse relationship.
        Lumifer 27 Sep 2013 20:08 UTC
        0 points
        Parent
        
        Another one is to have a question like “I consider myself optimistic” and then later have a question “I consider myself pessimistic” and you check to see if the answers are in an inverse relationship.
        
        And if they are, you mark the person as bipolar :-D
  - chaosmage 26 Sep 2013 16:53 UTC
    0 points
    Parent
    
    this is something that can be controlled for.
    
    If the controlling is effective, having to discard some of the answers still drives up the cost.
    - Peter Wildeford 26 Sep 2013 20:07 UTC
      1 point
      Parent
      Yes. But by too much to make it no longer worth doing? I don’t know.
- Lumifer 25 Sep 2013 18:50 UTC
  4 points
  Parent
  That.
  
  Plus, of course, there is huge selection bias. How many people with regular jobs, for example, do you think spend their evenings doing MTurk jobs?
  
  But yes, the real issue is that you’ll have a great deal of noise in your responses and you will have serious issues trying to filter it out.
  - Peter Wildeford 25 Sep 2013 20:46 UTC
    5 points
    Parent
    
    Plus, of course, there is huge selection bias. How many people with regular jobs, for example, do you think spend their evenings doing MTurk jobs?
    
    I discuss this in the “Diversity of the Sample” subsection of the “Is Mechanical Turk a Reliable Source of Data?” section.
    
    The question is not “is MTurk representative?” but rather “Is MTurk representative enough to be useful in answering the kinds of questions we want to answer and quicker / cheaper than our alternative sample sources?”.
    - Lumifer 26 Sep 2013 14:36 UTC
      0 points
      Parent
      The first question is “Can you trust the data coming out of MTurk surveys?”
      
      The paper which your link references is behind the paywall but it seems likely to me that they gathered the data on representativeness of MTurk workers through a survey of MTurk workers. Is there a reason to trust these numbers?
      - Peter Wildeford 26 Sep 2013 15:14 UTC
        1 point
        Parent
        
        The paper which your link references is behind the paywall
        
        Which one? I can make it publicly available.
        
        but it seems likely to me that they gathered the data on representativeness of MTurk workers through a survey of MTurk workers. Is there a reason to trust these numbers?
        
        You can compare the answers to other samples.
        
        Unless, of course, your concern is that the subjects are lying about their demographics, which is certainly possible. But then, it would be pretty amazing that this mix of lies and truths creates a believable sample. And what would be the motivation to lie about demographics? Would this motivation be any higher than other surveys? Do you doubt the demographics in non-MTurk samples?
        
        I actually do agree this is a risk, so we’d have to (1) maybe run a study first to gauge how often MTurkers lie, perhaps using the Marlowe-Crowne Social Desirability Inventory, and/or (2) look through MTurker forums to see if people talk about lying on demographics. (One demographic that is known to be fabricated fairly often is nationality, because many MTurk tasks are restricted to Americans.)
        
        Instead of dismissing MTurk based on expectations that it would be useless for research, I think it would be important to test it. After all, published social science has made use of MTurk samples, so we have some basis for expecting it to be at least worth testing to see if it’s legitimate.
        Lumifer 26 Sep 2013 15:27 UTC
        3 points
        Parent
        The paper I mean is that one: http://cpx.sagepub.com/content/early/2013/01/31/2167702612469015.abstract
        
        Unless, of course, your concern is that the subjects are lying about their demographics
        
        Yes. Or, rather, the subjects submit noise as data.
        
        Consider, e.g. a Vietnamese teenager who knows some English and has declared himself as an American to MTurk. He’ll fill out a survey because he’ll get paid for it, but there is zero incentive for him to give true answers (and some questions like “Did you vote for Obama?” are meaningless for him). The rational thing for him to do is to put checkmarks into boxes as quickly as he can without being obvious about his answers being random.
        
        Instead of dismissing MTurk based on expectations that it would be useless for research, I think it would be important to test it.
        
        I’ll rephrase this as “it would be useful and necessary to test it before we use MTurk samples for research”.
        Peter Wildeford 26 Sep 2013 20:12 UTC
        1 point
        Parent
        
        The paper I mean is that one: http://cpx.sagepub.com/content/early/2013/01/31/2167702612469015.abstract
        
        Here you go.
        
        ~
        
        The rational thing for him to do is to put checkmarks into boxes as quickly as he can without being obvious about his answers being random.
        
        This is a good point. You still would be able to match the resulting demographics to known trends and see how reliable your sample is, however. Random answers should show, either overtly on checks, or subtly through aggregate statistics.
        
        ~
        
        I’ll rephrase this as “it would be useful and necessary to test it before we use MTurk samples for research”.
        
        Definitely.
- satt 25 Sep 2013 23:23 UTC
  2 points
  Parent
  A survey designer could ask the same question in different ways, or ask questions with mutually exclusive answers, and then throw away responses with contradictory answers. (This isn’t a perfect cure but it can give an idea of which survey responses are just random marks and which aren’t.)
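The checks mentioned in this thread (a nonsense item such as the “apple juice” question, an item paired with its inverse, and a timer) could be combined roughly as in the Python sketch below. The field names, the 60-second minimum, and the tolerance of 2 scale points are all illustrative assumptions, not values anyone in the thread proposed. Following Lumifer’s point that this yields an estimate rather than a cure, the sketch also attaches an approximate 95% Wilson score interval to the share of flagged responses.

```python
import math


def is_suspect(resp, min_seconds=60, max_gap=2):
    """Flag a response that fails any of the three checks from the thread.

    `resp` is a dict with hypothetical keys: 'apple_juice', 'optimistic',
    'pessimistic' (all on a 1-7 scale) and 'seconds' (time spent on the survey).
    """
    # The nonsense item should be rated 1 or 2 by anyone who read it.
    failed_nonsense_item = resp["apple_juice"] > 2
    # Reverse-score the pessimism item so it should roughly match optimism.
    inconsistent = abs(resp["optimistic"] - (8 - resp["pessimistic"])) > max_gap
    too_fast = resp["seconds"] < min_seconds
    return failed_nonsense_item or inconsistent or too_fast


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for the share of suspect responses."""
    if n == 0:
        raise ValueError("need at least one response")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Made-up responses, purely to exercise the functions.
responses = [
    {"apple_juice": 1, "optimistic": 6, "pessimistic": 2, "seconds": 240},
    {"apple_juice": 5, "optimistic": 4, "pessimistic": 4, "seconds": 200},
    {"apple_juice": 2, "optimistic": 7, "pessimistic": 7, "seconds": 180},
    {"apple_juice": 1, "optimistic": 3, "pessimistic": 5, "seconds": 20},
    {"apple_juice": 1, "optimistic": 5, "pessimistic": 3, "seconds": 300},
]
kept = [r for r in responses if not is_suspect(r)]
n_suspect = len(responses) - len(kept)
low, high = wilson_interval(n_suspect, len(responses))
print(f"{n_suspect} of {len(responses)} flagged; "
      f"plausible share of suspect responses: {low:.2f} to {high:.2f}")
```

Fixing the thresholds before looking at the data is one way to address the cherry-picking worry raised above, and the width of the interval is a reminder that the filter itself adds uncertainty to the final results.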
Ishaan 26 Sep 2013 3:14 UTC
4 points
The psychology lab I’m working in currently collects data using Mechanical Turk.

If you’re planning on using this for published work, I’ve heard secondhand that IRB protocols can sometimes throw weird stuff at you. The system doesn’t always have well-oiled procedures for this sort of thing.

If you find mTurk’s format constraining you can use this to direct turkers to any website. I’ve found it useful.
Normal_Anomaly 27 Sep 2013 0:42 UTC
2 points
I did part of the design of a mechanical turk study on sound perception during my summer research job. I don’t know if we have any data yet, but I don’t recall any IRB weirdness.
- Peter Wildeford 27 Sep 2013 2:41 UTC
  2 points
  Parent
  My senior research project in Political Psychology is going to use MTurk.
ChristianKl 25 Sep 2013 21:10 UTC
1 point

Even though I don’t think MTurk could be used for veg flyers very well, it’s the best example I can think of right now: imagine that the current flyer converts 1% of people who read it to consider vegetarianism, but a different flyer might convert 1.05% of people. This means that every donation to veg ads now has approximately a 1.05x multiplier attached to it, because we can use the better flyer. If the MTurk study/studies to find this cost $1K, we would break even on this after distributing 100K flyers at 20 cents a flyer.

This completely discounts the value of convincing people on Mechanical Turk to switch to vegetarianism.

If a flyer can convert 1% of the people who read the flyer, maybe a well-designed survey that gets the participant to interact with it can convert more people.

If it costs $0.25 to get a survey completed and $0.20 to get a flyer delivered, I would expect that survey to be better value for the money.
- Peter Wildeford 25 Sep 2013 21:19 UTC
  4 points
  Parent
  
  This completely discounts the value of convincing people on Mechanical Turk to switch to vegetarianism.
  
  I had considered that, but it would be really difficult to tell whether you’ve convinced the actual MTurkers, so it would be hard to tell.
  
  You’d probably have to either (1) do the identical study in a different realm where you can make it a surprise longitudinal study and attempt to re-contact people in a few months or (2) figure out a way to surprise re-contact the MTurkers.
  
  But yeah, perhaps that ought to factor into the value consideration.
  
  Edit: Apparently it is possible, though difficult, to run a longitudinal study on MTurk.
  - ChristianKl 25 Sep 2013 21:23 UTC
    2 points
    Parent
    
    I had considered that, but it would be really difficult to tell whether you’ve convinced the actual MTurkers, so it would be hard to tell.
    
    If you don’t think you can know whether you convinced the actual MTurkers, how are you going to use the data from them to know which flyers are effective?
    - Peter Wildeford 26 Sep 2013 1:49 UTC
      1 point
      Parent
      Further tests of the changes in the field with longitudinal surveys.