The popular interpretation of the Wisdom of Crowds phenomenon is that each participant brings a certain amount of information to the result, and a certain amount of noise. Over a large enough sample size, the noise (divergent) cancels itself out, while the information converges on a value which, in the absence of systematic bias, should be proximate to the true value.
By restricting participants to two choices (one correct, one incorrect), all your noise is going to converge on the same answer.
That’s kind of Eliezer’s point when he talks about how astounding it is that human beings are unbiased estimators of beans in a jar. I’d agree that it’s astounding, but there are plenty of other statistical phenomena that astound me equally, so I’ve learned to not treat my level of astonishment as a precision tool for judging incredibility.
To some extent, I suspect the mechanism of estimation plays a significant role. I doubt very much that human beings have built-in heuristics for appraising large numbers of objects. Arithmetic is a fairly novel concept, evolutionarily speaking, and some cultures don’t even have the natural numbers.
So when we try and guess the number of beans in a jar, there’s presumably no single go-to mechanism we’re using to come up with that value. It will be some sort of aggregate of sources, such as our past experience of beans in jars, visualisations of what 200 or 400 or 600 beans all in one place might look like, or rough guesses of volume and packing density. It isn’t even necessarily a transparent process. If you try and make a rough estimate of something, aren’t you using some sort of basis for that? It’s not like the number just pops into your head. You wrestle with it for a little while.
Individual components of that estimation may be subject to bias in a given direction, but over enough sources, over enough people with many different estimation criteria, I wouldn’t trust there to necessarily be a demonstrable bias over repeated experiments without deliberate intervention on the part of the experimenter, such as using a container of an unusual shape that would result in a known overestimation of its volume.
Edit: I should also add an expectation of bias idiosyncratic to specific questions. For example, I think it was Yvain’s most recent LW membership poll that asked for the date Newton published his Philosophiæ Naturalis Principia Mathematica. If there was a widely-believed false date for this event, that would be an obvious source of noise that wouldn’t be cancelled out by corresponding noise on the other side of the true value.
According to a study cited in the Model Thinking class from Coursera.org, this is correct. Crowds which can be collectively characterized as a hedgehog do not have wisdom; crowds which are collectively foxes do have wisdom. The diversity of models is key.
Individual components of that estimation may be subject to bias in a given direction, but over enough sources, over enough people with many different estimation criteria, I wouldn’t trust there to necessarily be a demonstrable bias over repeated experiments without deliberate intervention on the part of the experimenter
This can be seen simply as a version of the central limit theorem: Any sum or average of samples from ANY distribution (with finite mean and standard deviation) will be approximately normally distributed (Gaussian) with the approximation better for larger samples. Neato!
I’d say it’s related to the central limit theorem, but would be cautious about equating the two. We would probably expect a roughly Gaussian distribution from a variable which is the sum of a lot of component parts (i.e. lots of different estimator methods), or a log-normal one if the parts multiply, but we wouldn’t necessarily expect the mean to coincide with the true value unless some of those estimator methods were reliable and they didn’t collectively skew the distribution in one direction.
(and nit-picking, it’s “a well-defined population mean and population standard deviation”, which is required for defining the distribution. If you can’t trust your sample mean and sample SD to approximate your population mean and SD, it’s no longer reliable, and you’d have to use something else, like a t-distribution)
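To make the distinction concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the true value of 500, the exponential (heavily skewed) noise, the crowd sizes and the 40-unit bias. The point is only that averaging tames the spread of the crowd means whether or not the guesses are biased, while the value they settle on is the true value plus whatever shared bias exists.

```python
import random
import statistics

def crowd_means(true_value, bias, crowd_size, n_crowds, seed=0):
    """Return the mean guess of each of n_crowds simulated crowds."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_crowds):
        # Exponential noise with mean 50 is strongly skewed;
        # subtracting 50 centres it on zero.
        guesses = [true_value + bias + rng.expovariate(1 / 50) - 50
                   for _ in range(crowd_size)]
        means.append(statistics.fmean(guesses))
    return means

# Hypothetical numbers: a true value of 500, with and without a shared bias of +40.
unbiased = crowd_means(true_value=500, bias=0, crowd_size=400, n_crowds=200)
biased = crowd_means(true_value=500, bias=40, crowd_size=400, n_crowds=200)

print("unbiased crowds centre on", round(statistics.fmean(unbiased)))
print("biased crowds centre on", round(statistics.fmean(biased)))
print("spread of crowd means:", round(statistics.stdev(unbiased), 1))
```

The crowd means are tightly bunched in both runs, but only the unbiased run bunches around the true value.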
By restricting participants to two choices (one correct, one incorrect), all your noise is going to converge on the same answer.
I’ve been pondering this since I first saw your post, but I still have no idea what you mean. Could you clarify?
The only interpretation that I can come up with is that if, say, the two options are 10 and 1 and the real answer is 9, you would expect that the average would approach 10 over time. I don’t see why this would be obvious or even true: if people’s guesses were distributed around 9, we could certainly have 10% of the population closer to 1 than 10, and so the average would converge to roughly 9 (0.9 × 10 + 0.1 × 1 = 9.1).
Let’s say you’re asking a thousand people to guess the date of the Battle of Bosworth Field. If I asked this right now on Less Wrong, I imagine it would receive some wildly different answers.
If you’re me, and you remember it because its anniversary is on your birthday (or if you were paying attention in a specific history class), you’ll know the exact year (1485). These people are probably not very numerous, but their answers will all coincide and converge. This subgroup would also have a variance of zero.
All the people who were paying only a little bit of attention in that history class, or watched the first series of Blackadder, will not know the exact date, but they’ll probably guess to within a few decades. This subgroup has a wider variance, but it’s still pretty tight, and they’re answering a convergent question. There’s a correct answer, and the answer these people give is informed by it, even if it’s not correct. In the absence of systematic bias, we would expect roughly the same number of people to answer 1480 as 1490, and so the mean of this group should converge.
We now look at a wider variance subgroup, which includes all the people who only have a sketchy idea of when this battle was and what it was about. Some people will recall it’s got something to do with the Tudor dynasty, and Henry VIII was early 16th century. Some will recall that there was a King Richard involved, and dig up a late 14th century connection. They are all contributing some information to proceedings, (14th-16th Century), but in the absence of systematic bias, we’d expect people to be as wrong on one side as they are on the other. Even greater variance subgroups, who aren’t sure whether this battle was fought by Romans or Crusaders or Confederates, are still contributing some small quantity of information by giving answers in the range of human history. No-one’s going to say 3991 AD, or 6,000,000 BC.
As the variance gets wider, the population of any given subgroup gets larger, but the coherence of their answers gets smaller. If you take a hundred people who have absolutely no knowledge of human history and ask them when the Battle of Bosworth Field occurred, you’re basically asking them to pick a number. Their answers aren’t going to converge on anything, so they won’t systematically interfere with the overall distribution, while the answers that are more informed will converge on the correct answer.
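The subgroup picture above can be sketched as a small simulation. The group sizes and spreads below are made-up illustrative values, and the sketch assumes every subgroup's errors are symmetric around 1485, which is exactly the "absence of systematic bias" condition rather than something the simulation demonstrates.

```python
import random
import statistics

TRUE_YEAR = 1485  # Battle of Bosworth Field

# (number of people, standard deviation of guesses in years): illustrative only.
SUBGROUPS = [
    (20, 0),     # know the exact year
    (180, 15),   # right to within a couple of decades
    (300, 100),  # vague Tudor / King Richard associations
    (500, 400),  # very little idea, errors assumed symmetric
]

def simulate_crowd(seed=1):
    """Draw one guess per person, each subgroup centred on the true year."""
    rng = random.Random(seed)
    guesses = []
    for size, sd in SUBGROUPS:
        guesses.extend(rng.gauss(TRUE_YEAR, sd) for _ in range(size))
    return guesses

guesses = simulate_crowd()
crowd_mean = statistics.fmean(guesses)
typical_error = statistics.fmean([abs(g - TRUE_YEAR) for g in guesses])

print(f"crowd mean: {crowd_mean:.0f}")
print(f"typical individual error: {typical_error:.0f} years")
```

Under these assumptions the crowd mean lands far closer to 1485 than the typical individual does, even though half the crowd is nearly guessing at random.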
But systematic bias does occur. American education on non-American history is notoriously sketchy. If our participants included a large number of Americans, they’re more likely to guess a date in American history through the availability heuristic. All of a sudden, the uninformed answers will start converging at some point in the late 19th Century, which will skew the overall distribution and pull the mean forward in time. The least wise parts of the crowd suddenly found a way to be a whole lot louder.
That’s what I meant by your noise converging on the same answer. In giving people an incorrect choice, you’re giving all the people who have no knowledge an opportunity to pick the same incorrect answer. If they didn’t have that answer to converge on, the mean of their answer wouldn’t be able to exert as much influence on the overall distribution.
Does that make sense?
(This also does point to an obvious source of systematic bias when dealing with dates: we have better records [and hence more available knowledge] of events closer to the present. History is lumpy, and forward-weighted, so any uninformed guess on the date of an event in the past is going to be distorted around points of greater historical interest, many of which occurred over the last century).
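A back-of-envelope version of the same point, with hypothetical numbers: half the crowd is informed and centred on 1485, and the other half either scatters evenly around the truth or clusters on a single salient year. The year 1865 is just a stand-in for "some date in American history", and the 50/50 split is an assumption.

```python
TRUE_YEAR = 1485

def expected_crowd_mean(groups):
    """groups: list of (share_of_crowd, centre_of_guesses); shares should sum to 1."""
    return sum(share * centre for share, centre in groups)

# Uninformed half scatters symmetrically around the truth: no net pull.
scattered = expected_crowd_mean([(0.5, TRUE_YEAR), (0.5, TRUE_YEAR)])

# Uninformed half clusters on one year (hypothetical): the mean is dragged towards it.
clustered = expected_crowd_mean([(0.5, TRUE_YEAR), (0.5, 1865)])

print(scattered)  # 1485.0
print(clustered)  # 1675.0
```

The informed half is identical in both cases; the only thing that changes is whether the noise has somewhere to converge.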
This seems like a round-about way to describe a bell curve...
But suppose in your example that we’re only asking those silly Americans, who, like myself, have only ever heard of the Battle of Bosworth as a name and really know nothing about it except maybe some English people were involved or something. And so let’s assume that people are guessing as a bell curve around 1600 with a large standard deviation of, say, 200 years or so. If the two options are 1600 and 1200, let’s say, then about 15.9% of the people will be guessing 1200 (i.e. they think it’s earlier than 1400) and the rest are guessing 1600. This averages out to roughly 1536 in the limit of large numbers.
So I guess I still don’t understand your point: it’s not converging to 1600 or anything like that. It is high, but there was a systematic bias towards being high, so what else would you expect? In this example (which was chosen arbitrarily) the two options gave a more correct response than the free guess. Of course, we can come up with options that would make the free response better: choosing between, say, 2600 and 1200 gives an average of 1293.
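For anyone who wants to check those figures, here is a small sketch of the calculation. It assumes private guesses follow a normal curve centred on 1600 with a standard deviation of 200 years, and that each person picks whichever offered option is nearer their private guess; the function names are mine.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, using the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def forced_choice_average(low, high, mean, sd):
    """Average answer when everyone picks the option nearer their private guess."""
    midpoint = (low + high) / 2
    share_low = normal_cdf(midpoint, mean, sd)  # fraction whose guess is below the midpoint
    return share_low * low + (1 - share_low) * high

near = forced_choice_average(1200, 1600, mean=1600, sd=200)
far = forced_choice_average(1200, 2600, mean=1600, sd=200)

print(round(near, 1))  # about 1536.5
print(round(far, 1))   # about 1293.5
```

Against a true year of 1485, the free guess (mean 1600) is off by 115 years, the first pair of options by about 52, and the second pair by about 191.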
It doesn’t have to be a Gaussian distribution. We would expect it to look like one under reasonably assumed conditions, but systematic bias would skew it. A particularly large single source (say there was a Battle of Dosworth Field that happened 400 years later) could easily result in a bimodal distribution.
In order for Wisdom of Crowds to work (as it’s expected to work), people aren’t guessing along a Gaussian distribution. They’re applying knowledge they have, and some of that knowledge is useful information, while some of that knowledge is noise. All the useful information pulls the mean towards the true value, while all the noise pulls it away. The difference is that the useful information converges on a single value, (because it’s a convergent problem with a single correct answer), while all the noise pulls arbitrarily in all directions.
Provided there isn’t some reason for the noise itself to converge on a single value (and I think this is where my previous comments have not necessarily been clear, I’m talking about the noise converging, not the overall mean), the noise should cancel itself out.
It should be obvious that if you give people a right answer and a wrong answer, the noise will be weighted in the direction of the wrong answer (because there’s no corresponding error on the other side of the true value). Even if you have two wrong answers on either side of a true value, and ask people to pick the one closest to the true value, you will still have a skew problem, because unless the two values are equidistant from the true value (which defeats the point of the question), your noise is not going to be equally distributed around the true value.
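A toy model of that skew, under loudly stated assumptions: a fixed share of the crowd is informed and always picks the option nearest the truth, and everyone else flips a coin between the two options. The 40% informed share and the option values are invented for illustration.

```python
def binary_crowd_mean(option_a, option_b, true_value, informed_share):
    """Expected mean answer: informed people pick the nearer option, the rest flip a coin."""
    closest = min((option_a, option_b), key=lambda option: abs(option - true_value))
    coin_flip_mean = (option_a + option_b) / 2  # where pure noise settles
    return informed_share * closest + (1 - informed_share) * coin_flip_mean

TRUE_YEAR = 1485

# One right answer, one wrong answer: all the noise sits on the wrong side.
one_right_one_wrong = binary_crowd_mean(1485, 1200, TRUE_YEAR, informed_share=0.4)

# Two wrong answers straddling the truth, but not equidistant from it.
both_wrong = binary_crowd_mean(1400, 1500, TRUE_YEAR, informed_share=0.4)

print(one_right_one_wrong)
print(both_wrong)
```

In both cases the coin-flippers settle on the midpoint of the two options rather than on the true value, so their contribution does not cancel unless the midpoint happens to be the truth.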
The popular interpretation of the Wisdom of Crowds phenomenon is that each participant brings a certain amount of information to the result, and a certain amount of noise. Over a large enough sample size, the noise (divergent) cancels itself out, while the information converges on a value which, in the absence of systematic bias, should be proximate to the true value.
By restricting participants to two choices (one correct, one incorrect), all your noise is going to converge on the same answer.
Is there a short explanation of why we should expect an absence of systematic bias?