Updating, part 1: When can you change your mind? The binary model
I was recently disturbed by my perception that, despite years of studying and debating probability problems, the LessWrong community as a whole has not markedly improved its ability to get the right answer on them.
I had expected that people would read posts and comments by other people, and take special note of comments by people who had a prior history of being right, and thereby improve their own accuracy.
But can that possibly work? How can someone who isn’t already highly accurate identify other people who are highly accurate?
Aumann’s agreement theorem (allegedly) says that Bayesians with the same priors agree. But it doesn’t say that doing so helps. Under what circumstances does revising your opinions, by updating in response to people you consider reliable, actually improve your accuracy?
To find out, I built a model of updating in response to the opinions of others. It did, eventually, show that Bayesians improve their collective opinions by updating in response to the opinions of other Bayesians. But this turns out not to depend on them satisfying the conditions of Aumann’s theorem, or on doing Bayesian updating. It depends only on a very simple condition, established at the start of the simulation. Can you guess what it is?
I’ll write another post describing and explaining the results if this post receives a karma score over 10.
That’s getting a bit ahead of ourselves, though. This post models only non-Bayesians, and the results are very different.
Here’s the model:
There are G people in a group such as LessWrong.
There are N problems being discussed simultaneously.
Problems are binary problems, with an answer of either 1 or 0.
Each person’s opinion on each problem is always known to all people.
Each person i has an accuracy: Their probability p_i of getting any arbitrary problem correct on the first guess.
g_ivt is what person i believes at time t is the answer to problem v (1 or 0).
p_ij expresses person i’s estimate of the probability that an arbitrary belief of person j is correct.
Without loss of generality, assume the correct answer to every problem is 1.
# Loop over T timesteps
For t = 0 to T-1 {
    # Loop over G people
    For i = 0 to G-1 {
        # Loop over N problems
        For v = 0 to N-1 {
            If (t == 0)
                # Special initialization for the first timestep
                If (random in [0..1] < p_i) g_ivt := 1; Else g_ivt := 0
            Else {
                # Product over all j of the probability that the answer to v is 1 given j’s answer and estimated accuracy
                m1 := ∏_j [ p_ij g_jv(t-1) + (1-p_ij)(1-g_jv(t-1)) ]
                # Product over all j of the probability that the answer to v is 0 given j’s answer and estimated accuracy
                m0 := ∏_j [ p_ij (1-g_jv(t-1)) + (1-p_ij) g_jv(t-1) ]
                p1 := m1 / (m0 + m1) # Normalize
                If (p1 > .5) g_ivt := 1; Else g_ivt := 0
            }
        }
        # Loop over G other people
        For j = 0 to G-1
            # Compute person i’s estimate of person j’s accuracy
            p_ij := { Σ_{s in [0..t]} Σ_{v in [s..N]} [ g_ivt g_jvs + (1-g_ivt)(1-g_jvs) ] } / N
    }
}
p1 is the probability that agent i assigns to problem v having the answer 1. Each term p_ij g_jv(t-1) + (1-p_ij)(1-g_jv(t-1)) is the probability of problem v having answer 1 computed using agent j’s beliefs, by adding either the probability that j is correct (if j believes it has answer 1), or the probability that j is wrong (if j believes it has answer 0). Agent i assumes that everyone’s opinions are independent, and multiplies all these probabilities together. The result, m1, is very small when there are very many agents (m1 is on the order of .5^G), so it is normalized by computing a similar product m0 for the probability that v has answer 0, and setting p1 = m1 / (m0 + m1).
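The update for a single problem can be sketched in Python as follows. This is a minimal illustration of the two products and the normalization, not the Perl script itself; the function name and argument names are made up for the example.

```python
from math import prod

def prob_answer_is_one(estimates, beliefs):
    """Agent i's probability that a problem's answer is 1.

    estimates[j] is p_ij, agent i's estimate of agent j's accuracy.
    beliefs[j] is agent j's belief (0 or 1) about the problem at time t-1.
    """
    # Each factor is p_ij if j said 1, and (1 - p_ij) if j said 0.
    m1 = prod(p * g + (1 - p) * (1 - g) for p, g in zip(estimates, beliefs))
    # Each factor is p_ij if j said 0, and (1 - p_ij) if j said 1.
    m0 = prod(p * (1 - g) + (1 - p) * g for p, g in zip(estimates, beliefs))
    # Note: m0 + m1 is zero only if two agents with estimated accuracy
    # of exactly 0 or 1 imply opposite answers; this sketch does not guard that case.
    return m1 / (m0 + m1)

# One trusted agent says 1, two mildly trusted agents say 0.
p1 = prob_answer_is_one([0.9, 0.6, 0.6], [1, 0, 0])
print(round(p1, 3))  # 0.8
```

Here m1 = 0.9 × 0.4 × 0.4 = 0.144 and m0 = 0.1 × 0.6 × 0.6 = 0.036, so p1 = 0.144 / 0.18 = 0.8 and the agent would adopt the answer 1.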
The sum of sums to compute p_ij (i’s opinion of j’s accuracy) computes the fraction of problems, summed over all previous time periods, on which person j has agreed with person i’s current opinions. It sums over previous time periods because otherwise, p_ii = 1. By summing over previous times, if person i ever changes its mind, that will decrease p_ii. (The inner sum starts from s instead of 0 to accommodate an addition to the model that I’ll make later, in which the true answer to problem t is revealed at the end of time t. Problems whose answer is public knowledge should not be considered in the sum after the time they became public knowledge.)
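A literal Python reading of that double sum might look like the sketch below. The names are illustrative, and it assumes problems are indexed 0 to N-1, so the inner range is taken as s to N-1. It divides by N exactly as the pseudocode does.

```python
def estimate_accuracy(g_i_now, history_j, n_problems):
    """Agent i's estimate of agent j's accuracy, as written in the pseudocode.

    g_i_now[v] is i's current belief (0 or 1) about problem v.
    history_j[s][v] is j's belief about problem v at time s, for s = 0..t.
    """
    total = 0
    for s, g_j_at_s in enumerate(history_j):
        # Problems with index below s are skipped: under the later addition
        # to the model, their answers are already public at time s.
        for v in range(s, n_problems):
            agree = g_i_now[v] * g_j_at_s[v] + (1 - g_i_now[v]) * (1 - g_j_at_s[v])
            total += agree
    return total / n_problems

# At t = 0 an agent compared with itself agrees on everything.
print(estimate_accuracy([1, 0, 1], [[1, 0, 1]], 3))        # 1.0
# Agreement on 3 of 4 problems at t = 0.
print(estimate_accuracy([1, 1, 0, 0], [[1, 0, 0, 0]], 4))  # 0.75
```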
Now, what distribution should we use for the p_i?
There is an infinite supply of problems. Many are so simple that everyone gets them right; many are so hard or incomprehensible that everyone performs randomly on them; and there are many, such as the Monty Hall problem, that most people get wrong because of systematic bias in our thinking. The range of population average performance p_ave on all possible problems thus falls within [0 .. 1].
I chose to model person accuracy instead of problem difficulty. I say “instead of”, because you can use either person accuracy or problem difficulty to set p_ave. Since a critical part of what we’re modeling is person i’s estimate of person j’s accuracy, person j should actually have an accuracy. I didn’t model problem difficulty partly because I assume we only talk about problems of a particular level of difficulty; partly because a person in this model can’t distinguish between “Most people disagree with me on this problem; therefore it is difficult” and “Most people disagree with me on this problem; therefore I was wrong about this problem”.
Because I assume we talk mainly about high-entropy problems, I set p_ave = .5. I do this by drawing p_i from [0 .. 1], with a normal distribution with a mean of .5, truncated at .05 and .95. (I used a standard deviation of .15; this isn’t important.)
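One way to draw such accuracies is sketched below. It assumes “truncated” means redrawing any value that falls outside [.05, .95] (rejection sampling) rather than clipping to the bounds; the post does not say which was used, and the function name is invented for the example.

```python
import random

def draw_accuracy(rng, mean=0.5, sd=0.15, lo=0.05, hi=0.95):
    """Draw one p_i from a normal distribution truncated to [lo, hi].

    Assumption: out-of-range draws are rejected and redrawn, which keeps
    the distribution symmetric around the mean.
    """
    while True:
        p = rng.gauss(mean, sd)
        if lo <= p <= hi:
            return p

rng = random.Random(0)  # fixed seed so the example is repeatable
accuracies = [draw_accuracy(rng) for _ in range(10000)]
print(min(accuracies) >= 0.05, max(accuracies) <= 0.95)
```

With a standard deviation of .15 the bounds sit three standard deviations from the mean, so very few draws are rejected and the choice between redrawing and clipping makes little practical difference.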
Because this distribution of p_i is symmetric around .5, there is no way to know whether you’re living in the world where the right answer is always 1, or where the right answer is always 0. This means there’s no way, under this model, for a person to know whether they’re a crackpot (usually wrong) or a genius (usually right).
Note that these agents don’t satisfy the preconditions for Aumann agreement, because they produce 0/1 decisions instead of probabilities, and because some agents are biased to perform worse than random. It’s worth studying non-Bayesian agents before moving on to a model satisfying the preconditions for the theorem, if only because there are so many of them in the real world.
An important property of this model is that, if person i is highly accurate, and knows it, p_ii will approach 1, greatly reducing the chance that person i will change their mind about any problem. Thus, the more accurate a person becomes, the less able they are to change their minds when they are wrong, and this is not an error. It’s a natural limit on the speed at which one can converge on truth.
An obvious problem is that at t=0, person i will see that it always agrees with itself, and set p_ii = 1. By induction, no one will ever change their mind. (I consider this evidence for the model, rather than against it.)
The question of how people ever change their mind is key to this whole study. I use one of these two additions to the model to let people change their mind:
At the end of each timestep t, the answer to problem number t becomes mutual knowledge to the entire group. (This solves the crackpot/genius problem.)
Each person has a maximum allowable p_ij (including p_ii).
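The effect of the second addition can be seen in a small sketch. The cap value of .9 and the function name are assumptions chosen for illustration; the post does not state what cap was used.

```python
from math import prod

def p1_with_cap(estimates, beliefs, cap=1.0):
    """Probability agent i assigns to answer 1, with every p_ij capped at `cap`."""
    capped = [min(p, cap) for p in estimates]
    m1 = prod(p * g + (1 - p) * (1 - g) for p, g in zip(capped, beliefs))
    m0 = prod(p * (1 - g) + (1 - p) * g for p, g in zip(capped, beliefs))
    return m1 / (m0 + m1)

# Agent i (first entry) believes 1 and has p_ii = 1.
# Four others believe 0, each with estimated accuracy 0.7.
estimates = [1.0, 0.7, 0.7, 0.7, 0.7]
beliefs = [1, 0, 0, 0, 0]

print(p1_with_cap(estimates, beliefs))                 # 1.0: i can never be moved
print(p1_with_cap(estimates, beliefs, cap=0.9) < 0.5)  # True: i now switches to 0
```

Without the cap, the factor (1 - p_ii) = 0 forces m0 to zero, so no amount of disagreement matters. With the cap, m1 = 0.9 × 0.3^4 ≈ 0.0073 and m0 = 0.1 × 0.7^4 ≈ 0.0240, so p1 ≈ 0.23.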
This model is difficult to solve analytically, so I wrote a Perl script to simulate it.
What do you think will happen when I run the program, or its variants?
What other variants would you like to see tested?
Is there a fundamental problem with the model?
I would like you to publish any results you may generate with your script, and promise to upvote them even if the results do not prove anything, as long as they are presented roughly as clearly as this post is.
So… why does this post have such a low rating? Comments? I find it bewildering. If you’re interested in LessWrong, you should be interested in finding out under what conditions people become less wrong.
Posts with a lot of math require me to set aside larger chunks of time to consume them. I do want to examine this but that won’t be possible until later this week, which means I don’t vote on it until then.
Thanks—good to know.
You haven’t shown that your experiment will do so. Nor have you shown that your experiment models the situation well.
What would it take to show that? It seems to me that isn’t a thing that I could “show”, even in theory, since I’ve found no existing empirical data on Aumann-agreement-type experiments in humans. If you know one, I’d appreciate a comment describing it.
I believe that one of the purposes of LessWrong is to help us gain an understanding of important epistemic issues. Proposing a new way to study the issue and potentially gain insight is therefore important.
I think that your standard implies that LessWrong is like a peer-reviewed journal: A place for people to present completed research programs; not a place for people to cooperate to find answers to difficult problems.
As I’ve said before, it’s not good to apply standards that penalize rigor. If the act of putting equations into a post means that each equation needs to be empirically validated in order to get an upvote, pretty soon nobody is going to put equations into their posts.
I’m perfectly happy to come back and vote this up after I am satisfied that it is good, and I haven’t and won’t vote it down. I think it’s a good idea to seek public comment, but the voting is supposed to indicate posts which are excellent for public consumption—this isn’t, unless it’s the technical first half of a pair of such posts. I want to know that the formalization parallels the reality, and it’s not clear that it does before it is run.
So, you don’t want to vote until you see the results; and I don’t want to waste an entire day writing up the results if few people are interested. Is there a general solution to this general problem?
(The “Part 1” in the title was supposed to indicate that it is the first part of a multi-part post.)
If you are confident in the practical value of your results, I would recommend posting. Otherwise I can’t help you.
I held off on rating the post because I only skimmed it, saw that most of it described an algorithm/model, and decided I didn’t have time to check your work. I might not be representative; I don’t rate most posts, and I’ve rated just 6 top-level posts so far this May.
Hmm—I wish I could see whether I have few upvotes, or numerous upvotes and downvotes. They’d have very different implications for what I should do differently.
I’m rather tired of the Sleeping Beauty debate and so didn’t read it. If others have had the same reaction this might explain the low score.
Thanks for answering. This isn’t a continuation of the Sleeping Beauty debate, despite what you see in the comment section, which has been hijacked by Sleeping Beauty.
One thing I think is missing from your model is correlation between different answers, and I think that this is actually essential to the phenomenon: ignoring it makes it look like people are failing to come to agreement at all, when what’s actually happening is that they’re aligning into various ideological groups.
That is, there’s a big difference between a group of 100 people with independent answers on 10 binary questions (random fair coinflips), and two groups of 50 who disagree on each of the 10 binary questions. I think that if you compared LW newcomers with veterans, you’d find that the newcomers more resemble the first case, and veterans more the second. This would suggest that people’s answers are becoming more internally coherent, at least.
In particular, I expect that on this subject the veterans split roughly as follows:
Those who subscribe to Bostrom’s SIA and are Thirders (1/3 to 1/2 of the LW vets)
Those who subscribe to Bostrom’s SSA and are Halfers (less than 1/4)
Those who reject Bostromian anthropic probabilities entirely (less than 1/4)
One can easily predict the responses of the first two groups on subsequent questions.
I don’t build a model by looking at the observed results of a phenomenon, and building in a special component to produce each observed result. You wouldn’t learn anything from your models if you did that; they would produce what you built them to produce. I build a model by enumerating the inputs, modeling each input, and seeing how much of the observed results the output matches.
When I run the simulation, people do in fact align into different groups. So far, always 2 groups. But the alignment process doesn’t give either group better overall accuracy. This shows that you don’t need any internal coherence or problem understanding for people to align into groups. Attributing accuracy to people who tend to agree with you, and inaccuracy to those who disagree with you, produces saddle-point dynamics. Once the initial random distribution gets off the saddle point, the groups on the opposite sides each rapidly converge to their own attractor.
What’s especially interesting is that this way of judging people’s accuracy doesn’t just cause different groups to converge to different points; it causes the groups to disagree with each other on every point. There isn’t one “right” group and one “wrong” group; there are two groups that are right about different things. Their agreement within a group on some topics indirectly causes them to take the opposite opinion on any topic on which other groups have strong opinions. In other words: My enemy’s belief P is evidence against P.
(Sleeping Beauty isn’t the subject of this post.)
OK, I see what you’re doing now. It’s an interesting model, though one feature jumps out at me now:
Although this phenomenon is a well-known fallacy among human beings, it doesn’t seem like it should be the rational behavior; and then I noticed that the probabilities p_i can be less than 1/2 in your model, and that some of your agents are in fact reliably anti-correct. This seems like a probable cause of a binary group split, if I’m understanding correctly.
What’s the result if you make the probabilities (and accordingly, people’s estimates of the probabilities) range from 1/2 to 1 instead of from 0 to 1?
Then everybody converges onto agreeing on the correct answer for every question. And you just answered the question as to why Bayesians should agree to agree: Because Bayesians can’t perform worse than random on average, their accuracies range from 1/2 to 1, and are not biased on any problem (unless the evidence is biased, in which case you’re screwed anyway). Averaging their opinions together will thus get the right answer to every (answerable) question. Congratulations! You win 1 Internet!
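The averaging claim can be checked with a quick sketch. This is plain unweighted majority voting over independent initial guesses, not the update rule from the post; the uniform distribution on [1/2, 1], the group size of 51, and the function name are all assumptions made for the example.

```python
import random

def majority_correct_rate(n_agents=51, n_problems=1000, seed=0):
    """Fraction of problems on which a simple majority vote is right.

    Assumes each agent's accuracy is drawn uniformly from [0.5, 1.0],
    guesses are independent, and the true answer is always 1.
    """
    rng = random.Random(seed)
    accuracies = [rng.uniform(0.5, 1.0) for _ in range(n_agents)]
    correct = 0
    for _ in range(n_problems):
        # Each agent votes 1 (the right answer) with probability p_i.
        votes_for_one = sum(rng.random() < p for p in accuracies)
        if votes_for_one > n_agents / 2:
            correct += 1
    return correct / n_problems

print(majority_correct_rate() > 0.99)  # True
```

With no agent worse than chance, errors are unsystematic and mostly cancel, which is the intuition behind the claim above. Drawing accuracies symmetrically around 1/2 removes that guarantee.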
(The reason for choosing 0 to 1 is explained in the post.)
The behavior in my model is rational if the results indicate that it gets the right answer. So far, it looks like it doesn’t.
You could probably get the same answer by having some problems, rather than agents, usually be answered wrong. An abundance of wrong answers makes the agents split. The agents don’t split into the correct agents and the incorrect agents, at least not for the conditions I’ve tested. There doubtless are settings that would get them to do that.
Does the 2-group split stay even if you continue the simulation until all answers have been revealed?
If you increase the standard deviation of p[i] so there are more very right and very wrong guessers, do they tend to split more into right and wrong groups? I expect they would.
Good question—no; revelation of answers eventually causes convergence into 1 group.
It makes the splitting happen faster.
It also didn’t get a lot of on-topic comments. Possibly because guessing the answers to your questions seems the wrong way to answer them—the correct way being to put it to the test with the program, which means rewriting it (wasteful) or waiting for you to post it.
Are you planning on posting the perl script? I’m a bit tempted to just translate what you’ve got in the post into python, but realistically I probably won’t get around to it anytime soon.
I think there’s a way to upload it to LessWrong and post a link to it. But I don’t know how. My email is at gmail.
Summarizing the results in the same post would result in a gigantic post that people wouldn’t want to read.
The code could be cleaner. Couldn’t
same givt gjvs
It would clean up the code a lot, and make it less of a hassle to read. I’d also prefer higher order functions to for loops, but that may just be me.
The code is written that way to accommodate the continuous case. I think people who aren’t C or assembly programmers will find the not(xor) more confusing; and people who are programmers will find the second unfamiliar.
I’m mainly saying the code is a bit opaque at the moment.
If you want to keep the continuous case, fine.
As long as you defined the same or similar function somewhere else, programmers would be fine.
Commenting the code would help people get to grips with it, if you don’t want to change it.
Good idea. Comments it is.
Re: “I had expected that people would read posts and comments by other people, and take special note of comments by people who had a prior history of being right, and thereby improve their own accuracy.”
FWIW, I think that was how I originally approached the problem. Rather than trying to solve it directly, I first looked at your response, and Robin Hanson’s response. After a little reflection, I concluded that you agreed with each other—and that you had both focused on the key issues—and got the answer right.
At that time, most of the rest of the thread was people saying the problem was ambiguous and needed a bet to clear it up—and a fair bit of confusion—with very little defense of the standard answer.
This is a really interesting topic; there are heaps of things I want to say about it. I was initially waiting to see what your results were first, to avoid spoilers with my guesses, but that’s no way to have a conversation.
First—I think there’s an error in the program: When you compute p[i][j] you take a sum then divide by N, but it looks like you should divide by the number of guesses you are adding, which can be more than N since it includes multiple rounds of guesses.
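The arithmetic behind this objection can be sketched as follows, assuming the inner sum runs over problems s to N-1 for each time s from 0 to t. The helper name is made up for the example.

```python
def n_terms(t, n_problems):
    """Number of agreement terms summed when computing p[i][j] at time t."""
    # At time s, problems 0..s-1 are excluded, leaving n_problems - s terms.
    return sum(max(n_problems - s, 0) for s in range(t + 1))

print(n_terms(0, 10))  # 10: at t = 0 the count equals N, so dividing by N is fine
print(n_terms(2, 10))  # 27: by t = 2, dividing by N = 10 allows values up to 2.7
```

Dividing by `n_terms(t, N)` instead of N would keep the estimate within [0, 1] at every timestep.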
My (inconsistent) thoughts about how the model would behave:
They’d quickly learn the ratio of correct initial guesses everyone had, and make near-perfect use of that information. But they don’t distinguish between the initial guesses and later updates, so that’s not right.
Even the bad guessers will get most of their updated estimates right by the end, so their opinions will be assumed to correlate with the truth. If you then went back and posed everyone a new question, all the bad guessers could significantly mislead everyone. That’s not the procedure in your code, but you could try it.
At the start of the simulation, all the guessers are simply seeing who else agrees with them. The good guessers might be converging to a correct consensus, while the bad guessers could converge to the opposite. But as the simulation progressed and the answers were revealed, the bad guessers would lose confidence in their whole subgroup, including themselves, and follow the good guessing group.
Ideas for variants:
Make the initial guess accuracy depend on both guesser accuracy and problem difficulty/deceptiveness. I proposed a formula for this in my previous comment. In this case, the best way to update from the initial guesses would seem to be to follow the average opinion of a few of the best guessers and maybe the reverse of the worst few guessers, but I’m not sure how it would play out in the simulation where you don’t know who they are, and you have to update on each other’s updated guesses.
Make the initial guess accuracy depend on both the skill of the guesser and the difficulty of the question, but vary what weight is given to skill—some questions can be just as hard for skilled guessers as everyone else. In this case, a way to update from an initial guess would be to look at enough of the best guessers that you’re confident which way they guess on average (you’d need to sample more if they are near 50%)
Repeat the exercise—after the first set of N answers are revealed, continue with N more questions. This time the guessers start with data about each other’s accuracy. Then after they are done, N more, etc.
Instead of everyone getting the same number of updates, let some update more often.
Instead of updating everyone and revealing one answer each round, randomly pick between updating a random person and randomly revealing a correct answer just to one person, which they will be certain of for the rest of the game. You could give different people different chances of updating from group opinions, and of getting the correct answer revealed. Since people don’t know who’s had what answers revealed they don’t stop counting them when evaluating each other’s accuracy.
I’ve already stated my position there, probably too many times.
Nevertheless, it was your position I couldn’t determine (for the amount of resources I was willing to invest).
I’m interested in hearing others’ responses to these questions:
As for this one:
As you know, that depends on what we want to use the model for. It ignores all sorts of structure in the real world, but that could end up being a feature rather than a bug.
I want to use it to try to get a grip on what conditions must be satisfied in order that people can expect to improve their accuracy on the problems discussed on LessWrong by participating in LessWrong; and whether accuracy can go to 1, or approaches a limit.
That reminds me; I need to add a sentence about the confidence effect.