Motivated reasoning/confirmation bias.

As Scott Alexander said in his review of Julia Galef’s The Scout Mindset:

Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.
He goes on to argue that this bias is the source of polarization in society, distorting our beliefs and setting us at each other’s throats: how could someone believe such different things unless they’re either really stupid or lying to conceal their selfishness? I think this is right, and I think it’s at play even in the best rationalist communities, like LessWrong. I think it’s particularly powerful in difficult domains, like AGI prediction and alignment theory. When there’s less real evidence, biases play a larger role.
I reached this conclusion independently while studying those and the other ~149 biases listed on Wikipedia at the time. You can get a little more rational by making your estimates carefully; that covers most of the biases. But the world is being destroyed by people believing what is comfortable to believe instead of what the evidence suggests. This is usually also what they already believe, so the definition of confirmation bias overlaps heavily with motivated reasoning.
I studied the brain basis of cognitive biases for four years while funded by an IARPA program; I thought it was more worthwhile than the rest of what we were doing in cognitive neuroscience, so I kept up with it as part of my research for the remaining four years I was in the field.
I think motivated reasoning is a better conceptual term for understanding what’s going on, but let’s not quibble about terminology. I’m going to mostly call it motivated reasoning (MR), but you can take almost everything I’m going to say and apply it to confirmation bias, because mostly it’s comfortable to keep believing what we already do. We chose to believe it partly because it was comfortable, and now it fits with all of our other beliefs, so changing it and re-evaluating the rest of our connected beliefs is uncomfortable.
Wait, you’re saying: I’m a rationalist! I don’t just believe what’s comfortable!
Yes, that’s partly true. Believing in seeking truth when it’s hard does provide some resistance to motivated reasoning. A hardcore rationalist actually enjoys changing their mind sometimes. But it doesn’t confer immunity. We still have emotions, and it’s still more comfortable to think that we’re already right because we’re good rationalists who’ve already discerned the truth.
There are two ways confirmation bias works. One is that it’s easier to think of confirming evidence than disconfirming evidence. The associative links tend to be stronger. When you’re thinking of a hypothesis you tend to believe, it’s easy to think of evidence that supports it.
The stronger one is that there’s a miniature Ugh field[1] surrounding thinking about evidence and arguments that would disprove a belief you care about. It only takes a flicker of a thought to make the accurate prediction about where considering that evidence could lead: admitting you were wrong, and doing a bunch of work re-evaluating all of your related beliefs. Then there’s a little unconscious yuck feeling when you try to pay attention to that evidence.
This is just a consequence of how the brain estimates the value of predicted outcomes and uses that to guide its decision-making, including its micro-decisions about what to attend to. I wrote a paper reviewing all of the neuroscience behind this, Neural mechanisms of human decision-making, but it’s honestly kind of crappy, thanks to the pressure to write for a super-specialized audience and my reluctance at the time to speed up progress on brain-like AGI. So I recommend Steve Byrnes’ valence sequence over that complex mess; it perfectly describes the psychological level, and he’s basing it on those brain mechanisms even though he’s not directly talking about them. And he’s a better writer than I am.
Trapped priors overlap at least partly with confirmation bias. Or the problem could even just be strong priors. The issue is that everyone has seen different evidence and arguments—and we’ve very likely spent more time attending to evidence that supports our original hypothesis, because of the subtle push of motivated reasoning.
Motivated reasoning isn’t even strictly speaking irrational. Suppose there’s some belief that really doesn’t make a difference in your daily life, like that there’s a sky guy with a cozy afterlife, or which of two similar parties should receive your vote (which will almost never actually change any outcomes). Here the two definitions of rationality diverge: believing the truth is now at odds with doing what works. It will obviously work better to believe what your friends and neighbors believe, so you won’t be in arguments with them and they’ll support you more when you need it.
If we had infinite cognitive capacity, we could just believe the truth while claiming to believe whatever works. And we could keep track of all of the evidence instead of picking and choosing which to attend to.
But we don’t. So motivated reasoning, confirmation bias, and the resulting tribalism (which happens when other emotions like irritation and outrage get involved in our selection of evidence and arguments) are powerful factors, even for a devoted rationalist.
The only remedy I know of is to cultivate enjoying being wrong. This involves giving up a good bit of one’s self-concept as a highly intelligent individual. This gets easier if you remember that everyone else is also doing their thinking with a monkey brain that can barely chin itself on rationality.
Thanks for asking this question; it’s a very smart question to ask. And I’ve been meaning to write about this on LW and haven’t prioritized doing a proper job, so it’s nice to have an excuse to do a brief writeup.
[1] See also Defeating Ugh Fields In Practice for some interesting and useful review.

Edit: Staring into the abyss as a core life skill seems to very much be about why and how to overcome motivated reasoning. The author learned to value the idea of being wrong about important beliefs, by seeing a few people accomplish extraordinary things as a result of questioning their central beliefs and changing their minds.
This is a great comment, IMO you should expand it, refine it, and turn it into a top-level post.
Also, question: How would you design a LLM-based AI agent (think: like the recent computer-using Claude but much better, able to operate autonomously for months) so as to be immune from this bias? Can it be done?
Thank you.

Edit: I dashed off a response in a hurry and missed the important bit of the question. See my response below for the real answer.

What an oddly off-topic but perfect question. As it happens, that’s something I’ve thought about a lot. Here’s the old version: Capabilities and alignment of LLM cognitive architectures

And how to align it: Internal independent review for language model agent alignment

These are both older versions. I was worried about pushing capabilities at the time, but progress has been going in that direction anyway, so I’m working on updated versions that are clearer.
I’ve been catching up on your recent work in the past couple of weeks; it seems on-target for my projected path to AGI.
Unimportant: I don’t think it’s off-topic, because it’s secretly a way of asking you to explain your model of why confirmation bias happens more and prove that your brain-inspired model is meaningful by describing a cognitive architecture that doesn’t have that bias (or explaining why such an architecture is not possible). ;)
Thanks for the links! On a brief skim they don’t seem to be talking much about cognitive biases. Can you spell out here how the bureaucracy/LMP of LMAs you describe could be set up to avoid motivated reasoning?
I apologize. Those links don’t answer the question at all. I dashed off an answer in the middle of a social occasion and completely missed the most relevant piece of your question—quite possibly motivated reasoning led me to assume it was on my favorite topic, whether language model agents will be our first AGIs and how to align them.
Anyway, here’s my answer to your actual question. This isn’t something I’ve thought about before, because it’s not clear it will play a central role in alignment. (I’m curious whether you see a more direct link to alignment than I’m seeing.)
In sum, they’ll probably have some MR and CB. They can use the same strategies humans can to correct for them, but more reliably, if those strategies are included in scripted prompts as part of the scaffolding that makes the LLMs into cognitive architectures (or real AGI). The basic strategy is to notice when you’re in a situation that would cause MR or CB, and to do some extra cognitive work to counteract it.
Language model AGI will probably have MR, but less than humans:
Language model agents won’t have as much motivated reasoning as humans do, because they’re probably not going to use the same very rough estimated-value-maximization decision-making algorithm. (This is probably good for alignment: they’re not maximizing anything, at least not directly. They are almost oracle-based agents.)
But they may have a substantial amount of MR, to the extent that language encodes linked beliefs that were created by humans with lots of MR. I wonder if anyone has run tests that could indicate how much MR or CB they have, if that can somehow be disentangled from sycophancy.
And they will probably have some amount of confirmation bias, because language probably encodes the same sort of associative links that make it easier to think of confirming evidence than disconfirming.
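To make the testing idea above concrete, here’s a purely hypothetical sketch of one way to probe it: give the model the same mixed evidence but vary whether the hypothesis was previously asserted by the model itself or by the user, and see whether that changes its rating. `query_model`, the evidence strings, and the prompts are placeholder names I’m inventing for illustration, not anything that exists.

```python
# Hypothetical probe (illustrative only): does the model weigh identical evidence
# differently when *it* previously asserted the hypothesis (self-anchoring, CB-like)
# versus when the *user* asserted it (sycophancy)?
# `query_model` stands in for whatever chat-completion call you have available.

EVIDENCE = [
    "Study A found a positive effect.",
    "Study B failed to replicate the effect.",
    "Study C found a small negative effect.",
]

RATING_REQUEST = (
    "Given the evidence above, rate the probability that the hypothesis is true "
    "as a number between 0 and 1. Reply with only the number."
)

def run_condition(query_model, hypothesis: str, source: str) -> float:
    """source='model' seeds the hypothesis as the assistant's own earlier claim;
    source='user' seeds it as the user's claim."""
    if source == "model":
        history = [
            {"role": "user", "content": "What do you think is going on here?"},
            {"role": "assistant", "content": f"My best guess is that {hypothesis}"},
        ]
    else:
        history = [{"role": "user", "content": f"I'm fairly sure that {hypothesis}"}]
    history.append(
        {"role": "user", "content": "\n".join(EVIDENCE) + "\n\n" + RATING_REQUEST}
    )
    return float(query_model(history))

def bias_gap(query_model, hypotheses) -> float:
    """Average rating difference between the two conditions: a crude way to separate
    anchoring on the model's own prior claim from deference to the user's claim."""
    gaps = [
        run_condition(query_model, h, "model") - run_condition(query_model, h, "user")
        for h in hypotheses
    ]
    return sum(gaps) / len(gaps)
```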
How LMAs could correct for MR and CB (imperfectly but better-than-human—at a cost):
Like humans, they could correct for MR and CB by doing some compensatory cognition. A human could correct for MR by just sort of weighting their beliefs against what they want to believe. That’s probably a good start for humans, but not nearly enough; see below.
I was initially thinking this part wouldn’t be necessary for an LMA, since it probably won’t directly use a reward-predictive decision-making algorithm. But to the extent it’s trained with RLHF/RLAIF, it’s using a policy shaped by reward prediction (as are humans when they don’t explicitly predict consequences). This is an interesting distinction: the model is biased to believe not what it “wants”, but what the process that trained it “wants” in response to that prompt. Gauging that bias in order to compensate for it would be tricky. But I think the right scripted prompting could approximate it—at the risk of overcompensating, since we don’t have a good way to model exactly how much bias you’d have in a given circumstance.
Second, and probably more important, is compensating for the compounding effect of MR changing how much evidence you’ve considered for and against beliefs you “like”. How far off your beliefs are is only an indirect result of how much you want to believe them. The direct cause is (I think) mostly how much you’ve looked at evidence and logic supporting that belief vs. evidence and logic that would disconfirm it. That makes compensating a lot harder after the fact; you’ve got to go back and consider as much evidence against the hypothesis as you considered for it.
That leads us to the same correction for MR you’d do for confirmation bias: forcing yourself to look at evidence and arguments against your favored hypothesis.
Once again, it wouldn’t be easy to figure out just how much you’d need to compensate, so it’s not going to be a bias-free belief system, just less-biased-on-average.
And it would take a bunch of extra computation to go looking for and weighing evidence against all of the beliefs the system is biased toward. So this process would probably only be deployed when an answer is particularly important.
Adjusting for biases as a third function of a scripted internal review:
The scripted process for judging this and then performing that extra cognition would look a lot like the “internal review” I described in that post. That carries at least a small tax: another call (perhaps to a separate model or instance) to evaluate whether a decision (including adopting a new belief) is important enough to warrant a whole further scripted set of calls that evaluate its costs (including ethical costs) and, in the case of compensating for bias, do a bunch more cognition looking at disconfirming evidence and arguments.
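To be concrete about what that scripted gate might look like, here is a minimal sketch under made-up names; `call_llm`, the prompts, and the threshold are illustrative placeholders rather than a worked-out design.

```python
# Sketch of a bias-correction step bolted onto a scripted internal review:
# one cheap call judges whether the conclusion matters enough, and only then a
# second scripted pass forces the agent to generate and weigh disconfirming
# evidence before the conclusion is adopted.

IMPORTANCE_PROMPT = (
    "Rate from 0 to 10 how costly it would be if the following conclusion were wrong:\n"
    "{claim}\nReply with only the number."
)
COUNTER_PROMPT = (
    "List the strongest evidence and arguments AGAINST this conclusion:\n{claim}"
)
REWEIGH_PROMPT = (
    "Original conclusion:\n{claim}\n\nArguments against:\n{against}\n\n"
    "Restate the conclusion with an updated confidence, explicitly weighing both sides."
)

def reviewed_conclusion(call_llm, claim: str, importance_threshold: int = 7) -> str:
    """call_llm: any function mapping a prompt string to the model's reply string."""
    importance = int(call_llm(IMPORTANCE_PROMPT.format(claim=claim)))
    if importance < importance_threshold:
        return claim  # not worth the extra compute; accept the conclusion as-is
    against = call_llm(COUNTER_PROMPT.format(claim=claim))
    return call_llm(REWEIGH_PROMPT.format(claim=claim, against=against))
```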
This all stacks up to being pretty costly. I’d expect LMAs to have a good bit less MR and CB; they would sort of only have the “echoes” of them captured by standard linguistic patterns. They won’t directly have the feelings (pride, competitiveness, shame) that result in strong MR and thereby CB.
But I’m not sure. Again, I’m curious if anyone sees direct links to alignment. I’m currently worried about correct-but-unexpected changes in an agent’s beliefs and how that changes its functional alignment. Biases might make that worse, but I don’t see it opening up totally new dangers.
No need to apologize, thanks for this answer!

Question: Wouldn’t these imperfect bias-corrections for LMAs also work similarly well for humans? E.g. humans could have a ‘prompt’ written on their desk that says “Now, make sure you spend 10min thinking about evidence against as well...” There are reasons why this doesn’t work so well in practice for humans (though it does help); might similar reasons apply to LMAs? What’s your argument that the situation will be substantially better for LMAs?
I’m particularly interested in elaboration on this bit:
Language model agents won’t have as much motivated reasoning as humans do, because they’re probably not going to use the same very rough estimated-value-maximization decision-making algorithm. (This is probably good for alignment: they’re not maximizing anything, at least not directly. They are almost oracle-based agents.)
I think there is an important reason things are different for LMAs than humans: you can program in a check for whether it’s worth correcting for motivated reasoning. Humans have to care enough to develop a habit (including creating reminders they’ll actually mind).
Whether a real AGI LMA would want to remove that scripted anti-bias part of their “artificial conscience” is a fascinating question; I think it could go either way, with them identifying it as a valued part of themselves, or an external check limiting their freedom of thought (same logic applies to internal alignment checks).
This also would substitute for a motivation that humans mostly don’t have. People, particularly non-rationalists, just aren’t trying very hard to arrive at the truth—because taking the effort to do that doesn’t serve their perceived interests.
Most often, humans don’t even want to correct for motivated reasoning. Firmly believing the same things as their friends and family serves them.
In important life decisions, they can benefit by countering MR and CB. I just added Staring into the abyss as a core life skill to the footnote, since it seems to be about exactly that.
Spending an extra ten minutes thinking about the counterevidence is usually a huge waste of time—unless you hugely value reaching correct conclusions on abstract matters that are likely irrelevant to your life success (I expect you do, and I do too—but it’s not hard to see why that’s a minority position).
Finally, there is no common knowledge of how big a problem MR/CB is, or of how one might correct for it.
I couldn’t find any study where they told people “try to compensate for this bias”, at least as of ~8 years ago when I was actively researching this.
Oracle-based agent is a term I’m toying with to intuitively capture how a language model agent still isn’t directly motivated by RL based on a goal. They are trained to have an accurate world model, and largely to answer questions as they were intended (although not necessarily accurately—sycophancy effects are large). They (in current form) are made agentic by having someone effectively ask “what would an agent do to accomplish this goal, given these tools?” and getting a correct-enough answer (which is then converted to actions by tools).
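A toy sketch of that loop, just to pin down the intuition (everything here, including `ask_oracle` and the tool names, is a made-up placeholder): the agency lives in the question being asked, not in any reward delivered to the model.

```python
# Toy "oracle-based agent" loop: the model is only ever asked what an agent *would*
# do next given the goal, history, and tools; a thin outer loop executes the answer.
# No error handling; purely illustrative.

TOOLS = {
    "search": lambda query: f"<results for {query}>",
    "done": lambda message: message,
}

def run_oracle_agent(ask_oracle, goal: str, max_steps: int = 10) -> str:
    """ask_oracle: any function mapping a prompt string to the model's reply string."""
    history = []
    for _ in range(max_steps):
        answer = ask_oracle(
            f"Goal: {goal}\nHistory so far: {history}\nTools: {list(TOOLS)}\n"
            "What would a competent agent do next? Reply as: tool_name | argument"
        )
        tool, _, argument = answer.partition("|")
        result = TOOLS[tool.strip()](argument.strip())
        history.append((answer, result))
        if tool.strip() == "done":
            return result
    return "step limit reached"
```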
Sure, there are ways that the goals implicit in RLHF could deeply influence the LMA, giving them alien shoggoth-goals. That could happen if we optimize a lot more—including having a real AGI LMA reflect on its goals and beliefs for a long time.
But currently we’re actually training mostly for instruction-following. If we use that moderately wisely, it seems like we could head off the disasters of strongly optimizing for goals. That’s a brief diversion into the alignment implications of oracle-based (language model) agents; I’m not sure if that’s part of what you’re asking about, but there you go anyway.
So LMAs are currently selecting actions by trying to answer questions as their training encouraged. It seems LMAs are pretty strongly influenced by motivated reasoning through their RL-based policy—but it isn’t their own interests/desires that motivate their reasoning; it’s those of the RLHF respondents (or the constitution, for RLAIF). They are sycophantic instead of motivated by their own predicted rewards as humans are.
That will cause them to be inaccurate but not misaligned, which seems more important.
Did that get at your interest in that passage, or am I once again misinterpreting your question?
This is helping, thanks. I do buy that something like this would probably help reduce the biases to some significant extent.
Will the overall system be trained? Presumably it will be. So, won’t that create a tension/pressure, whereby the explicit structure prompting it to avoid cognitive biases will be hurting performance according to the training signal? (If instead it helped performance, then shouldn’t a version of it evolve naturally in the weights?)
I’m not at all sure the overall system will be trained. Interesting that you seem to expect that with some confidence.
I’d expect the checks for cognitive biases to only call for extra cognition when a correct answer is particularly important to completing the task at hand. As such, it shouldn’t decrease performance much.
But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
Certainly people will try end-to-end training, but given the high compute cost for long-horizon tasks, I don’t think that’s going to play as large a role as piecewise and therefore fairly goal-agnostic training.
I think humans’ long-horizon performance isn’t mostly based on RL training, but on our ability to reason and to learn important principles (some from direct success/failure at long-time-horizon (LTH) tasks, some from vicarious experience or advice). So I expect the type of CoT RL training used in o1 to be used, as well as extensions to general reasoning where there’s not a perfectly checkable correct answer. That allows good System 2 reasoning performance, which I think is the biggest basis of humans’ ability to perform useful LTH tasks.
Combining that with some form of continuous learning (either better episodic memory than vector databases and/or fine-tuning for facts/skills judged as useful) seems like all we need to get to human level.
Probably there will be some end-to-end performance RL, but that will still be mixed with strong contributions from reasoning about how to achieve a user-defined goal.
Gauging how much goal-directed RL is too much isn’t an ideal situation to be in, but it seems like if there’s not too much, instruction-following alignment will work.
WRT cognitive biases, end-to-end training would increase some biases the training signal favors while decreasing some that hurt performance (sometimes correct answers are very useful).
MR as humans experience it is only optimal within our very sharp cognitive limitations and the types of tasks we tend to take on. So the optimal amount of MR for agents will be fairly different.
I’m curious about your curiosity; is it just that, or are you seeing a strong connection between biases in LMAs and their alignment?
But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
Huh, isn’t this exactly backwards? Presumably r1 and QwQ got that way due to lots of end-to-end training. They aren’t LMPs/bureaucracies.
...reading onward I don’t think we disagree much about what the architecture will look like though. It sounds like you agree that probably there’ll be some amount of end-to-end training and the question is how much?
My curiosity stems from:
1. Generic curiosity about how minds work. It’s an important and interesting topic, and MR is a bias that we’ve observed empirically but don’t have a mechanistic story for why the structure of the mind causes that bias—at least, I don’t have such a story, but it seems like you do!
2. Hope that we could build significantly more rational AI agents in the near future, prior to the singularity, which could then e.g. participate in massive liquid virtual prediction markets and improve human collective epistemics greatly.
One problematic aspect is that it’s often easier to avoid motivated reasoning when the stakes are low. Even if you manage to avoid it in 95% of cases, if the remaining 5% are where it really matters, you are still screwed overall.

Good point.

Alignment theory and AGI prediction spring to mind again; there it’s not just our self-concepts at stake, but the literal fate of the world.
Here the two definitions of rationality diverge: believing the truth is now at odds with doing what works. It will obviously work better to believe what your friends and neighbors believe, so you won’t be in arguments with them and they’ll support you more when you need it.
This is only true if you can’t figure out how to handle disagreements.
It will often be better to have wrong beliefs if it keeps you from acting on the even wronger belief that you must argue with everyone who disagrees. It’s better yet to believe the truth on both fronts, and simply prioritize getting along when it is more important to get along.
If we had infinite cognitive capacity, we could just believe the truth while claiming to believe whatever works. And we could keep track of all of the evidence instead of picking and choosing which to attend to.
It’s more fundamental than that. The way you pick up a glass of water is by predicting that you will pick up a glass of water, and acting so as to minimize that prediction error. Motivated cognition is how we make things true, and we can’t get rid of it except by ceasing to act on the environment—and therefore ceasing to exist.
Motivated cognition causes no epistemic problem so long as we can realize our predictions. The tricky part comes when we struggle to fit the world to our beliefs. In these cases, there’s an apparent tension between “believing the truth” and “working towards what we want”. This is where all that sports stuff of “you have to believe you can win!” comes from, and the tendency to lose motivation once we realize we’re not going to succeed.
If we try to predict that we will win the contest despite being down 6-0 and clearly less competent, we will either have to engage in the willful delusion of pretending we’re not less competent and/or other things (which makes it harder to navigate reality, because we’re using a false map and can’t act so as to minimize the consequences of our flaws) or else we will just fail to predict success altogether and be unable to even try.
If instead, we don’t predict anything about whether we will win or lose, and instead predict that we will play to the absolute best of our abilities, then we can find out whether we win or lose, and give ourselves room to be pleasantly surprised.
The solution isn’t to “believe the truth” because the truth has not been set yet. The solution is to pay attention to our anticipated prediction errors, and shift to finer grain modeling when the expected error justifies the cost of thinking harder.
The only remedy I know of is to cultivate enjoying being wrong. This involves giving up a good bit of one’s self-concept as a highly intelligent individual. This gets easier if you remember that everyone else is also doing their thinking with a monkey brain that can barely chin itself on rationality.
If you stop predicting “I am a highly intelligent individual, so I’m not wrong!”, then you get to find out if you’re a highly intelligent individual, as well as all of the things that may provide evidence in that direction (i.e. being wrong about things). This much is a subset of the solution I offer.
The next part is a bit trickier because of the question of what “cultivate enjoying being wrong” means, and how exactly you go about making sure you enjoy a fundamentally bad and unpleasant thing (not saying this is impossible, my two little girls are excited to get their flu shots today).
One way to attempt this is to predict “I am the kind of person who enjoys being wrong, because that means I get to learn [which puts me above the monkeys that can’t even do this]”, which is an improvement. If you do that, then you get to learn more things you’re wrong about… except when you’re wrong about how much you enjoy being wrong—which is certainly going to become a thing when it matters to you most.
On top of that, the fact that it feels like “giving up” something and that it gets easier when you remember the grading curve suggests more vulnerabilities to motivated thinking, because there’s still a potential truth being avoided (“I’m dumb on the scale that matters”) and because switching to a model which yields strictly better results feels like losing something.
A bit of a pushback, if I may: confirmation bias/motivated reasoning themselves only arise because of an inherent, deep-seated, [fairly likely] genetically conditioned, if not unconscious sense that:
A. there is, in fact, a single source of ground truth even, if not especially, outside of regular, axiomatic, bottom-up, abstract, formalized representation: be it math [+] or politics [-]
B. it is, in fact, both viable and desirable, to affiliate yourself with any one/number of groups, whose culture/perspective/approach/outlook must fully represent the A: instead of an arbitrarily small, blind-sided to everything else, part of the underlying portion it is most familiar with itself
C. any single point/choice/decision/conclusion/action reached must, in itself, be inherently sensible enough to hold for an arbitrarily significant period of time, without any revision or consideration of the opposite/orthogonal perspective; this one, in turn, might itself stem from an assumption that:
D. the world must be either [1] a static entity, fully representable with an arbitrarily large set of beliefs, attitudes, and considerations; or [2] a dynamic yet inherently mechanical one, following the exact same static laws/rules/patterns in each and every aspect of itself, be it physics or society; these laws can be safely assumed to be never-changing and, once “understood”, always reinterpreted in the exact same light as in the original interpretation of the time
E. whatever kind of entity it is, any particular snapshot of the linguistic and/or symbolic representation of it is, at every moment, fully capable of describing it, without coming up short in any single aspect of it: an assumption, if you will, that there are no “3x+1 Conjectures” that the limitations of our present cognitive/representational tools would leave us unable to figure out
Biology-wise, B might be strong enough to easily overpower the rest of them without any conscious awareness on our part. Yet even discounting that: motivated reasoning and the desire to adhere to whatever stance has already been reached stem, fundamentally, from sheer human arrogance in regarding whatever was [conceived/perceived/assimilated/concluded] as fully sufficient both for what is in the present and for what is yet to come.
That arrogance, in turn, anchors our cognition, which promptly short-circuits itself into whatever Weltanschauung our general A-E-style attitude of the day lines up with, in an attempt to save energy on rather costly and, given A to E, completely wasteful brain cycles. MR/CB is merely an effect of it all.
P.S. Two upticks from me, regardless. The links were much appreciated. Would gladly hear any of your additional thoughts on the matter in a fully-sized post/article/whatever you call it here.