Daniel Kokotajlo comments on Which Biases are most important to Overcome?

Daniel Kokotajlo 3 Dec 2024 22:38 UTC
6 points
0
no need to apologize, thanks for this answer!

Question: Wouldn’t these imperfect bias-corrections for LMA’s also work similarly well for humans? E.g. humans could have a ‘prompt’ written on their desk that says “Now, make sure you spend 10min thinking about evidence against as well...” There are reasons why this doesn’t work so well in practice for humans (though it does help); might similar reasons apply to LMAs? What’s your argument that the situation will be substantially better for LMAs?

I’m particularly interested in elaboration on this bit:
Language model agents won’t have as much motivated reasoning as humans do, because they’re not probably going to use the same very rough estimated-value-maximization decision-making algorithm. (this is probably good for alignment; they’re not maximizing anything, at least directly. They are almost oracle-based agents).
- Seth Herd 4 Dec 2024 1:17 UTC
  4 points
  0
  Parent
  I think there is an important reason things are different for LMAs than humans: you can program in a check for whether it’s worth correcting for motivated reasoning. Humans have to care enough to develop a habit (including creating reminders they’ll actually mind).
  Whether a real AGI LMA would want to remove that scripted anti-bias part of their “artificial conscience” is a fascinating question; I think it could go either way, with them identifying it as a valued part of themselves, or an external check limiting their freedom of thought (same logic applies to internal alignment checks).
  This also would substitute for a motivation that humans mostly don’t have. People, particularly non-rationalists, just aren’t trying very hard to arrive at the truth—because taking the effort to do that doesn’t serve their perceived interests.
  Most often, humans don’t even want to correct for motivated reasoning. Firmly believing the same things as their friends and family serves them.
  In important life decisions, they can benefit by countering MR and CB. I just added Staring into the abyss as a core life skill to the footnote, since it seems to be about exactly that.
  Spending an extra ten minutes thinking about the counterevidence is usually a huge waste of time—unless you hugely value reaching correct conclusions on abstract matters that are likely irrelevant to your life success (I expect you do, and I do too—but it’s not hard to see why that’s a minority position).
  Finally, there is no common knowledge of how big a problem MR/CB is, or how one might correct them.
  I couldn’t find any study where they told people “try to compensate for this bias”, at least as of ~8 years ago when I was actively researching this.
  Oracle-based agent is a term I’m toying with to intuitively capture how a language model agent still isn’t directly motivated by RL based on a goal. They are trained to have an accurate world model, and largely to answer questions as they were intended (although not necessarily accurately—sycophancy effects are large). They (in current form) are made agentic by having someone effectively ask “what would an agent do to accomplish this goal, given these tools?” and getting a correct-enough answer (which is then converted to actions by tools).
  Sure, there are ways that the goals implicit in RLHF could deeply influence the LMA, giving them alien shoggoth-goals. That could happen if we optimize a lot more—including having a real AGI LMA reflect on its goals and beliefs for a long time.
  But currently we’re actually training mostly for instruction-following. If we use that moderately wisely, it seems like we could head off the disasters of strongly optimizing for goals. That’s a brief diversion into the alignment implications of oracle-based (language model) agents; I’m not sure if that’s part of what you’re asking about, but there you go anyway.
  So LMAs are currently selecting actions by trying to answer questions as their training encouraged. It seems LMAs are pretty strongly influenced by motivated reasoning based on their RL-based policy—but it isn’t their interests/desires that motivate their reasoning, but that of the RLHF respondents (or the constitution for RLAIF). They are sycophantic instead of motivated by their own predicted rewards as humans are.
  That will cause them to be inaccurate but not misaligned, which seems more important.
  Did that get at your interest in that passage, or am I once again misinterpreting your question?
  - Daniel Kokotajlo 4 Dec 2024 1:28 UTC
    4 points
    0
    Parent
    This is helping, thanks. I do buy that something like this would help reduce the biases to some significant extent probably.
    
    Will the overall system be trained? Presumably it will be. So, won’t that create a tension/pressure, whereby the explicit structure prompting it to avoid cognitive biases will be hurting performance according to the training signal? (If instead it helped performance, then shouldn’t a version of it evolve naturally in the weights?)
    - Seth Herd 4 Dec 2024 1:57 UTC
      2 points
      0
      Parent
      I’m not at all sure the overall system will be trained. Interesting that you seem to expect that with some confidence.
      
      I’d expect the checks for cognitive biases to only call for extra cognition when a correct answer is particularly important to completing the task at hand. As such, it shouldn’t decrease performance much.
      
      But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
      
      Certainly people will try end-to-end training, but given the high compute cost for long-horizon tasks, I don’t think that’s going to play as large a role as piecewise and therefore fairly goal-agnostic training.
      
      I think humans’ long-horizon performance isn’t mostly based on RL training, but our ability to reason and to learn important principles (some from direct success/failure at LTH tasks, some from vicarious experience or advice). So I expect the type of CoT RL training used in o1 to be used, as well as extensions to general reasoning where there’s not a perfectly check-able correct answer. That allows good System 2 reasoning performance, which I think is the biggerst basis of humans’ ability to perform useful LTH tasks.
      
      Combining that with some form of continuous learning (either better episodic memory than vector databases and/or fine-tuning for facts/skills judged as useful) seems like all we need to get to human level.
      
      Probably there will be some end-to-end performance RL, but that will still be mixed with strong contributions from reasoning about how to achieve a user-defined goal.
      
      Gauging how much goal-directed RL is too much isn’t an ideal situation to be in, but it seems like if there’s not too much, instruction-following alignment will work.
      
      WRT to cognitive biases, end-to-end training would increase some desired biases while decreasing some that are hurting performance (sometimes correct answers are very useful).
      
      MR as humans experience it is only optimial within our very sharp cognitive limitations, and the types of tasks we tend to take on. So optimal MR for agents will be fairly different.
      
      I’m curious about your curiousity; is it just that, or are you seeing a strong connection between biases in LMAs and their alignment?
      - Daniel Kokotajlo 4 Dec 2024 18:05 UTC
        5 points
        0
        Parent
        But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
        Huh, isn’t this exactly backwards? Presumably r1 and QwQ got that way due to lots of end-to-end training. They aren’t LMPs/bureaucracies.
        
        ...reading onward I don’t think we disagree much about what the architecture will look like though. It sounds like you agree that probably there’ll be some amount of end-to-end training and the question is how much?
        
        My curiosity stems from:
        1. Generic curiosity about how minds work. It’s an important and interesting topic and MR is a bias that we’ve observed empirically but don’t have a mechanistic story for why the structure of the mind causes that bias—at least, I don’t have such a story but it seems like you do!
        2. Hope that we could build significantly more rational AI agents in the near future, prior to the singularity, which could then e.g. participate in massive liquid virtual prediction markets and improve human collective epistemics greatly.
        Seth Herd 5 Dec 2024 19:53 UTC
        2 points
        0
        Parent
        Ah- now now I see how we’re using end-to-end training a little differently.
        
        I’m thinking of end-to-end training as training a CoT, as in o1 etc, AND training an outer-loop control scheme for organizing complex plans and tasks. That outer-loop training is what I’m not sure will be helpful. You can script roughly how humans solve complex tasks, I think in a way that will help and not get in the way of the considerable power of the trained CoT. So I still expect this to play a role; I just don’t know how big a role it will play.
        
        I agree that training CoT will be done, and has important implications for alignment, particularly in making CoTs opaque or unfaithful.
        
        Reason 1: makes sense. One reason I never got around to publishing my work in comprehensible form is that it’s just adding mechanistic detail to an earlier essentially-correct theory that it’s just the result of applying self-interest (or equivalently, RL) to decisions about beliefs, in the context of our cognitive limitations. I’ll try to dig up the reference when I get a little time.
        
        Reason 2: Interesting. I do hope we can align early agents well enough to get nontrivial help from them in solving the remaining alignment problems, both for AGI agents and for the power structures that are deploying and controlling them. Biases in their thinking have been less of a concern for me than their capabilities, but they will play a role. Sycophancy in particular seems like the language model equivalent of human motivated reasoning, and it could really wreck agents’ usefulness in making their user’s/controllers actions more sane.