But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
Huh, isn’t this exactly backwards? Presumably r1 and QwQ got that way due to lots of end-to-end training. They aren’t LMPs/bureaucracies.
...reading onward, I don’t think we disagree much about what the architecture will look like, though. It sounds like you agree that there will probably be some amount of end-to-end training, and the question is how much?
My curiosity stems from:
1. Generic curiosity about how minds work. It’s an important and interesting topic, and motivated reasoning (MR) is a bias we’ve observed empirically but for which we lack a mechanistic story of why the structure of the mind produces it. At least, I don’t have such a story, but it sounds like you do!
2. Hope that we could build significantly more rational AI agents in the near future, prior to the singularity, which could then e.g. participate in massive, liquid virtual prediction markets and greatly improve human collective epistemics.
Ah, now I see how we’re using “end-to-end training” a little differently.
I’m thinking of end-to-end training as training a CoT, as in o1 etc., AND training an outer-loop control scheme for organizing complex plans and tasks. That outer-loop training is what I’m not sure will be helpful. You can roughly script how humans solve complex tasks, in a way that I think will help rather than get in the way of the considerable power of the trained CoT. So I still expect that sort of scripting to play a role; I just don’t know how big a role it will play.
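To make the distinction concrete, here’s a minimal sketch of what I mean by a hand-scripted outer loop around a trained CoT reasoner. This is my own illustration, not anyone’s actual system: `reasoning_model` is a hypothetical stand-in for whatever trained reasoner (o1/r1/QwQ-style) you’d call, and the plan/execute/check structure is only a rough guess at how a human-like script might organize a complex task.

```python
# Hypothetical sketch: a hand-scripted outer loop around a trained CoT reasoner.
# Only the inner model calls would be end-to-end trained; the control flow here
# is fixed by the script rather than learned.

def reasoning_model(prompt: str) -> str:
    """Stand-in for a trained CoT model (o1/r1/QwQ-style); replace with a real call."""
    return f"[model output for: {prompt[:60]}...]"

def solve_task(task: str, max_steps: int = 5) -> str:
    # 1. Scripted planning step: ask the trained reasoner for a step-by-step plan.
    plan = reasoning_model(f"Break this task into at most {max_steps} concrete steps:\n{task}")
    results = []

    # 2. Scripted execution loop: one CoT call per step, with running context.
    for step_num in range(1, max_steps + 1):
        step_result = reasoning_model(
            f"Task: {task}\nPlan: {plan}\nCompleted so far: {results}\n"
            f"Carry out step {step_num} and report the result."
        )
        results.append(step_result)

        # 3. Scripted check: let the reasoner decide whether the task is done.
        done = reasoning_model(
            f"Task: {task}\nResults so far: {results}\nIs the task complete? Answer yes or no."
        )
        if done.strip().lower().startswith("yes"):
            break

    # 4. Scripted synthesis of a final answer from the step results.
    return reasoning_model(f"Task: {task}\nStep results: {results}\nWrite the final answer.")

if __name__ == "__main__":
    print(solve_task("Summarize the main disagreement in this comment thread."))
```

The question, on my framing, is how much of that scripted scaffolding survives versus gets replaced by training the whole loop end-to-end.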
I agree that CoT training will happen, and that it has important implications for alignment, particularly the risk of making CoTs opaque or unfaithful.
Reason 1: makes sense. One reason I never got around to publishing my work in comprehensible form is that it mostly adds mechanistic detail to an earlier, essentially correct theory: that motivated reasoning is the result of applying self-interest (or, equivalently, RL) to decisions about beliefs, in the context of our cognitive limitations. I’ll try to dig up the reference when I get a little time.
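For what it’s worth, here’s a toy illustration of how I read that framing; this is my own cartoon, not the cited theory’s actual model, and the names and numbers are made up. The idea is just that if belief choice is scored partly by the reward of holding the belief (self-interest) and partly by evidence, then limited evidence-tracking lets the reward term dominate, which looks like motivated reasoning.

```python
# Toy cartoon (my framing, not the referenced theory's model): belief choice as
# reward-weighted selection. When evidence is tracked weakly, the self-interest
# term dominates and the agent drifts toward the congenial conclusion.

def choose_belief(evidence_for: float, evidence_against: float,
                  reward_if_true: float, evidence_weight: float = 1.0) -> str:
    """Pick a belief by combining evidence with the reward of believing it.

    evidence_weight stands in for cognitive limitations: a low weight means
    evidence is tracked poorly relative to what the agent wants to be true.
    """
    score_true = evidence_weight * evidence_for + reward_if_true
    score_false = evidence_weight * evidence_against
    return "believe" if score_true > score_false else "disbelieve"

# Strong evidence tracking: evidence wins despite the congenial reward.
print(choose_belief(evidence_for=1.0, evidence_against=3.0,
                    reward_if_true=1.5, evidence_weight=2.0))  # disbelieve
# Weak evidence tracking: the reward term wins -- motivated reasoning.
print(choose_belief(evidence_for=1.0, evidence_against=3.0,
                    reward_if_true=1.5, evidence_weight=0.5))  # believe
```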
Reason 2: Interesting. I do hope we can align early agents well enough to get nontrivial help from them in solving the remaining alignment problems, both for AGI agents and for the power structures deploying and controlling them. Biases in their thinking have been less of a concern for me than their capabilities, but they will play a role. Sycophancy in particular seems like the language-model equivalent of human motivated reasoning, and it could really wreck agents’ usefulness in making their users’/controllers’ actions more sane.