This is helping, thanks. I do buy that something like this would help reduce the biases to some significant extent probably.
Will the overall system be trained? Presumably it will be. So, won’t that create a tension/pressure, whereby the explicit structure prompting it to avoid cognitive biases will be hurting performance according to the training signal? (If instead it helped performance, then shouldn’t a version of it evolve naturally in the weights?)
I’m not at all sure the overall system will be trained. Interesting that you seem to expect that with some confidence.
I’d expect the checks for cognitive biases to only call for extra cognition when a correct answer is particularly important to completing the task at hand. As such, it shouldn’t decrease performance much.
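The gating idea above could be sketched roughly as follows. This is a toy illustration, not a real agent API: every function name here (`estimate_criticality`, `cheap_answer`, `debiased_answer`) is a hypothetical stand-in for a model call.

```python
# Sketch of gated debiasing: only spend extra cognition on bias checks
# when a correct answer is particularly important to the task at hand.
# All names and the keyword heuristic below are purely illustrative.

def estimate_criticality(step: str) -> float:
    """Stand-in for a model call scoring how much this step's
    correctness matters to overall task success (0.0 to 1.0)."""
    return 0.9 if "irreversible" in step else 0.2

def cheap_answer(step: str) -> str:
    """Fast, single-pass response."""
    return f"fast answer for: {step}"

def debiased_answer(step: str) -> str:
    """Expensive pass; a real system might prompt for counterarguments,
    base rates, consider-the-opposite, etc."""
    return f"carefully checked answer for: {step}"

def act(step: str, threshold: float = 0.5) -> str:
    """Pay for the debiasing pass only above a criticality threshold."""
    if estimate_criticality(step) >= threshold:
        return debiased_answer(step)
    return cheap_answer(step)
```

Because the expensive path fires only on high-stakes steps, the average performance cost of the bias checks stays small, which is the point being made above.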
But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
Certainly people will try end-to-end training, but given the high compute cost for long-horizon tasks, I don’t think that’s going to play as large a role as piecewise and therefore fairly goal-agnostic training.
I think humans’ long-horizon performance isn’t mostly based on RL training, but on our ability to reason and to learn important principles (some from direct success/failure at long-time-horizon (LTH) tasks, some from vicarious experience or advice). So I expect the type of CoT RL training used in o1 to be used, as well as extensions to general reasoning where there isn’t a perfectly checkable correct answer. That allows good System 2 reasoning performance, which I think is the biggest basis of humans’ ability to perform useful LTH tasks.
Combining that with some form of continuous learning (either better episodic memory than vector databases and/or fine-tuning for facts/skills judged as useful) seems like all we need to get to human level.
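A minimal sketch of the episodic-memory half of that combination, with plain-Python cosine similarity standing in for a real embedding model and vector database (the class and its methods are hypothetical, not any existing library's API):

```python
# Toy episodic memory: store (embedding, text) episodes and recall the
# most similar ones for a new situation. Illustrative only; a real
# system would use learned embeddings and an approximate-NN index.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # list of (embedding, text) pairs

    def store(self, embedding, text):
        self.episodes.append((embedding, text))

    def recall(self, query_embedding, k=1):
        """Return the k stored episode texts most similar to the query."""
        ranked = sorted(self.episodes,
                        key=lambda ep: cosine(ep[0], query_embedding),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

The alternative mentioned above, fine-tuning on facts/skills judged useful, would instead write the retrieved lessons into the weights rather than re-reading them at inference time.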
Probably there will be some end-to-end performance RL, but that will still be mixed with strong contributions from reasoning about how to achieve a user-defined goal.
Gauging how much goal-directed RL is too much isn’t an ideal situation to be in, but it seems like if there’s not too much, instruction-following alignment will work.
WRT cognitive biases, end-to-end training would amplify some biases that the training signal favors while reducing some that hurt performance (sometimes correct answers are very useful).
Motivated reasoning (MR) as humans experience it is only optimal within our severe cognitive limitations and the types of tasks we tend to take on. So optimal MR for agents will be fairly different.
I’m curious about your curiosity; is it just that, or do you see a strong connection between biases in LMAs and their alignment?
But I’m really not sure that training the overall system end-to-end is going to play a role. The success and relatively faithful CoT from r1 and QwQ give me hope that end-to-end training won’t be very useful.
Huh, isn’t this exactly backwards? Presumably r1 and QwQ got that way due to lots of end-to-end training. They aren’t LMPs/bureaucracies.
...reading onward I don’t think we disagree much about what the architecture will look like though. It sounds like you agree that probably there’ll be some amount of end-to-end training and the question is how much?
My curiosity stems from:
1. Generic curiosity about how minds work. It’s an important and interesting topic, and MR is a bias that we’ve observed empirically but don’t have a mechanistic story for why the structure of the mind causes it (at least, I don’t have such a story, but it seems like you do!).
2. Hope that we could build significantly more rational AI agents in the near future, prior to the singularity, which could then e.g. participate in massive liquid virtual prediction markets and greatly improve human collective epistemics.