The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.
Isn’t there a similar argument for “plausible that we just keep scaling up networks, run pretraining + light RLHF, and then produce a schemer”? Namely, the way we train our kids seems pretty similar to “pretraining + light RLHF”, and we often do end up with scheming/deceptive kids. (I’m speaking partly from experience.) ETA: On second thought, maybe it’s not that similar? In any case, I’d be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.
Also, in this post you argue against several arguments for a high risk of scheming/deception from this kind of training, but I can’t find where you talk about why you think the risk is so low (“not plausible”). You just say ‘Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially’, but why is your prior for it so low? I would be interested in whatever reasons/explanations you can share. The same goes for others who have indicated agreement with Alex’s assessment of this particular risk being low.
I strongly disagree with the words “we train our kids”. I think kids learn via within-lifetime RL, where the reward function is installed by evolution inside the kid’s own brain. Parents and friends are characters in the kid’s training environment, but that’s very different from the way that “we train” a neural network, and very different from RLHF.
What does “Parents and friends are characters in the kid’s training environment” mean? Here’s an example. In principle, I could hire a bunch of human Go players on MTurk (for reward-shaping purposes we’ll include MTurkers ranging from some who have never played before all the way up to experts) and make a variant of AlphaZero that has no self-play at all and is instead trained 100% on play against humans, but is otherwise the same as the traditional AlphaZero. Then we can say “The MTurkers are part of the AlphaZero training environment”, but it would be very misleading to say “the MTurkers trained the AlphaZero model”. The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.
When you think “parents and friends are characters in the kid’s training environment”, I claim that this AlphaZero-MTurk mental image should be in your head just as much as the mental image of LLM-like self-supervised pretraining.
For more related discussion, see my posts “Thoughts on “AI is easy to control” by Pope & Belrose” (sections 3 & 4) and Heritability, Behaviorism, and Within-Lifetime RL.
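To make that concrete, here is a deliberately tiny sketch of the hypothetical setup above. It is nothing like real AlphaZero: the “game” is just pick-a-higher-number, the “MTurkers” are scripted opponents, and every name and number is invented for illustration. The only point is where the training signal comes from: the game outcome, not the humans’ judgments.

```python
import random

# Toy stand-in for the hypothetical AlphaZero-MTurk setup. "Go" is replaced
# by pick-a-higher-number, and the "MTurkers" are scripted opponents.
# The only training signal is whether the learner won.

def mturker_move(skill):
    """A scripted human-like opponent: novices pick low numbers, experts high ones."""
    return random.randint(0, 3) if skill == "novice" else random.randint(4, 9)

# The learner's "policy" is just a preference score for each of its 10 moves.
prefs = [0.0] * 10

def sample_move():
    """Sample a move with probability increasing in its preference score."""
    weights = [2.718 ** p for p in prefs]
    r = random.random() * sum(weights)
    for move, w in enumerate(weights):
        r -= w
        if r <= 0:
            return move
    return len(prefs) - 1

for episode in range(5000):
    opponent = random.choice(["novice", "expert"])
    my_move = sample_move()
    reward = 1.0 if my_move > mturker_move(opponent) else -1.0  # win/loss is everything
    # Crude bandit-style update: nudge the chosen move's score up after a win,
    # down after a loss. The opponents never grade moves or give advice; they only play.
    prefs[my_move] += 0.01 * reward

print([round(p, 2) for p in prefs])
# The learner ends up exploiting the opponents' play (here: favoring the highest
# numbers), not imitating them or doing what they want.
```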
Yeah, this makes sense, thanks. I think I’ve read one or maybe both of your posts, which is probably why I started having second thoughts about my comment soon after posting it. :)
The MTurkers are certainly affecting the model, but the model is not imitating the MTurkers, nor is it doing what the MTurkers want, nor is it listening to the MTurkers’ advice. Instead the model is learning to exploit weaknesses in the MTurkers’ play, including via weird out-of-the-box strategies that would have never occurred to the MTurkers themselves.
How is this very different from RLHF?
In RLHF, if you want the AI to do X, then you look at the two options and give a thumbs-up to the one where it’s doing more X rather than less X. Very straightforward!
By contrast, if the MTurkers want AlphaZero-MTurk to do X, then they have their work cut out for them. Their basic strategy would have to be: wait for AlphaZero-MTurk to do X, and then immediately throw the game (= start deliberately making really bad moves). But there are a bunch of reasons that might not work well, or at all: (1) if AlphaZero-MTurk is already in a position where it can definitely win, then the MTurkers lose their ability to throw the game (i.e., if they start making deliberately bad moves, AlphaZero-MTurk’s win probability changes from ≈100% to ≈100%); (2) there’s a reward-shaping challenge (i.e., if AlphaZero-MTurk does something close to X but not quite X, should you throw the game or not? I guess you could start playing slightly worse, in proportion to how close the AI is to doing X, but it’s probably really hard to exercise such fine-grained control over your move quality); (3) if X is a time-extended thing rather than a single move (e.g., “X = playing in a conservative style”), then what are you supposed to do? (4) And maybe other things too.
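To make the contrast concrete, here is an equally toy sketch, with made-up function names, of where the human’s signal enters the learner’s update in each case. Neither function resembles a real RLHF or AlphaZero implementation; the point is only the shape of the feedback channel.

```python
# Two toy "feedback channels", with invented names, to show where the human's
# signal enters in each case.

def rlhf_update(score_for_x, candidate_a, candidate_b, prefs, lr=0.1):
    """RLHF-style: the human compares two behaviors and thumbs-up the one
    showing more X; that candidate's preference score goes up directly."""
    preferred = candidate_a if score_for_x(candidate_a) >= score_for_x(candidate_b) else candidate_b
    prefs[preferred] = prefs.get(preferred, 0.0) + lr
    return prefs

def outcome_only_update(did_win, behavior, prefs, lr=0.1):
    """AlphaZero-MTurk-style: there is no thumbs-up button. The learner only
    sees whether it won, so the human's sole lever is to play better or worse
    after observing behavior they like or dislike."""
    prefs[behavior] = prefs.get(behavior, 0.0) + (lr if did_win else -lr)
    return prefs

prefs = {}
prefs = rlhf_update(len, "xxxx", "x", prefs)             # human directly rewards the option with more "x"
prefs = outcome_only_update(True, "style_A", prefs)      # learner won anyway, so style_A gets reinforced
print(prefs)                                             # {'xxxx': 0.1, 'style_A': 0.1}
```

The second function is where the difficulties above bite: if the learner is going to win regardless, the behavior gets reinforced no matter what the humans think of it.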
In any case, I’d be interested in an explanation of what the differences are and why one type of training produces schemers and the other is very unlikely to.
Thinking over this question myself, I think I’ve found a reasonable answer. Still interested in your thoughts, but I’ll write down mine:
It seems like evolution “wanted” us to be (in part) reward-correlate maximizers (i.e., being a reward-correlate maximizer was adaptive in our ancestral environment), and “implemented” this by having our brains internally do “heavy RL” throughout our lives. So we become reward-correlate-maximizing agents early in life, and then when our parents do something like RLHF on top of that, we become schemers pretty easily.
So the important difference is that with “pretraining + light RLHF” there’s no “heavy RL” step.
See footnote 5 for a nearby argument which I think is valid:
The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.