This is a good point that I think people often forget (particularly in AI Safety) but I think it’s also misleading in its own way.
It’s true that models don’t directly receive reward as the thing they care about; instead their behavior (including preferences and goals) is ‘selected for’ during training (via SGD rather than evolution, but the analogy holds). But a key point this post doesn’t really focus on is this line: “Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward”.
Basically, any model trained against some objective can (and arguably will, as size and complexity increase) learn an internal feature that closely represents that ‘invisible’ objective or loss function, and learn to maximize or minimize it. So now you have a model that is trying to maximize a feature that is extremely correlated with reward (or is exactly the reward), even though the model never actually “experiences” or even sees the reward itself.
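As a toy illustration of what “learning a feature that closely represents the reward” could look like, here’s a minimal sketch (my own construction, not from the post): train a small network to predict a scalar reward from observations, then check how well a simple linear probe on its hidden activations recovers the true reward. The data, network size, and probe setup are arbitrary assumptions for illustration only.

```python
# Toy sketch (illustrative only): does a reward-trained network's hidden layer
# contain a feature that is highly correlated with the reward itself?
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical "observations" and a reward that is some nonlinear function of them.
X = rng.normal(size=(5000, 10))
reward = np.tanh(X[:, 0] * X[:, 1]) + 0.5 * X[:, 2]

# Train a small network to predict reward (standing in for "training against the goal").
mlp = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   max_iter=2000, random_state=0).fit(X, reward)

# Recover the hidden-layer activations by replaying the first layer manually.
hidden = np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

# A linear probe on the hidden layer tends to recover the reward quite closely,
# i.e. the network has internalized a feature that is (nearly) the reward.
probe = LinearRegression().fit(hidden, reward)
print("probe R^2 on reward:", probe.score(hidden, reward))
```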
This can happen with any neural network (the more complex and powerful, the more likely), but I think it’s particularly likely with reinforcement learning, because the models are explicitly built to estimate reward over multiple time-steps as a function of potential actions (via value functions). We are quite literally training them to recognize an expected reward score, and to strategize and choose actions that maximize that expected score. So you can expect this kind of behavior (where the reward itself, or something very close to it, becomes a ‘known’ mesa-optimization goal encoded in the model’s weights) to “evolve” much more quickly, i.e. with less complex models.
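To make the “value functions estimate reward over multiple time-steps” point concrete, here’s a minimal tabular Q-learning sketch (the standard textbook update rule, not anything specific to the post); the toy environment and parameter values are arbitrary assumptions:

```python
# Minimal tabular Q-learning sketch: the learned quantity Q(s, a) is an explicit
# estimate of expected cumulative (discounted) reward, which the policy then maximizes.
import numpy as np

n_states, n_actions = 10, 2
gamma, alpha, eps = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    # Toy chain environment: action 1 moves right, action 0 moves left; reward at the end.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

for episode in range(500):
    s = 0
    for t in range(50):
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Bellman update: nudge Q(s, a) toward reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # each entry is literally a learned estimate of expected future reward
```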
→ It’s true that a reinforcement-learning model might still generalize poorly because, say, its value function was trained on data where the reward always came from the left side of the screen instead of the right. But an advanced RL algorithm playing a video game with a score should be able to learn that it also pays to attend to the actual score in the corner of the screen, since that generalizes better. And similarly, an advanced RL algorithm trained to get positive feedback from humans will learn patterns that help it do the thing humans wanted, but an even more generalizable approach is to learn patterns that make humans think it did the thing they wanted (this works in more scenarios, and with potentially higher reward).
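Here’s a minimal sketch of that left-side-vs-actual-reward distinction (entirely my own toy setup, not from the post): two hard-coded “policies” standing in for what a model might have learned, which behave identically while the reward is always on the left, and diverge as soon as the reward shows up on the right.

```python
# Toy sketch: a proxy heuristic ("always go left") and a more general strategy
# ("go toward wherever the reward actually is") look identical during training,
# but only the general one transfers when the reward moves.
def go_left_policy(agent_pos, reward_pos):
    return -1  # learned habit: reward was always on the left during training

def go_to_reward_policy(agent_pos, reward_pos):
    return -1 if reward_pos < agent_pos else 1  # tracks the reward itself

def rollout(policy, reward_pos, start=5, steps=10):
    pos = start
    for _ in range(steps):
        pos += policy(pos, reward_pos)
        if pos == reward_pos:
            return 1.0  # reached the reward
    return 0.0

for name, policy in [("go_left", go_left_policy), ("go_to_reward", go_to_reward_policy)]:
    train = rollout(policy, reward_pos=0)    # training: reward always on the left
    deploy = rollout(policy, reward_pos=9)   # deployment: reward on the right
    print(f"{name}: train reward={train}, deploy reward={deploy}")
```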
Sure, we can (and should) design training regimes that make the model really not want to be caught in a lie, and make it risk-averse, and therefore much less likely to act deceptively, etc. But I think it’s important to remember that as such a model’s intelligence increases, and as its ability to generalize to new circumstances increases, it becomes:
1) slightly more likely to have these ‘mesa-optimization’ goals that are very close to the training rewards, and
2) much more likely to be able to come up with strategies that perhaps wouldn’t have worked during training, but that it expects to work in some new production scenario to achieve a mesa-optimization goal (e.g. deception or power-seeking behavior to secure positive human feedback).
From this lens, saying that models are trying to ‘get more reward’ is perhaps not ideally worded, but I think it’s also a fairly valid shorthand for how we expect large models to behave.