By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they’re instrumentally useful, they’re a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.
We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model whose objective differs from the intended one but is behaviorally identical to it during training (or some part thereof). Why would this be unlikely? From the perspective of the training process, the intended objective is not privileged over such other objectives.
Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan’s words (from the Future of Life podcast):
You can imagine a situation where, every situation where the model has seen a blue door, it’s been like, “Oh, going through this blue door is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well.
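To make the within-episode point concrete, here is a minimal, purely illustrative sketch (the environment, reward, and policy names are hypothetical stand-ins, not the actual setup from Evan’s example): the training signal only ever scores what happens inside the current episode, so it cannot distinguish the intended myopic objective from the more general cross-episode one, because both recommend identical behavior during training.

```python
def run_episode(policy, num_steps=50):
    # Toy stand-in environment: reward is +1 each time the agent goes through a
    # blue door; nothing outside the current episode is ever scored.
    total_reward = 0
    state = "start"
    for _ in range(num_steps):
        action = policy(state)
        if action == "enter_blue_door":
            total_reward += 1
            state = "past_blue_door"
    return total_reward

def intended_myopic_policy(state):
    # Objective the designers had in mind: blue doors in *this* episode only.
    return "enter_blue_door"

def cross_episode_policy(state):
    # Objective from Evan's example: blue doors in future episodes too.
    # Within any single training episode it recommends identical behavior.
    return "enter_blue_door"

# Both policies earn exactly the same reward on every training episode, so a
# training signal of this form cannot prefer one objective over the other.
for policy in (intended_myopic_policy, cross_episode_policy):
    print(policy.__name__, [run_episode(policy) for _ in range(3)])
```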
Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).
Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)
Like, if we do gradient descent, and the training signal is “get a high score in PacMan”, then “mesa-optimize for a high score in PacMan” is incentivized by the training signal, and “mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips” is also incentivized by the training signal.
For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.
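As a toy illustration of the “equally incentivized” point above (a sketch under simplifying assumptions; the REINFORCE-style loss and the episode data below are hypothetical, not anyone’s actual training code): the loss only sees the behavior the policy produced and the returns that behavior earned, so two internal objectives that yield the same behavior receive the same gradient.

```python
import math

def reinforce_loss(episode):
    """episode: list of (log_prob_of_chosen_action, return) pairs."""
    # A REINFORCE-style surrogate loss: it depends only on the actions the
    # policy actually took and the returns those actions earned.
    return -sum(log_prob * ret for log_prob, ret in episode)

# Two hypothetical internal objectives that happen to produce the same behavior
# during training (same chosen actions, same probabilities, same scores):
episode_from_score_maximizer = [(math.log(0.9), 10.0), (math.log(0.8), 6.0)]
episode_from_paperclip_maximizer = [(math.log(0.9), 10.0), (math.log(0.8), 6.0)]

# Identical behavior -> identical loss -> identical gradients. The training
# signal itself contains nothing that distinguishes the two internal objectives.
print(reinforce_loss(episode_from_score_maximizer))
print(reinforce_loss(episode_from_paperclip_maximizer))
```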
Here’s a nice example. Let’s say we do RL, and our model is initialized with random weights. The training signal is “get a high score in PacMan”. We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it’s fabulously effective at calculating digits of π—it calculates them by the billions—and it’s doing nothing else, it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it’s in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π. I would sure be awfully surprised to see that! Wouldn’t you? If so, then you agree with me that “reasoning about training incentives” is a valid type of reasoning about what to expect from trained ML models. I don’t think it’s a controversial opinion...
Again, I did not (and don’t) claim that this type of reasoning should lead people to believe that mesa-optimizers won’t happen, because there do tend to be training incentives for mesa-optimization.
I would sure be awfully surprised to see that! Wouldn’t you?
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn’t be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don’t rely on deceptive alignment. To “convert” your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is “make the relevant memory location in the RAM say that I won the game”, or “win the game in all future episodes”.
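Here is one way to picture the first of those objectives, as a hedged toy sketch (the class and method names are hypothetical, not a claim about any real training setup): the training signal only reads a stored result, so an agent that writes the “I won” value directly into that location is indistinguishable, from the training process’s point of view, from one that actually wins.

```python
class ToyGame:
    def __init__(self):
        self.actually_won = False   # the real-world outcome we care about
        self.result_register = 0    # the memory location the training signal reads

    def win_legitimately(self):
        self.actually_won = True
        self.result_register = 1

    def tamper_with_register(self):
        # Never wins the game, but produces the exact same read-out.
        self.result_register = 1

def training_signal(game):
    # The training process only ever looks at the stored result.
    return game.result_register

for strategy in ("win_legitimately", "tamper_with_register"):
    game = ToyGame()
    getattr(game, strategy)()
    print(strategy, "signal =", training_signal(game),
          "actually won =", game.actually_won)
```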
My hunch is that we don’t disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you’re misinterpreting me as saying something more interesting than I am.