I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using “blind search over a super-broad, probably-even-Turing-complete, space of models” as one of its ingredients. I guess I’m just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs… (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
The Evolutionary Story
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), and (2) things that are indirectly incentivized by the training signal (they’re instrumentally useful, or they’re a side-effect, or they “come along for the ride” for some other reason), (3) things that are so simple to do that they can happen randomly.
So I guess I can imagine a strategy of saying “mesa-optimization won’t happen” in some circumstance because we’ve somehow ruled out all three of those categories.
This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.
...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of “mesa-optimization in predictive learning” entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.
Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that’s part of an argument that such mesa-optimizers are improbable. If that argument is correct, then the worst we would expect from a “misaligned mesa-optimizer” is that it will use an inappropriate prediction heuristic in some circumstances, and then we’d wind up with inaccurate predictions. That’s a capability problem, not a safety problem.
So anyway, if there’s a good argument along those lines, it would not be a safety argument that involves “There will be no mesa-optimizers”, but rather “There will be no mesa-optimizers that think outside the box”, so to speak. Details and (sketchy) argument in a forthcoming post.
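As a toy illustration of the "search over approaches, then take the most promising one" pattern described above, here is a minimal sketch. Everything in it is hypothetical: the three candidate predictors, the backtest scoring rule, and the example series are made up for illustration, and the search is hand-written rather than something a trained network has learned to do internally. The point is only that the search ranges over ways of processing the data, not over actions in the world.

```python
from typing import Callable, Dict, List, Tuple

Predictor = Callable[[List[float]], float]


def predict_last(history: List[float]) -> float:
    """Guess that the next value repeats the most recent one."""
    return history[-1]


def predict_mean(history: List[float]) -> float:
    """Guess the average of everything seen so far."""
    return sum(history) / len(history)


def predict_linear(history: List[float]) -> float:
    """Extrapolate the most recent step (needs at least two points)."""
    return 2 * history[-1] - history[-2]


def backtest_error(predictor: Predictor, history: List[float]) -> float:
    """Mean absolute one-step-ahead error on the history itself.

    Assumes len(history) >= 3 so that every predictor has enough context.
    """
    errors = [
        abs(predictor(history[:t]) - history[t])
        for t in range(2, len(history))
    ]
    return sum(errors) / len(errors)


def search_then_predict(
    history: List[float], approaches: Dict[str, Predictor]
) -> Tuple[str, float]:
    """Inner search: score each approach, then use the best one."""
    best = min(approaches, key=lambda name: backtest_error(approaches[name], history))
    return best, approaches[best](history)


approaches = {
    "last": predict_last,
    "mean": predict_mean,
    "linear": predict_linear,
}
history = [1.0, 3.0, 5.0, 7.0, 9.0]  # hypothetical data
chosen, forecast = search_then_predict(history, approaches)
```

If this inner search picks a poor heuristic, the result is a bad forecast, which is the "capability problem, not a safety problem" failure mode mentioned above.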
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), and (2) things that are indirectly incentivized by the training signal (they’re instrumentally useful, or they’re a side-effect, or they “come along for the ride” for some other reason), (3) things that are so simple to do that they can happen randomly.
We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that is identical to the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process.
Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan’s words (from the Future of Life podcast):
You can imagine a situation where every situation where the model has seen a blue door, it’s been like, “Oh, going through this blue is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well.
Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).
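To make the "identical during training, different elsewhere" point concrete, here is a minimal sketch built on the blue-door example. The two objective functions, the situation tuples, and the numbers are all hypothetical stand-ins; nothing here models an actual RL setup. It only shows that a training signal computed on situations where the two objectives coincide cannot tell them apart.

```python
from typing import Callable, List, Tuple

# A situation is (blue doors reachable now, blue doors reachable in later episodes).
Situation = Tuple[int, int]
Objective = Callable[[int, int], int]


def intended_objective(doors_now: int, doors_later: int) -> int:
    """Myopic: only blue doors in the current episode count."""
    return doors_now


def generalized_objective(doors_now: int, doors_later: int) -> int:
    """Non-myopic: blue doors in any episode count."""
    return doors_now + doors_later


def agree_on(situations: List[Situation], f: Objective, g: Objective) -> bool:
    """True if the two objectives assign the same value everywhere in the list."""
    return all(f(*s) == g(*s) for s in situations)


# Hypothetical assumption: during training the agent never encounters a way
# to affect later episodes, so doors_later is always 0.
training_situations = [(0, 0), (1, 0), (3, 0)]
# Hypothetical later situation where future-episode doors become reachable.
later_situations = [(1, 5)]

agree_in_training = agree_on(
    training_situations, intended_objective, generalized_objective
)
agree_later = agree_on(later_situations, intended_objective, generalized_objective)
```

Under these assumptions the two objectives are indistinguishable on the training situations and come apart only afterwards.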
Like, if we do gradient descent, and the training signal is “get a high score in PacMan”, then “mesa-optimize for a high score in PacMan” is incentivized by the training signal, and “mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips” is also incentivized by the training signal.
For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.
Here’s a nice example. Let’s say we do RL, and our model is initialized with random weights. The training signal is “get a high score in PacMan”. We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it’s fabulously effective at calculating digits of π—it calculates them by the billions—and it’s doing nothing else, it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it’s in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π. I would sure be awfully surprised to see that! Wouldn’t you? If so, then you agree with me that “reasoning about training incentives” is a valid type of reasoning about what to expect from trained ML models. I don’t think it’s a controversial opinion...
Again, I did not (and don’t) claim that this type of reasoning should lead people to believe that mesa-optimizers won’t happen, because there do tend to be training incentives for mesa-optimization.
I would sure be awfully surprised to see that! Wouldn’t you?
My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning-about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn’t be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.
Note that the examples in my comment don’t rely on deceptive alignment. To “convert” your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is “make the relevant memory location in the RAM say that I won the game”, or “win the game in all future episodes”.
My hunch is that we don’t disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you’re misinterpreting me as saying something more interesting than I am.
I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using “blind search over a super-broad, probably-even-Turing-complete, space of models” as one of its ingredients. I guess I’m just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs… (Of course I just wind up with a different set of unsolved AGI safety problems instead...)
Wait, you think your prosaic story doesn’t involve blind search over a super-broad space of models??
I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don’t particularly expect it to.
I agree that replacing “blind search” with different tools is a very important direction. But your proposal doesn’t do that!
By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), and (2) things that are indirectly incentivized by the training signal (they’re instrumentally useful, or they’re a side-effect, or they “come along for the ride” for some other reason), (3) things that are so simple to do that they can happen randomly.
So I guess I can imagine a strategy of saying “mesa-optimization won’t happen” in some circumstance because we’ve somehow ruled out all three of those categories.
This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.
I agree with this general picture. While I’m primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them.
...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of “mesa-optimization in predictive learning” entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.
There were a lot of misunderstandings in the earlier part of our conversation, so, I could well have misinterpreted one of your points.
But if so, I’m struggling even more to see why you would have been optimistic that your RL scenario doesn’t involve risk due to unintended mesa-optimization.
Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that’s part of an argument that such mesa-optimizers are improbable.
By your own account, the other part would be to argue that they’re not simple, which you haven’t done. They’re not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.
Wait, you think your prosaic story doesn’t involve blind search over a super-broad space of models??
No, not prosaic, that particular comment was referring to the “brain-like AGI” story in my head...
Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve “neural nets”, and (something like) gradient descent, and RL, etc.
By contrast, I haven’t written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I’m thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.
In that case, yes there’s a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don’t think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard… :-P
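For readers unfamiliar with the reference, here is a minimal sketch of ordinary sum-product belief propagation on a three-variable chain. It is plain textbook message passing, not the "more complicated cousin" mentioned above, and the potentials are made-up numbers. What it is meant to show is the division of labor: the inference procedure is fixed, human-written code, while the potentials are the part that a search over models would fill in.

```python
from itertools import product
from typing import List

# Hypothetical potentials for a binary chain A - B - C.
unary = {
    "A": [0.6, 0.4],
    "B": [0.5, 0.5],
    "C": [0.3, 0.7],
}
pair_ab = [[0.9, 0.1], [0.2, 0.8]]  # indexed as pair_ab[a][b]
pair_bc = [[0.7, 0.3], [0.4, 0.6]]  # indexed as pair_bc[b][c]


def normalize(weights: List[float]) -> List[float]:
    total = sum(weights)
    return [w / total for w in weights]


def pass_message(
    incoming: List[float], unary_pot: List[float], pairwise: List[List[float]]
) -> List[float]:
    """Sum-product message from a node to its right-hand neighbour."""
    return [
        sum(incoming[s] * unary_pot[s] * pairwise[s][t] for s in range(2))
        for t in range(2)
    ]


def marginal_c_by_message_passing() -> List[float]:
    """Hardcoded inference: pass messages A -> B -> C, then read off C."""
    msg_a_to_b = pass_message([1.0, 1.0], unary["A"], pair_ab)
    msg_b_to_c = pass_message(msg_a_to_b, unary["B"], pair_bc)
    return normalize([unary["C"][c] * msg_b_to_c[c] for c in range(2)])


def marginal_c_by_enumeration() -> List[float]:
    """Brute-force check: sum the joint over every assignment."""
    weights = [0.0, 0.0]
    for a, b, c in product(range(2), repeat=3):
        weights[c] += (
            unary["A"][a] * unary["B"][b] * unary["C"][c]
            * pair_ab[a][b] * pair_bc[b][c]
        )
    return normalize(weights)


bp_marginal = marginal_c_by_message_passing()
brute_marginal = marginal_c_by_enumeration()
```

On a chain the message-passing result matches brute-force enumeration, which is why the two functions can be checked against each other.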
No, not prosaic, that particular comment was referring to the “brain-like AGI” story in my head...
Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect.
By contrast, I haven’t written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I’m thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.
I would have been much more interested in your posts in the past if you had emphasized this aspect more ;p But perhaps you held back on that to avoid contributing to capabilities research.
In that case, yes there’s a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don’t think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard… :-P
Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)
Yeah, this is a very important question!