If we make an AGI, and the AGI starts doing Anki because it’s instrumentally useful, then I don’t care, that doesn’t seem safety-relevant. I definitely think things like this happen by default.
If we make an AGI and the AGI develops (self-reflective) preferences about its own preferences, I care very much, because now it’s potentially motivated to change its preferences, which can be good (if its meta-preferences are aligned with what I was hoping for) or bad (if misaligned). See here. I note that intervening on an AGI’s meta-preferences seems hard. Like, if the AGI turns to look at an apple, we can make a reasonable guess that it might be thinking about apples at that moment, and that at least helps us get our foot in the door (cf. Section 4.1 in OP)—but there isn’t an analogous trick for meta-preferences. (This is a reason that I’m very interested in the nuts-and-bolts of how self-concept works in the human brain. Haven’t made much progress on that though.)
I’m not sure what you mean by “separate training for cognitive strategy”. Also, “give rise to a self-reflective mesa-optimizer that’s capable of taking over the outer process” doesn’t parse for me. If it’s important, can you explain in more detail?
So, parsing it a bit at a time (being more thorough than is strictly necessary):
What does it mean for some instrumentally-useful behavior (let’s call it behavior “X”) to give rise to a mesa-optimizer?
It means that if X is useful for a system in training, that system might learn to do X by instantiating an agent who wants X to happen. So if X is “trying to have good cognitive habits,” there might be some mesa-optimizer that literally wants the whole system to have good cognitive habits (in whatever sense was rewarded on the training data), even if “trying to have good cognitive habits” was never explicitly rewarded.
What’s “self-reflective” and why might we expect it?
“Self-reflective” means doing a good job of modeling how you fit into the world, how you work, and how those workings might be affected by your actions. A non-self-reflective optimizer is like a chess-playing agent—it makes moves that it thinks will put the board in a better state, but it doesn’t make any plans about itself, since it’s not on the board. An optimizer that’s self-reflective will represent itself when making plans, and if this helps the agent do its job, we should expect the learning process to lead to self-reflective agents.
What does a self-reflective mesa-optimizer do?
It makes plans so that it doesn’t get changed or removed by the dynamics of the process that gave rise to it. Without such plans, it wouldn’t be able to stay the same agent for very long.
Why would a mesa-optimizer want to take over the outer process?
Suppose there’s some large system being trained (the “outer process”) that has instantiated a mesa-optimizer that’s smaller than the system as a whole. The smaller mesa-optimizer wants to control the larger system to satisfy its own preferences. If the mesa-optimizer wants “good cognitive habits,” for instance, it might want to obtain lots of resources to run really good cognitive habits on.
[And by “but I mostly expect gradient descent to work” I meant that I expect gradient descent to suppress the formation of such mesa-optimizers.]
Thanks. I’m generally thinking about model-based RL where the whole system is unambiguously an agent that’s trying to do things, and the things it’s trying to do are related to items in the world-model that the value function thinks are high-value, and “world-model” and “value function” are labeled boxes in the source code, and inside those boxes a learning algorithm builds unlabeled trained models. (We can separately argue about whether that’s a good thing to be thinking about.)
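To make that “labeled boxes” picture concrete, here’s a toy sketch (all class and function names are made up for illustration, and the contents of each box are stand-ins for whatever a learning algorithm would actually build—this isn’t any real training setup):

```python
# Toy sketch of the "labeled boxes" picture: the world-model and value
# function are explicit components in the source code, while whatever is
# *inside* each box is an unlabeled learned model. All names are hypothetical.

class WorldModel:
    """Labeled box: predicts the next state from the current state and action."""
    def __init__(self):
        self.learned_dynamics = {}  # filled in by a learning algorithm; contents unlabeled

    def predict(self, state, action):
        # In a real system this would be a trained predictive model.
        return self.learned_dynamics.get((state, action), state)

class ValueFunction:
    """Labeled box: scores world-model states/concepts."""
    def __init__(self):
        self.learned_values = {}  # likewise filled in by learning; contents unlabeled

    def value(self, state):
        return self.learned_values.get(state, 0.0)

class Agent:
    """The whole system is unambiguously one agent: it plans with the
    world-model and picks actions whose predicted outcomes the value
    function scores highly."""
    def __init__(self, world_model, value_function):
        self.world_model = world_model
        self.value_function = value_function

    def choose_action(self, state, candidate_actions):
        return max(
            candidate_actions,
            key=lambda a: self.value_function.value(self.world_model.predict(state, a)),
        )
```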
In this picture, you can still have subagents / Society-Of-Mind; for example, if the value function assigns high value to the world-model concept “I will follow through on my commitment to exercise” and also assigns high value to the world-model concept “I will watch TV”, then this situation can be alternatively reframed as two subagents duking it out. But still, insofar as the subagents are getting anything done, they’re getting things done in a way that uses the world-model as a world-model, and uses the value function as a value function, etc.
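Continuing the toy sketch above (again with made-up names and numbers), the “two subagents duking it out” framing is just one value function happening to score two conflicting world-model concepts highly, with ordinary planning arbitrating between them:

```python
# Toy illustration (hypothetical values): one value function assigning high
# value to two conflicting world-model concepts can be reframed as two
# "subagents" competing, but both are still using the same world-model as a
# world-model and the same value function as a value function.

value_of_concept = {
    "I will follow through on my commitment to exercise": 8.0,
    "I will watch TV": 7.5,
}

candidate_plans = {
    "go for a run": "I will follow through on my commitment to exercise",
    "sit on the couch": "I will watch TV",
}

# The plan whose predicted outcome scores highest "wins"; that's the whole
# content of the subagent-vs-subagent story in this picture.
winning_plan = max(candidate_plans, key=lambda p: value_of_concept[candidate_plans[p]])
print(winning_plan)  # -> "go for a run"
```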
By contrast, when people talk about mesa-optimizers, they normally have in mind something like RFLO, where agency & planning wind up emerging entirely inside a single black box. I don’t expect that to happen for various reasons, cf. here and here.
OK, so if we restrict to model-based RL, and we forget about mesa-optimizers, then my best-guess translation of “Is separate training for cognitive strategy useful?” into my ontology is something like “Should we set up the AGI’s internal reward function to ‘care about’ cognitive strategy explicitly, and not just let the cognitive strategy emerge by instrumental reasoning?” I mostly don’t have any great plan for the AGI’s internal reward function in the first place, so I don’t want to rule anything out. I can vaguely imagine possible reasons that doing this might be a good idea; e.g., if we want the AGI to avoid out-of-the-box solutions or human-manipulation-related solutions to its problems, we could plausibly implement that via a reward function term related to cognitive strategy.
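Purely as a hedged illustration of what such a term might look like (I’m not proposing this as an actual design, and every name, weight, and especially the assumption that we could recognize the relevant thoughts at all is made up):

```python
# Hypothetical sketch of an internal reward with an explicit
# cognitive-strategy term alongside the usual task-related term.
# Everything here is an assumption for illustration, not a proposal.

def task_reward(outcome):
    return 1.0 if outcome == "task_solved" else 0.0

def cognitive_strategy_penalty(current_thought):
    # Stand-in for some (currently unknown) way of recognizing
    # manipulation-related or out-of-the-box planning in the AGI's thoughts.
    flagged = {"model the human overseer in order to manipulate them"}
    return -10.0 if current_thought in flagged else 0.0

def internal_reward(outcome, current_thought):
    return task_reward(outcome) + cognitive_strategy_penalty(current_thought)

print(internal_reward("task_solved", "brainstorm straightforward solutions"))          # 1.0
print(internal_reward("task_solved", "model the human overseer in order to manipulate them"))  # -9.0
```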
I still suspect that we’re probably talking about different things and having two parallel independent conversations. ¯\_(ツ)_/¯