This is excellent; it feels way better as a definition of optimization than past attempts :) Thanks in particular for the academic style, specifically relating it to previous work; it made it much more accessible for me.
Let’s try to build up some core AI alignment arguments with this definition.
Task: A task is simply an “environment” along with a target configuration set. Whenever I talk about a “task” below, assume that I mean an “interesting” task, i.e. something like “build a chair”, as opposed to “have the air molecules be in one of these particular configurations”.
Solving a task: An object O solves a task T if adding O to T’s environment transforms it into an optimizing system for T’s target configuration set.
Performance on the task: If O solves task T, its performance is quantified by how quickly the resulting system reaches the target configuration set, and how robust it is to perturbations.
Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.
Optimizing AI: A computer program for which there exists an interesting task such that the computer program solves that task.
This isn’t exactly right, as it includes e.g. accounting programs or video games, which when paired with a human form an optimizing system for correct financials and winning the game, respectively. You might be able to fix this by saying that the optimizing system has to be robust to perturbations in any human behavior in the environment.
AGI: An optimizing AI whose generality of intelligence is at least as great as that of humans.
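Roughly, one could cash these definitions out as something like the following sketch (not anything from the post; step, in_target, add_object, and perturbations are hypothetical helpers supplied by each task):

def time_to_target(state, step, in_target, max_steps=10_000):
    # Run the environment forward; return the number of steps taken to enter
    # the target configuration set, or None if it never does.
    for t in range(max_steps):
        if in_target(state):
            return t
        state = step(state)
    return None

def solves(O, task):
    # O solves T if adding O to T's environment yields an optimizing system
    # for T's target set: the combined system reaches the target from the
    # nominal start and from every perturbed start we test.
    starts = [task.add_object(O)] + [perturb(task.add_object(O)) for perturb in task.perturbations]
    return all(time_to_target(s, task.step, task.in_target) is not None for s in starts)

def generality(O, tasks):
    # Crude generality score: how many (interesting) tasks O solves.
    # An AGI would score at least as high as a human over the same task set.
    return sum(solves(O, t) for t in tasks)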
Argument for AI risk: As optimizing AIs become more and more general, we will apply them to more economically useful tasks T. However, they also become more and more robust to perturbations, possibly including perturbations such as “we try to turn off the AI”. As a result, we might eventually have AIs that form strong optimizing systems for some task T that isn’t the one we actually wanted, which tends to be bad due to fragility of value.
Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
Argument for mesa optimization: Due to the complexity and noise in the real world, most economically useful tasks require setting up a robust optimizing system, rather than directly creating the target configuration state. (See also the importance of feedback for more on this intuition.) It seems likely that humans will find it easier to create algorithms that then find AGIs that can create these robust optimizing systems, rather than creating an algorithm that is directly an AGI.
(The previous argument also applies: this is basically just a generalization of the previous point to arbitrary AI systems, instead of only deep learning.)
I want to note that under this approach the notion of “search” and “mesa objective” are less natural, which I see as a pro of this approach (see also here): the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).
Outer alignment: ??? Seems hard to formalize in this framework. This makes me feel like outer alignment is less important as a concept. (I also don’t particularly like formalizations outside of this framework.)
Inner alignment: Ensuring that (conditional on mesa optimization occurring) the inner AGI is aligned with the operator / user, that is, combined with the user it forms an optimizing system for “doing what the user wants”. (Note that this is explicitly not intent alignment, as it is hard to formalize intent alignment in this framework.)
Intent alignment: ??? As mentioned above, it’s hard to formalize in this framework, as intent alignment really does require some notion of “motivation”, “goals”, or “trying”, which this framework explicitly leaves out. I see this as a con of this framework.
Expected utility maximization: One particular architecture that could qualify as an AGI (if the utility function is treated as part of the environment, and not part of the AGI). I see the fact that EU maximization is no longer highlighted as a pro of this approach.
Wireheading: Special case of the argument for AI risk with a weird task of “maximize the number in this register”. Unnatural in this framing of the AI risk problem. I see this as a pro of this framing of the problem, though I expect people disagree with me on this point.
Thanks for the very thoughtful comment Rohin. I was on retreat last week after I published the article and upon returning to computer usage I was delighted by the engagement from you and others.
Generality of intelligence: The generality of O’s intelligence is a function of the number and diversity of tasks T that it can solve, as well as its performance on those tasks.
I like this.
We’ll presumably need to give O some information about the goal / target configuration set for each task. We could say that a robot capable of moving a vase around is a little bit general since we can have it solve the tasks of placing the vase at many different locations by inputting some latitude/longitude into some appropriate memory location. But this means we’re actually pasting in a different object O for each task T—each of the objects differs in those memory locations into which we’re pasting the latitude/longitude. It might be helpful to think of an “agent schema” function that maps goals to objects: we take the goal part of the task, compute the object O for that goal, then paste this object into the environment.
It’s also important that O be able to solve the task for a reasonably broad range of environments.
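A toy sketch of the “agent schema” idea (all names here, such as VaseBot, are made up for illustration):

class VaseBot:
    # Stub robot object; only the memory dict matters for this illustration.
    def __init__(self):
        self.memory = {}

def vase_robot_schema(goal_lat, goal_lon):
    # Agent schema: a function from goals to objects O. Each distinct goal
    # yields a distinct object, differing only in these memory locations.
    robot = VaseBot()
    robot.memory["target_lat"] = goal_lat
    robot.memory["target_lon"] = goal_lon
    return robot

Generality would then be a property of the schema: for each task T we paste vase_robot_schema(T.goal) into T’s environment and ask whether the combined system converges to T’s target set across a reasonably broad range of environments.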
Inner alignment
Perhaps we could look at it this way: take a system containing a human that is trying to get something done. This is presumably an optimizing system as humans often robustly move their environment towards some desired target configuration set. Then an inner-aligned AI is an object O such that adding it to this environment does not change the target configuration set, but does change the speed and/or robustness of convergence to that target configuration set.
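As a sketch of that criterion (assuming hypothetical helpers for reading off a system’s target set, convergence time, and robustness):

def inner_aligned(O, human_system, target_set_of, time_to_converge, robustness):
    # O is inner-aligned with the human-containing optimizing system if adding
    # it leaves the target configuration set unchanged while improving the
    # speed and/or robustness of convergence to that set.
    with_O = human_system.add(O)
    same_target = target_set_of(with_O) == target_set_of(human_system)
    helps = (time_to_converge(with_O) < time_to_converge(human_system)
             or robustness(with_O) > robustness(human_system))
    return same_target and helps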
Intent alignment
Yup, it’s very difficult to say much about intentions using the pure outside-view approach of this framework. Perhaps we could say that an intent-aligned AI is an inner-aligned AI modulo less robustness. Or perhaps we could say that an intent-aligned AI is an AI that would achieve the goal in a large set of benign environments, but might not achieve it in the presence of unlikely mistakes, unlikely environmental conditions, or other powerful basins of attraction.
But this doesn’t really get at the spirit of Paul’s idea, which I think is about really looking inside the AI and understanding its goals.
+1 to all of this.
We’ll presumably need to give O some information about the goal / target configuration set for each task.
I was imagining that the tasks can come equipped with some specification, but some sort of counterfactual also makes sense. This also gets around issues of the AI system not being appropriately “motivated”—e.g. I might be capable of performing the task “lock up puppies in cages”, but I wouldn’t do it, and so if you only look at my behavior you couldn’t say that I was capable of doing that task.
But this doesn’t really get at the spirit of Paul’s idea, which I think is about really looking inside the AI and understanding its goals.
+1 especially to this
Mild optimization: the easiest way to solve hard tasks may be to specify a proxy, which an AI maximizes. The AI steers into configurations which maximize the proxy function. Simple proxies don’t usually have target sets which we like, because human value is complex. However, maybe we just want the AI to randomly select a configuration which satisfies the proxy, instead of finding the maximally-proxy-ness configuration, which may be bad due to extremal Goodhart.
Quantilization tries to solve this by randomly selecting a target configuration from some top quantile, but this is sensitive to how world states are individuated.
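A sketch of the contrast, assuming we can enumerate (or sample) candidate configurations and score them with the proxy:

import random

def argmax_config(configs, proxy):
    # Non-mild optimization: steer into a configuration that maximizes the
    # proxy, which is prone to extremal Goodhart.
    return max(configs, key=proxy)

def satisficing_config(configs, proxy, threshold):
    # Mild optimization: pick uniformly at random among configurations that
    # merely satisfy the proxy (assumes at least one such configuration).
    ok = [c for c in configs if proxy(c) >= threshold]
    return random.choice(ok)

def quantilizer_config(configs, proxy, q=0.1):
    # Quantilization: sample uniformly from the top-q fraction of
    # configurations by proxy score. Note the result depends on how the
    # configurations are individuated (i.e., on the base distribution).
    ranked = sorted(configs, key=proxy, reverse=True)
    top = ranked[: max(1, int(q * len(ranked)))]
    return random.choice(top)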
This makes sense, but I think you’d need a different notion of optimizing systems than the one used in this post. (In particular, instead of a target configuration set, you want a continuous notion of goodness, like a utility function / reward function.)
I’m saying the target set for non-mild optimization is the set of configurations which maximize proxy-ness. Just take the argmax. By contrast, we might want to sample uniformly at random from the set of satisficing configurations, which is much larger.
(This is assuming a fixed initial state.)
It sounds like you’re assuming that the target configuration set is built into the AI system. According to me, a major point of this post / framework is to avoid that assumption altogether, and only describe problems in terms of the actual observed system behavior.
(This is why within this framework I couldn’t formalize outer alignment, and why wireheading and the search / mesa-objective split is unnatural.)
I see the tension you’re pointing at. I think I had in mind something like “an AI is reliably optimizing utility function u over the configuration space (but not necessarily over universe-histories!) if it reliably moves into high-rated configurations”, and you could draw different epsilon-neighborhoods of optimality in configuration space. It seems like you should be able to talk about dog-maximizers without requiring that the agent robustly end up in the maximum-dog configurations (and not in max-minus-one-dog configs).
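One possible way to cash that out (a sketch only; run_system and the perturbed starts are assumed given):

def reliably_optimizes(run_system, perturbed_starts, u, u_max, epsilon):
    # The system counts as reliably optimizing u (at tolerance epsilon) if,
    # from every perturbed start we test, it settles into the
    # epsilon-neighborhood {x : u(x) >= u_max - epsilon}. A dog-maximizer can
    # then count even if it ends in max-minus-one-dog configurations.
    return all(u(run_system(s)) >= u_max - epsilon for s in perturbed_starts)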
Deep learning AGI implies mesa optimization: Since deep learning is so sample inefficient, it cannot reach human levels of performance if we apply deep learning directly to each possible task T. (For example, it has to relearn how the world works separately for each task T.) As a result, if we do get AGI primarily via deep learning, it must be that we used deep learning to create a new optimizing AI system, and that system was the AGI.
I don’t quite understand what this is saying.
Suppose we train a giant deep learning model via self-supervised learning on a ton of real-world data (like GPT-N, but w/ other sensory modalities besides text), and then we build a second system designed to provide a nice interface to the giant model.
We’d give task specifications to the interface, and it would have some smarts about how to consult the model to figure out what to do. (The interface might also be learned, via reinforcement or supervised learning, or it might be hand-coded.)
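Concretely, something like this cartoon (predict_next is an assumed API for the learned model, and the scoring function stands in for the hand-coded smarts):

class Interface:
    # Hand-coded (or separately learned) wrapper that consults a big
    # self-supervised model in order to act on a task specification.
    def __init__(self, model):
        self.model = model  # e.g. a GPT-N-like sequence predictor

    def act(self, task_spec, observation):
        # Ask the model for candidate next actions given the task spec and
        # the current observation, then apply some hand-coded ranking.
        prompt = f"Task: {task_spec}\nObservation: {observation}\nAction:"
        candidates = self.model.predict_next(prompt, n_samples=5)
        return max(candidates, key=lambda a: self.score(a, task_spec))

    def score(self, action, task_spec):
        # Domain-specific filtering/scoring of the model's suggestions;
        # left trivial in this sketch.
        return 0.0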
It seems plausible to me that a system comprising these two pieces, the model and the interface, could be an AGI according to the definition here, in that when combined with a very wide variety of environments (including the task specification in the environment), it could perform at least as well as a human.
And since most of the smarts seem like they’d be in the model rather than the interface, I’d count it as getting AGI “primarily via deep learning”, even if the interface was hand-coded.
But it’s not clear to me whether that would count as using deep learning to “create a new optimizing AI system”, which is itself the AGI. The whole system is an Optimizing AI, according to the definition given above, but neither of the two parts is by itself, and it doesn’t seem to have the flavor of mesa-optimization, as I understand it. So it seems to contradict the quoted claim.
Have I misunderstood what you’re saying here, or do you disagree with the characterization I gave of the hypothetical model + interface system? (Or have I perhaps misunderstood mesa-optimization?)
The whole system is an Optimizing AI, according to the definition given above, but neither of the two parts is by itself
Yeah, I’m talking about the whole system.
it doesn’t seem to have the flavor of mesa-optimization
Yeah, I agree it doesn’t fit the explanation / definition in Risks from Learned Optimization. I don’t like that definition, and usually mean something like “running the model parameters instantiates a computation that does ‘reasoning’”, which I think does fit this example. I mentioned this a bit later in the comment:
I want to note that under this approach the notion of “search” and “mesa objective” are less natural, which I see as a pro of this approach [...]: the argument is that we’ll get a general inner optimizing AI, but it doesn’t say much about what task that AI will be optimizing for (and it could be an optimizing AI that is retargetable by human instructions).
Thanks for the additions here. I’m also unsure how to gel this definition (which I quite like) with the inner/outer/mesa terminology. Here is my knuckle-dragging model of the post’s implication:
target_set = f(env, agent)
So if we plug in a bunch of values for agent and hope for the best, the target_set we get might not be what we desired. This would be misalignment. Whereas the alignment task is more like: fix target_set and env, and solve for agent.
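In code, the contrast might look something like this (f here is the assumed map from (env, agent) to the induced target set):

def plug_in_and_hope(f, env, agents, desired_target_set):
    # Forward direction: pick some agent, see what target set we actually
    # get, and notice (possibly too late) whether it matches what we wanted.
    agent = agents[0]
    return f(env, agent) == desired_target_set

def solve_for_agent(f, env, agents, desired_target_set):
    # Alignment as an inverse problem: fix env and the desired target_set,
    # then search over candidate agents for one that induces exactly that set.
    for agent in agents:
        if f(env, agent) == desired_target_set:
            return agent
    return None  # no aligned agent in the candidate pool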
The stuff about mesa optimisers mainly sounds like inadequate (narrow) modelling of what env, agent and target_set are: usually fixating on some fraction of the problem (a win-the-battle-but-lose-the-war problem).