I’m confused about why the inner alignment problem is conceptually different from the outer alignment problem. From a general perspective, we can think of the task of building any AI system as humans trying to optimize for their values by searching over some solution space. In this scenario, the programmer becomes the base optimizer and the AI system becomes the mesa-optimizer. The outer alignment problem thus seems like a particular manifestation of the inner alignment problem, where the base optimizer happens to be a human.
In particular, if there exists a robust solution to the outer alignment problem, then presumably there’s some property P that we want the AI system to have and some process Pmeta that convinces us that the AI system has property P. I don’t see why we can’t just give the AI system the ability to run Pmeta itself, to ensure that any optimizers it creates have property P (modulo the problem of ensuring that the system actually has Pmeta via some Pmeta-meta, ensuring that via Pmeta-meta-meta, and so on). I suppose you could have a solution to the outer alignment problem consisting of P and Pmeta without having the recursive tower needed to solve the inner alignment problem, but that doesn’t seem like the issue being raised here. (Something something Löbian obstacle.)
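To make the worry concrete, here is a toy Python sketch (the class, the check, and all names are hypothetical stand-ins I made up, not anything from the post) of handing Pmeta to the system itself: the system refuses to deploy any sub-optimizer that fails the check, but nothing in the sketch certifies that the check itself is present and correct, which is exactly where the Pmeta-meta regress comes from.

```python
# Toy sketch, not a real proposal: P and Pmeta below are hypothetical stand-ins.

class Optimizer:
    def __init__(self, certified_aligned: bool):
        self.certified_aligned = certified_aligned

    def spawn_sub_optimizer(self, certified_aligned: bool) -> "Optimizer":
        candidate = Optimizer(certified_aligned)
        # The system only deploys sub-optimizers that pass Pmeta...
        if not p_meta(candidate):
            raise ValueError("candidate fails Pmeta; refusing to deploy")
        # ...but nothing here certifies that this check is itself present and
        # correct -- that would be the job of Pmeta-meta, and so on up the tower.
        return candidate


def has_property_p(optimizer: Optimizer) -> bool:
    """Stand-in for P: the property we want every optimizer to have."""
    return optimizer.certified_aligned


def p_meta(optimizer: Optimizer) -> bool:
    """Stand-in for Pmeta: the process that convinces us P holds."""
    return has_property_p(optimizer)


parent = Optimizer(certified_aligned=True)
child = parent.spawn_sub_optimizer(certified_aligned=True)  # passes the check
```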
In particular, the post says:
We will call the problem of eliminating the base-mesa objective gap the inner alignment problem, which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers. This terminology is motivated by the fact that the inner alignment problem is an alignment problem entirely internal to the machine learning system, whereas the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.
On my view, if M is the machine learning system and P are the programmers, we can view P as playing the role of the “machine learning system” (the base optimizer) and M as playing the role of the mesa-optimizer. The task of aligning the mesa-objective with the specified loss seems like the same type of problem as the task of aligning the loss function of M with the programmers’ values.
Maybe the important thing is that loss functions are functions and values are not, so the point is that even if we have a function that represents our values, things can still go wrong. That is, people previously thought the main problem was finding a function that does what we want when it gets optimized, but mesa-optimizer pseudo-alignment shows that even if we have such a function, we can’t just optimize it.
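To make this concrete, here is a minimal numeric sketch (the functions and numbers are toy examples I made up, not anything from the post) of pseudo-alignment: a proxy objective that agrees with the specified objective on every training input but diverges off-distribution, so optimizing the proxy is not the same as optimizing the function we actually wrote down.

```python
# Toy illustration of pseudo-alignment: a proxy objective that matches the
# specified objective on the training distribution but not off-distribution.

def base_objective(x: float) -> float:
    """The function we actually specified (stand-in for 'our values')."""
    return -(x - 1.0) ** 2

def mesa_objective(x: float) -> float:
    """A proxy that agrees with base_objective on the training range [0, 2]."""
    if 0.0 <= x <= 2.0:
        return -(x - 1.0) ** 2
    return x  # off-distribution it rewards something else entirely

# Indistinguishable on the training inputs...
training_inputs = [0.0, 0.5, 1.0, 1.5, 2.0]
assert all(base_objective(x) == mesa_objective(x) for x in training_inputs)

# ...yet optimizing each over a wider deployment range picks different points.
deployment_inputs = [x / 10 for x in range(-50, 51)]
print(max(deployment_inputs, key=base_objective))  # 1.0
print(max(deployment_inputs, key=mesa_objective))  # 5.0
```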
An implication is that all the reasons why mesa-optimizers can cause problems are also reasons why strategies for turning human values into a function can go wrong. For example, value learning strategies seem vulnerable to the same pseudo-alignment problems. Admittedly, I do not have a good understanding of current approaches to value learning, so I am not sure whether this is a real concern. (Assuming the authors of this post are competent, if a similar concern existed in value learning, I think they would have mentioned it. This suggests either that I am wrong about this being a problem or that no one has given it serious thought. My priors are on the former, but I want to know why I’m wrong.)
I suspect that I’ve failed to understand something fundamental, because it seems like a lot of people who know a lot of stuff think this distinction is really important. In general, I think this paper is well written and extremely accessible to someone like me who has only recently started reading about AI safety.