The case for studying mesa optimization
Early elucidations of the alignment problem focused heavily on value specification: the idea that, given a powerful optimizer, we need some way of specifying our values so that the optimizer can produce good outcomes.
Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don’t even know how to optimize for anything, much less a perfect specification of human values.
Let’s assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?
The default approach to building AI right now is to train a deep learning model that performs well on some training set. But even if we could construct a training environment that reflected the world very well, and rewarded the model each time it did something good, in exact proportion to how good it really was according to our perfect utility function, this still would not be guaranteed to yield a beneficial artificial intelligence.
This problem is not a superficial one either; it is intrinsic to the way machine learning is currently done. To be more specific, we constructed our AI by searching over some class of models M and selecting those models that tended to do well on the training set. Crucially, we know almost nothing about the model that eventually gets selected. The most we can say is that our AI ∈ M, but since M is such a broad class, this tells us very little about what the model is actually doing.
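To make this concrete, here is a minimal, purely hypothetical sketch of the selection process in PyTorch; the architecture, data, and hyperparameters are arbitrary placeholders rather than anyone's actual training setup. The only thing the procedure tells us about the network it selects is that it achieves low loss on the training data.

```python
import torch
import torch.nn as nn

# The model class M: every possible weight setting of this (arbitrary) architecture.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# A toy training set standing in for "the world plus our reward signal".
inputs = torch.randn(256, 10)
targets = torch.randn(256, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# All the procedure guarantees is that the selected model is some element of M
# with low training loss; it gives us no direct handle on how the model
# achieves that loss internally.
```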
This is similar to the mistake evolution made when designing us: evolution selected for reproductive fitness, and the optimizers it eventually produced (us) pursue goals that have little to do with reproductive fitness. Unlike evolution, we can at least impose some hand-crafted constraints, like a regularization penalty, to guide our AI into safe regions of M. We can also open up our models and see what’s inside, and in principle simulate every aspect of their internal operations.
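As a rough illustration (again a hypothetical sketch, reusing the same toy setup as above), a hand-crafted constraint like a regularization penalty is just an extra term added to the training loss, and "opening up" the model amounts to reading out its weights and activations:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
inputs, targets = torch.randn(256, 10), torch.randn(256, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
l2_weight = 1e-3  # hand-chosen penalty strength (an arbitrary assumption)

for step in range(1000):
    optimizer.zero_grad()
    # Hand-crafted constraint: an L2 penalty nudges the search toward
    # "simpler" regions of M, but simplicity is only a crude proxy for safety.
    penalty = sum((p ** 2).sum() for p in model.parameters())
    loss = loss_fn(model(inputs), targets) + l2_weight * penalty
    loss.backward()
    optimizer.step()

# "Opening up" the model: every weight and activation is available to us...
print("first-layer weight norm:", model[0].weight.norm().item())
print("mean first-layer activation:", torch.relu(model[0](inputs)).mean().item())
# ...but raw numbers like these do not tell us what computation the network
# is implementing, let alone whether that computation is safe.
```

The L2 penalty is used here purely as the simplest example of a hand-crafted constraint; nothing about it identifies safe regions of M in any deeper sense.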
But this still isn’t looking very good, because we barely know anything about what kinds of computation are safe. What would we even look for? To make matters worse, our current methods for ML transparency are abysmally ill-equipped for the task of telling us what is going on inside our models.
The default outcome of all this is that eventually, as M grows larger with compute becoming cheaper and budgets getting bigger, gradient descent is bound to hit powerful optimizers that do not share our values.