I think this is pretty complicated, and stretches the meaning of several of the critical terms employed in important ways. I think what you said is reasonable given the limitations of the terminology, but ultimately, may be subtly misleading.
How I would currently put it (which I think strays further from the standard terminology than your analysis):
Take 1
Prediction is not a well-defined optimization problem.
Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it’s outer-aligned. The answer may be “no, because the Solomonoff prior contains malign stuff”.
Variational Bayes (with a given prior and variational loss) is similarly well-defined, and we can likewise ask whether it’s outer-aligned.
Minimizing square loss with a regularizing penalty is well-defined. Etc. Etc. Etc.
But “prediction” is not a clearly specified optimization target. Even if you fix the predictive loss (square loss, Bayes loss, etc.), you still need to specify a prior in order to get a well-defined expectation to minimize.
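To make the contrast concrete, here’s a sketch in my own notation (not from your post). MAP with prior $P(h)$ and data $D$ is the well-defined problem
$$\hat{h}_{\mathrm{MAP}} = \arg\max_h \; P(h)\,P(D \mid h),$$
variational Bayes with prior $p(h)$ and variational family $q_\phi$ is
$$\max_\phi \; \mathbb{E}_{q_\phi(h)}\big[\log p(D \mid h)\big] - \mathrm{KL}\!\left(q_\phi(h)\,\|\,p(h)\right),$$
and regularized square loss is
$$\min_\theta \; \sum_i \big(f_\theta(x_i) - y_i\big)^2 + \lambda\,\Omega(\theta).$$
By contrast, “predict well” with a fixed loss $\ell$ only becomes an optimization problem once you pick a distribution $\mu$ to take the expectation over,
$$\min_\theta \; \mathbb{E}_{(x,y)\sim\mu}\big[\ell(f_\theta(x), y)\big],$$
and choosing $\mu$ is exactly the choice of prior.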
So the really well-defined question is whether specific predictive optimization targets are outer-aligned at optimum. And this type of outer alignment seems to require the target to discourage mesa-optimizers!
This is a problem for the existing terminology, since it means these objectives are not outer-aligned unless they are also inner-aligned.
Take 2
OK, but maybe you object. I’m assuming that “optimization” means “optimization of a well-defined function which we can completely evaluate”. But (you might say) we can also optimize under uncertainty; we do this all the time. In your post, you frame “optimal performance” in terms of loss+distribution. Machine learning treats the data as a sample from the true distribution and uses this sample as a proxy, adding regularizers precisely because the sample is an imperfect proxy (though the regularized objective is still just a proxy).
So, in this frame, we think of the true target function as the average loss on the true distribution (i.e., the distribution which will be encountered in the wild), and we think of gradient descent (and the other optimization methods used inside modern ML) as optimizing a proxy for it (which is totally normal for optimization under uncertainty).
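Spelled out in the standard empirical-risk picture (again my notation): the “true target” in this frame is the deployment risk
$$R(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{deploy}}}\big[\ell(f_\theta(x), y)\big],$$
while what gradient descent actually minimizes is the proxy
$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(f_\theta(x_i), y_i) + \lambda\,\Omega(\theta),$$
the empirical risk on the training sample plus a regularizer $\Omega$, added precisely because the sample is an imperfect stand-in for $\mathcal{D}_{\mathrm{deploy}}$.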
With this frame, I think the situation gets pretty complicated.
Take 2.1
Sure, OK: if it’s actually just predicting the real stuff, this seems pretty outer-aligned. Pedantic note: the term “alignment” is weird here. It’s not “perfectly aligned” in the sense of perfectly forwarding human values, but it could be non-malign, which I think is what people mostly mean by “AI alignment” when they’re being careful about meaning.
Take 2.2
But this whole frame is saying that once we have outer alignment, the problem that’s left is the problem of correctly predicting the future. We have to optimize under uncertainty because we can’t predict the future. An outer-aligned loss function can nonetheless yield catastrophic results because of distributional shift. The Solomonoff prior is malign because it doesn’t represent the future with enough accuracy, instead containing some really weird stuff.
So, with this terminology, the inner alignment problem is the prediction problem. If we can predict well enough, then we can set up a proxy which gets us inner alignment (by heavily penalizing malign mesa-optimizers for their future treacherous turns). Otherwise, we’re stuck with the inner alignment problem.
So given this use of terminology, “prediction is outer-aligned” is a pretty weird statement. Technically true, but prediction is the whole inner alignment problem.
Take 2.3
But wait, let’s reconsider 2.1.
In this frame, “optimal performance” means optimal at deployment time. This means we get all the strange incentives that come from online learning. We aren’t actually doing online learning, but optimal performance would respond to those incentives anyway.
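One way to make those “strange incentives” precise (borrowing the performative-prediction framing, which is my gloss rather than anything in your post): if the deployed predictor’s outputs influence the distribution it is later scored on, write that distribution as $\mathcal{D}(\theta)$. Then “optimal at deployment time” means minimizing the performative risk
$$\mathrm{PR}(\theta) = \mathbb{E}_{z \sim \mathcal{D}(\theta)}\big[\ell(z; \theta)\big],$$
and the minimizer can prefer predictions that push the world toward being easy to predict (self-fulfilling prophecies) over predictions that passively track a fixed world.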
(You somewhat circumvent this in your “extending the training distribution” section when you suggest proxies such as the Solomonoff distribution rather than using the actual future to define optimality. But this can reintroduce the same problem, and more besides. Option #1, the Solomonoff prior, is probably accurate enough to reintroduce the problems with self-fulfilling prophecies, besides being malign in other ways. Option #3, using a physical quantum prior, requires a solution to quantum gravity, and is also probably accurate enough to reintroduce the same self-fulfilling-prophecy problems. The only option I consider feasible is #2, human priors, because humans could notice this whole problem and refuse to be part of a weird loop of self-fulfilling prediction.)