You have no guarantees, sure, but that’s a problem with deep learning in general and not just inner alignment. The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal. To the extent your algorithm is able to approximate the true optimum during training, it will behave safely during deployment.
that’s a problem with deep learning in general and not just inner alignment
I think you are understanding inner alignment very differently than we define it in Risks from Learned Optimization, where we introduced the term.
The point is, if your model is not optimizing that reward function then its performance during training will be suboptimal.
This is not true for deceptively aligned models, which is the situation I’m most concerned about. Moreover, as we argue extensively in Risks from Learned Optimization, there are a lot of reasons why a model might end up pursuing a simpler, faster, or easier-to-find proxy even if that proxy yields suboptimal training performance.
It may be helpful to point to specific sections of such a long paper.
(Also, I agree that a neural network trained with that reward could produce a deceptive model that makes a well-timed error.)