I think it’s right. Inner alignment is getting the mesa-optimizers (agents) aligned with the overall objective. Outer alignment ensures the AI understands an overall objective that humans want.
Not quite. Inner alignment, as originally conceived, is about the degree to which the trained model is optimizing for accomplishing the outer objective. Theoretically, you can have an inner-misaligned model that doesn’t have any subagents (though I don’t think this is how realistic AGI will work).
E.g., I weakly suspect that a reason deep learning models are so overconfident is an inner misalignment between the predictive patterns SGD instills and the outer optimization criterion, where SGD systematically under-penalizes the model’s predictive patterns for overconfident mispredictions. If true, that would be an inner misalignment without any sort of deception or agentic optimization from the model’s predictive patterns, just an imperfection in the learning process.
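(To make “overconfident” concrete, here’s a minimal sketch of the kind of calibration check I have in mind, in plain NumPy, with hypothetical `probs`/`labels` arrays standing in for a real model’s outputs. It just compares average confidence to accuracy, ECE-style; it’s a way of measuring the symptom, not evidence for the SGD-under-penalization story.)

```python
import numpy as np

def calibration_gap(probs, labels, n_bins=10):
    """Crude calibration check: compare stated confidence to actual accuracy.

    probs  -- (N, C) array of predicted class probabilities
    labels -- (N,) array of true class indices
    Returns (avg_confidence, accuracy, expected_calibration_error).
    """
    confidences = probs.max(axis=1)      # the model's stated confidence
    predictions = probs.argmax(axis=1)   # the predicted class
    correct = (predictions == labels).astype(float)

    # A well-calibrated model has average confidence roughly equal to accuracy.
    avg_conf, acc = confidences.mean(), correct.mean()

    # Binned version (expected calibration error): weight each confidence bin's
    # |confidence - accuracy| gap by the fraction of samples that land in it.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return avg_conf, acc, ece

# Toy usage with random "predictions" in place of a real model's outputs:
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10) * 0.3, size=1000)  # peaked, i.e. "confident"
labels = rng.integers(0, 10, size=1000)
print(calibration_gap(probs, labels))
```

When average confidence comes out well above accuracy, that’s the mismatch between what the training signal rewarded and what we actually wanted that I’m gesturing at.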
More broadly, I don’t think we actually want truly “inner-aligned” AIs. I think that humans, and RL systems more broadly, are inner-*misaligned* by default, and that this fact is deeply tied in with how our values actually work. I think that, if you had a truly inner-aligned agent acting freely in the real world, that agent would wirehead itself as soon as possible (which is the action that generates maximum reward for a physically embedded agent). E.g., humans being inner-misaligned is why people who learn that wireheading is possible for humans don’t immediately drop everything in order to wirehead.
I see. So the agent issue I address above is a sub-issue of overall inner alignment.
In particular, I was addressing the deceptively aligned mesa-optimizers discussed here: https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers
Thanks!