I’m still a bit confused by the difference between inner alignment and out-of-distribution generalization. What’s the fundamental difference between the cat-classifying problem and the maze problem? The model itself is an optimizer in the latter case? But why is that special?
What if the neural network used to solve the maze problem just learns a mapping (but doesn’t do any search)? Is that still an inner-alignment problem?
Inner alignment is only defined for mesa optimizers (i.e., models that run a search), so the answer to your second paragraph is no, it wouldn’t be.
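To make the mapping-vs-search distinction concrete, here’s a minimal toy sketch of the two kinds of maze policy. Everything in it (the grid maze, the lookup table, the “reach the green arrow” mesa-goal) is my own illustrative assumption rather than anything from the original setup: the first policy just reproduces a learned observation-to-action mapping, while the second runs an internal search toward whatever objective it happened to internalize.

```python
from collections import deque

# 1) A policy that is just a learned mapping: observation -> action, no search.
# A feedforward net that memorizes behavior plays the same role as this table.
LOOKUP_POLICY = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 1): "down",
}

def mapping_policy(state):
    """No internal objective and no search: it simply reproduces learned behavior."""
    return LOOKUP_POLICY.get(state, "noop")

# 2) A mesa-optimizer-style policy: it runs an internal search against its *own*
# objective (a mesa-objective), which may or may not match the training objective.
def mesa_policy(state, maze, mesa_goal):
    """Breadth-first search toward whatever `mesa_goal` the model internalized.

    If training only ever showed mazes where the exit and a green arrow coincide,
    the internalized mesa_goal might be "reach the green arrow" rather than
    "reach the exit" -- that mismatch is the inner-alignment worry.
    """
    frontier = deque([(state, [])])
    seen = {state}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == mesa_goal:
            return path[0] if path else "noop"
        for action, (dr, dc) in [("up", (-1, 0)), ("down", (1, 0)),
                                 ("left", (0, -1)), ("right", (0, 1))]:
            nxt = (r + dr, c + dc)
            if nxt in maze and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return "noop"

# Example: a 2x2 open maze where the mesa-objective sits at the bottom-right cell.
maze = {(0, 0), (0, 1), (1, 0), (1, 1)}
print(mesa_policy((0, 0), maze, mesa_goal=(1, 1)))  # -> "down"
```

Only the second policy has an objective of its own for training to get wrong in the inner-alignment sense; if the first one misbehaves off-distribution, that’s ordinary failure to generalize.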
Why is this special? Speaking from my own understanding here: in a nutshell, because the optimization process is where most of the risk comes from, for two reasons. One is that a system not running an optimization process probably can’t do things with large-scale negative consequences, no matter how terribly “misaligned” it is. This is why comprehensive AI services (a model for the future where we build lots of narrow systems that don’t run searches) arguably has the potential to avoid x-risk (the problem there being competitiveness). The other is that you can’t get the behavior where a misaligned model appears to work great because it’s doing what you want for instrumental reasons (i.e., deceptive alignment) without an inner optimizer.
I would agree that distributional shift isn’t all that different from inner alignment on a conceptual level.