Inner alignment: what are we pointing at?
Proof that a model is an optimizer says very little about the model. I do not know what a research group studying outer alignment is actually studying. Inner alignment seems to cover the entire problem in the limit. Whether an optimizer is mesa or not depends on your point of view. These terms seem to be a magnet for confusion and debate. I have to do background reading on someone just to understand what claim they're making. These are all indicators that we are using the wrong terms.
What are we actually pointing at? What questions do we want answered?
1. Do we care if a model is an optimizer? Is it important whether it creates plans through explicit search or through a clever collection of heuristics? A poor search algorithm cannot plan much, and sufficiently clever heuristics can take you to any goal. What's the important metric?
2. Sometimes a model will have great capacity to shape its environment but little inclination to do so. How do we divide capacity from inclination in a way that closely corresponds to the agents and models we actually observe? (One could argue that capacity and inclination cannot be separated, but the right definitions would split them cleanly apart.)
3. When you specify what you want a model to do in code, what is the central difficulty? Is there a common risk or failure mode, shared between giving examples and giving loss/reward/value functions, that we can name?
4. Is there a clear, accepted term for when models fail to maintain desired behavior under distribution shift?
5. Should we distinguish between trained RL models that optimize and spontaneous agents that emerge in dynamical systems? One might expect the first to happen almost always and the second very rarely. What's the key difference?
I’ll post my answers to these questions in a couple of days, but I’m curious how other people slice it. Does “inner alignment failure” mean anything, or do we need to point more directly?
I’ll take a stab at answering the questions for myself (fairly quick takes):
1. No, I don’t care whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it’s robustly able to achieve things, it doesn’t matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)
2. I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g., operationalised as the fine-tuning effort required to redirect inclinations (see the first sketch after this list).
3. I think there are two common failure modes: incorrect labels (when the feedback provider isn’t capable enough to assess the examples it needs to evaluate) and underspecification (leading to goal misgeneralisation).
4. Goal misgeneralisation. More broadly (to also include capability misgeneralisation), robustness failures (see the second sketch after this list for a toy example).
5. No, I don’t think they’re important to distinguish.
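To make “fine-tuning effort required to redirect inclinations” slightly more concrete, here is a minimal sketch assuming a toy PyTorch policy; the ToyPolicy class, the choice of new target action, and the 90% threshold are illustrative assumptions rather than a proposed benchmark. The measure is simply the number of gradient steps until the policy puts most of its probability mass on the new target.

```python
# Hypothetical sketch: "retargetability" as the number of fine-tuning steps
# needed to redirect a policy's inclination toward a new goal. All names,
# architectures, and thresholds here are illustrative, not a real benchmark.
import torch
import torch.nn as nn


class ToyPolicy(nn.Module):
    """A tiny policy over 4 discrete actions, conditioned on an 8-dim state."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, state):
        return self.net(state)  # action logits


def steps_to_retarget(policy, new_target_action=2, threshold=0.9, max_steps=10_000):
    """Fine-tune until the policy puts >= `threshold` probability on the new
    target action (averaged over random states); return the steps taken.
    Fewer steps means the inclination was easier to redirect, i.e. the model
    is more retargetable."""
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for step in range(1, max_steps + 1):
        states = torch.randn(64, 8)
        logits = policy(states)
        # Cross-entropy toward the new goal stands in for the new objective.
        targets = torch.full((64,), new_target_action, dtype=torch.long)
        loss = nn.functional.cross_entropy(logits, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

        with torch.no_grad():
            probs = torch.softmax(policy(states), dim=-1)[:, new_target_action]
        if probs.mean().item() >= threshold:
            return step
    return max_steps  # did not retarget within the budget


if __name__ == "__main__":
    policy = ToyPolicy()  # stands in for a pretrained model with some prior inclination
    print("fine-tuning steps to redirect:", steps_to_retarget(policy))
```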
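And a toy illustration of underspecification leading to goal misgeneralisation, assuming scikit-learn and synthetic data (the features, their scaling, and the logistic-regression setup are made up for the example): a spurious feature is perfectly correlated with the intended one during training, so the training objective never forces the model to choose between the two goals; when the correlation breaks under distribution shift, the model still classifies confidently but tracks the wrong feature.

```python
# Hypothetical toy example of goal misgeneralisation from an underspecified
# objective: a spurious feature (column 1) is perfectly correlated with the
# intended feature (column 0) during training, so the loss cannot tell the
# two goals apart. Under distribution shift the correlation breaks and the
# model confidently pursues the wrong goal. All numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training distribution: spurious feature = 10 * intended feature.
intended = rng.integers(0, 2, size=1000)
X_train = np.column_stack([intended, 10.0 * intended]).astype(float)
y_train = intended  # the intended goal: track column 0

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # ~1.0, looks fine

# Shifted distribution: the spurious feature now anti-correlates with the intended one.
intended_shift = rng.integers(0, 2, size=1000)
X_shift = np.column_stack([intended_shift, 10.0 * (1 - intended_shift)]).astype(float)
y_shift = intended_shift

print("shifted accuracy:", clf.score(X_shift, y_shift))  # ~0.0: it tracks the spurious feature
print("learned weights:", clf.coef_[0])  # most weight on the spurious column
```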
Thanks, I especially like “vague/incorrect labels” as a term for that mismatch. Well-posed question by Garrabrant; I might touch on that in my next post.