I’ll take a stab at answering the questions for myself (fairly quick takes):
No, I don’t care about whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it’s robustly able to achieve things, it doesn’t matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)
I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g., operationalised as fine-tuning effort required to redirect inclinations.
I think there are two: incorrect labels (when the feedback provider isn’t capable enough to assess the examples it needs to evaluate), and underspecification (leading to goal misgeneralisation).
Goal misgeneralisation. More broadly (to also include capability misgeneralisation), robustness failures.
No, I don’t think they’re important to distinguish.
Thanks, especially like vague/incorrect labels to refer to that mismatch. Well-posed Q by Garrabrant, might touch on that in my next post.