Excellent! In particular, it seems like oversight techniques which can pass tests like these could work in worlds where alignment is very difficult, so long as AI progress doesn’t involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes).
I’d say this corresponds to 7 on my alignment difficulty table.
What do you mean by “so long as AI progress doesn’t involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)”? I’m not proposing that we rely on any analogies between low and high capability regimes.
In the sense that there has to be an analogy between low and high capabilities somewhere, even if at the meta level.
This method lets you catch dangerous models that can break oversight processes for the same fundamental reasons as less dangerous models, not just for the same inputs.
Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution. It does not involve generalizing from lower capabilities to higher capabilities; you just run this eval on the actual model whose local adequacy of oversight you care about measuring.
(Sorry to come in disagreeably here, but I think of this as a really important point, so I want to make sure we argue it out.)
You’re right, I’ve reread the section and that was a slight misunderstanding on my part.
Even so, I still think it falls at a 7 on my scale, as it’s a way of experimentally validating oversight processes that gives you some evidence about how they’ll work in unseen situations.
I’d say the main point here is that I don’t want to rely on my ability to extrapolate anything about how the model behaves in “unseen situations”; I want to run this eval in every situation where I’m deploying my model.