Very interesting!
Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn’t give us much assurance about the behavior of the model.
Note that the model’s output isn’t what’s relevant for the neutrality measure; what matters is the algorithm the model is internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it’s important to have some sort of myopia guarantee.