Ofer comments on Impact measurement and value-neutrality verification

Ofer 15 Oct 2019 8:59 UTC
LW: 4 AF: 3
Very interesting!
Regarding value-neutrality verification: If deceptive alignment occurs, the model might output whatever minimizes the neutrality measure, as an instrumental goal [ETA: and it might not do that when it detects that it is currently not being used for computing the neutrality measure]. In such a case it seems that a successful verification step shouldn’t give us much assurance about the behavior of the model.
- evhub 15 Oct 2019 20:43 UTC
  LW: 3 AF: 2
  Note that the model’s output isn’t what’s relevant for the neutrality measure; what matters is the algorithm the model is internally implementing. That being said, this sort of trickery is still possible if your model is non-myopic, which is why it’s important to have some sort of myopia guarantee.