Consider a model that is probably alignment faking, such that you have no idea what its actual values are; all you know is that it's pretending to have the values it's being trained to have.
This seems like a contradiction to me. This method of detecting alignment faking relies on the model having existing, predictable trained values, so that deviations stand out (whether or not they are explained in a reasoning trace). If I know nothing about a model's values to begin with, I have no way to tell whether it is alignment faking at all.
If, using some different interpretability technique, I could determine that a model was gradient hacking with no external indication that it was doing so, then I would certainly not have confidence in its values! But that's a different scenario.
If Anthropic’s position is that the current “alignment faking” results bear on that second scenario, I think that argument needs to be laid out more explicitly.