Sure, but then why not just train a probe? If we don’t care about much precision what goes wrong with the probe approach?
Here’s a reasonable example where naively training a probe fails. The model lies if any of N features is “true”. One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.
Here’s a reasonable example where naively training a probe fails. The model lies if any of N features is “true”. One of the features is almost always activated at the same time as some others, such that in the training set it never solely determines whether the model lies.
Then, a probe trained on the activations may not pick up on that feature. Whereas if we can look at model weights, we can see that this feature also matters, and include it in our lying classifier.
This particular case can also be solved by adversarially attacking the probe though.