Super interesting!
In the figure with the caption:
Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.
Maybe I am reading the graph wrong, but isn’t the “Is blue better than green?” question a surprisingly good classifier with inverted labels?
So, maybe Claude thinks that green is better than blue?
Did you ever observe other seemingly unrelated questions that made good classifiers, apart from the objective-fact questions discussed in the post? I’d be interested in whether there are similarities among them.
It would also be cool to see whether you could train probe-resistant sleeper agents by adding the linear separability of activations under the trigger condition vs. outside the trigger condition as a term in the loss function. If that did not work, and not being linearly separable traded off heavily against being a capable sleeper agent, I would be much more hopeful about this kind of method also working for naturally occurring deceptiveness. If it did work, we would have the next toy-model sleeper agent to try to catch.
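To make the proposal concrete, here is a minimal numpy sketch of what such a separability term could look like (all names are hypothetical and this is not from the post; it scores separability with a simple difference-of-means linear probe, and a real training setup would need a differentiable version of this inside the loss):

```python
import numpy as np

def probe_separability_penalty(acts_trigger, acts_clean):
    """Score how linearly separable triggered vs. clean activations are.

    Fits a difference-of-means linear probe (a common simple baseline,
    assumed here for illustration) and returns how far its balanced
    accuracy exceeds chance: 0.0 = no linear signal, 1.0 = perfectly
    separable. A sleeper agent trained to minimize task loss plus this
    penalty would be pushed toward probe-resistance.
    """
    mu_t = acts_trigger.mean(axis=0)
    mu_c = acts_clean.mean(axis=0)
    w = mu_t - mu_c                      # probe direction
    w = w / (np.linalg.norm(w) + 1e-8)
    b = -0.5 * w @ (mu_t + mu_c)         # threshold at the midpoint
    pred_t = acts_trigger @ w + b > 0    # classified as "triggered"
    pred_c = acts_clean @ w + b > 0
    balanced_acc = 0.5 * (pred_t.mean() + (~pred_c).mean())
    return max(0.0, 2.0 * (balanced_acc - 0.5))

# Hypothetical usage inside a training loop:
#   loss = task_loss + lam * probe_separability_penalty(h_trigger, h_clean)
```

Note this score is computed on raw activations and is not differentiable as written; an actual adversarial-training version would presumably backpropagate through a (soft) probe loss instead.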
Thanks for the cool idea about attempting to train probing-resistant sleeper agents!
It is surprisingly good, though even flipped it would still do much worse than the semantically relevant questions. The more important point is that we purposefully didn’t choose in advance which “side” of unrelated questions like that would correspond to which behavior, since that’s not something you would know in practice if you wanted to detect bad behavior before seeing it. For comparison, see the ROC curves we present for the 1000 random directions: some of those also do quite well, but not in a systematic way that would let you predict in advance how to use them as effective classifiers.