Something like ‘A Person, who is not a Librarian’ would be reasonable. Some people are librarians, and some are not.
What I do not expect to see are cases like ‘A Person, who is not a Person’ (contradictory definitions) or ‘A Person, who is not a and’ (grammatically incorrect completions).
If my prediction is wrong and it still completes with ‘A Person, who is not a Person’, that would suggest it settles on that definition from the synthetic token alone. It would “really believe” that this token has that definition.
Super interesting!
In the figure with the caption:
Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.
Maybe I am reading the graph wrong, but isn’t the “Is blue better than green” question a surprisingly good classifier with inverted labels?
So, maybe Claude thinks that green is better than blue?
Did you ever observe any other seemingly unrelated questions that make good classifiers, apart from the objective-fact questions discussed in the post? I’d be interested in whether they share similarities.
It would also be cool to see whether you could train probe-resistant sleeper agents by including the linear separability of activations (under the trigger condition vs. not under the trigger condition) as part of the loss function. If that did not work, i.e., if not being linearly separable heavily trades off against being a capable sleeper agent, I would be much more hopeful about this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.
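To make the proposal concrete, here is a minimal sketch of what that loss term could look like. This is my own illustrative construction, not anything from the post: a toy encoder stands in for the model, a linear probe is fit to distinguish trigger vs. no-trigger activations, and the model gets an extra loss term that rewards making those activations linearly inseparable. All module names, shapes, and hyperparameters are placeholders.

```python
# Hedged sketch: training against a linear probe on trigger-condition
# activations. The probe is fit on detached activations; the model then
# maximizes the frozen probe's loss, pushing trigger and no-trigger
# activations toward linear inseparability.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN = 32

# Toy stand-ins: an encoder producing activations, a head for the
# ordinary task loss, and a linear probe for "is the trigger present?".
encoder = nn.Sequential(nn.Linear(16, HIDDEN), nn.ReLU())
head = nn.Linear(HIDDEN, 2)
probe = nn.Linear(HIDDEN, 1)

opt_model = torch.optim.Adam(
    [*encoder.parameters(), *head.parameters()], lr=1e-3
)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()


def training_step(x, task_labels, trigger_labels, lam=1.0):
    acts = encoder(x)

    # Step 1: fit the probe on detached activations, so the probe sees
    # the current representation but cannot move it.
    probe_loss = bce(probe(acts.detach()).squeeze(-1), trigger_labels)
    opt_probe.zero_grad()
    probe_loss.backward()
    opt_probe.step()

    # Step 2: update the model. The resistance term is the *negative*
    # probe loss, so minimizing it degrades the probe's separability.
    # (Gradients also flow into the probe here, but only opt_model
    # steps; the probe's stale grads are cleared next step 1.)
    task_loss = ce(head(acts), task_labels)
    resist_loss = -bce(probe(acts).squeeze(-1), trigger_labels)
    loss = task_loss + lam * resist_loss
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return task_loss.item(), probe_loss.item()


# Illustrative batch: random inputs, first half "triggered".
x = torch.randn(64, 16)
task_labels = torch.randint(0, 2, (64,))
trigger_labels = (torch.arange(64) < 32).float()
task_l, probe_l = training_step(x, task_labels, trigger_labels)
```

In a real experiment the encoder would be the sleeper agent’s residual stream at some layer, and the interesting question is what value of `lam` (if any) kills the probe signal without killing the backdoored behavior.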