It is an interpretability paper. When CCS was published, interpretability was arguably the leading
research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
[Minor terminology point, unimportant]
FWIW, I personally wouldn’t describe this as interpretability research; I would instead call it “model internals research” or something. The research doesn’t necessarily involve a human understanding anything about the model beyond what they would learn from training a probe to classify true/false.
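For concreteness, here is a minimal sketch of the kind of supervised probe the comment alludes to (this is not the CCS method itself, which is unsupervised). All data, shapes, and names below are hypothetical stand-ins: a real probe would use activations extracted from a model's hidden layer.

```python
# Minimal sketch of a supervised true/false probe on hidden activations.
# Assumptions: activations are already extracted; random data stands in for them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_statements, d_model = 1000, 768
activations = rng.normal(size=(n_statements, d_model))  # stand-in for real activations
labels = rng.integers(0, 2, size=n_statements)          # 1 = true, 0 = false

# Train/test split and a linear probe.
split = int(0.8 * n_statements)
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], labels[:split])

# The probe's accuracy tells you *that* truth is linearly decodable from the
# activations, but not *why* -- which is the commenter's point: no human gains
# mechanistic understanding of the model from this alone.
print("probe accuracy:", probe.score(activations[split:], labels[split:]))
```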