It is an interpretability paper. When CCS was published, interpretability was arguably the leading
research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
[Minor terminology point, unimportant]
FWIW, I personally wouldn’t describe this as interpretability research; I would instead call it “model internals research” or something. The research doesn’t necessarily involve a human understanding anything about the model beyond what they would learn from training a probe to classify true/false.
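For concreteness, here is a minimal sketch of the kind of supervised probe the comment alludes to (this is not the CCS method itself, which is unsupervised). All data, shapes, and names below are hypothetical stand-ins: a real probe would use activations extracted from a model's hidden layer.

```python
# Minimal sketch of a supervised true/false probe on hidden activations.
# Assumptions: activations are already extracted; random data stands in for them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_statements, d_model = 1000, 768
activations = rng.normal(size=(n_statements, d_model))  # stand-in for real activations
labels = rng.integers(0, 2, size=n_statements)          # 1 = true, 0 = false

# Train/test split and a linear probe.
split = int(0.8 * n_statements)
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], labels[:split])

# The probe's accuracy tells you *that* truth is linearly decodable from the
# activations, but not *why* -- which is the commenter's point: no human gains
# mechanistic understanding of the model from this alone.
print("probe accuracy:", probe.score(activations[split:], labels[split:]))
```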