I speculate that at least three factors made CCS viral:
It was published shortly after the Eliciting Latent Knowledge (ELK) report. At the time, ELK was not just exciting; it was new and exciting.
It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
CCS mathematizes “truth” and explains it clearly. The project of human rationality is, in large part, about truth-seeking, and it would be really nice if that project also helped with the alignment problem. So CCS is an idea that people want to see work.
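
For readers who haven't seen the details, the mathematization is a small unsupervised objective over contrast pairs (the same statement answered “Yes” vs. “No”). A rough sketch of that objective, as I understand it from the Discovering Latent Knowledge paper; the probe and tensor names below are illustrative, not the paper's actual code:

```python
# Rough sketch of the CCS objective; names and shapes are illustrative.
import torch
import torch.nn as nn

d_model = 4096  # hypothetical hidden-state width

# A linear probe mapping a hidden state to a probability that the statement is true.
probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

def ccs_loss(phi_pos: torch.Tensor, phi_neg: torch.Tensor) -> torch.Tensor:
    """phi_pos / phi_neg: normalized hidden states for the "Yes" / "No"
    phrasings of each statement, shape (batch, d_model)."""
    p_pos = probe(phi_pos).squeeze(-1)
    p_neg = probe(phi_neg).squeeze(-1)
    # Consistency: the two phrasings should behave like negations of each other.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```

Nothing in this objective uses truth labels, which is a large part of its appeal.
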
[Minor terminology point, unimportant]

> It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.

FWIW, I personally wouldn’t describe this as interpretability research; I would instead call it “model internals research” or something. The research doesn’t necessarily involve a human understanding anything about the model beyond what they would understand from training a probe to classify true/false.
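
For concreteness, “training a probe to classify true/false” here just means fitting a linear classifier on frozen activations against known labels. A minimal sketch, assuming you already have per-statement hidden states and true/false labels (the array names and the random stand-in data below are hypothetical):

```python
# Minimal sketch of a supervised true/false probe, with hypothetical array
# names and random stand-in data in place of real model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))  # one hidden-state vector per statement
labels = rng.integers(0, 2, size=1000)       # 1 if the statement is true, else 0

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The "probe" is just a logistic regression on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

Nothing in that loop requires a human to understand what the direction the probe finds actually represents, which is the sense in which it isn’t obviously “interpretability.”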