Beyond the paper and post, it seems important to note the community reaction to this work. I think many people dramatically overrated the empirical results of this work due to a combination of misunderstanding what was actually done, misunderstanding why the method worked (which follow-up work helped to clarify, as you noted), and incorrectly predicting that the method would work in many cases where it doesn't.
The actual conceptual ideas discussed in the blog post seem quite good and somewhat original (it was certainly the best presentation of this sort of idea in this space at the time it came out). But I think the conceptual ideas got vastly more traction than they otherwise would have because people had a very misleadingly favorable impression of the empirical results. I might elaborate more on takeaways related to this in a follow-up post.
I speculate that at least three factors made CCS viral:
It was published shortly after the Eliciting Latent Knowledge (ELK) report. At that time, ELK was not only exciting but also new.
It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
CCS mathematizes “truth” and explains it clearly. It would be really nice if the project of human rationality also helped with the alignment problem. So, CCS is an idea that people want to see work.
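For context on what "mathematizes" means here: CCS trains a probe p on the hidden states of a contrast pair (the same statement completed as true, x_i^+, and as false, x_i^-) so that the probe is both consistent and confident. A rough sketch of the loss, as I understand it from the original paper:

```latex
% Rough sketch of the CCS loss: p(x) is the probe's credence that x is true,
% computed from the hidden states of the contrast pair (x_i^+, x_i^-).
\[
  \mathcal{L}_{\mathrm{CCS}}
  \;=\; \frac{1}{n} \sum_{i=1}^{n}
    \Big[ \underbrace{\big(p(x_i^+) - (1 - p(x_i^-))\big)^2}_{\text{consistency}}
        + \underbrace{\min\big(p(x_i^+),\, p(x_i^-)\big)^2}_{\text{confidence}} \Big]
\]
```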
It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
[Minor terminology point, unimportant]
FWIW, I personally wouldn't describe this as interpretability research; I would instead call it "model internals research" or something similar. The research doesn't necessarily involve a human understanding anything about the model beyond what they would learn from training a probe to classify true/false.
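For concreteness, "training a probe to classify true/false" means something like the following minimal sketch (the activations and labels are random placeholders, purely to illustrate the shape of the supervised baseline, not code from the paper):

```python
# Minimal sketch of a supervised "truth probe": a linear classifier fit on
# model activations with true/false labels. Data below is randomly generated,
# standing in for real hidden states and annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 768))       # placeholder hidden activations
labels = rng.integers(0, 2, size=200)    # placeholder true (1) / false (0) labels

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
```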
I agree that people dramatically overrated the empirical results of this work, but not more so than other pieces that “went viral” in this community. I’d be excited to see your takes on this general phenomenon as well as how we might address it in the future.
I agree it's not more than other pieces that "went viral", but I think the lasting impact of the misconceptions seems much larger in the case of CCS. This is probably because the conceptual ideas actually hold up in the case of CCS.