I’m excited to see both of these papers. It would be a shame if we reached superintelligence before anyone had put serious effort into examining whether CoT prompting could be a useful way to understand models.
I also had some uneasiness while reading some sections of the post. For the rest of this comment, I’m going to try to articulate where that’s coming from.
“Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.”
I notice that I’m pretty confused when I read this sentence. I didn’t read the whole paper, and it’s possible I’m missing something. But one reaction I have is that this sentence presents an overly optimistic interpretation of the paper’s findings.
The sentence makes me think that Anthropic knows how to choose the model size and task in a way that guarantees faithful explanations. Perhaps I missed something when skimming the paper, but my understanding is that we’re still very far away from that. In fact, my impression is that the tests in the paper can detect unfaithful reasoning, but we don’t yet have strong tests to confirm that a model is engaging in faithful reasoning.
Even supposing the sentence is accurate, there are many possible “last sentences” that could have been emphasized. For example, the last sentence could have read:
“Overall, our results suggest that CoT is not reliably faithful, and further research will be needed to investigate its promise, especially in larger models.”
“Unfortunately, this finding suggests that CoT prompting may not yet be a promising way to understand the internal cognition of powerful models.” (note: “this finding” refers to the finding in the previous sentence: “as models become larger and more capable, they produce less faithful reasoning on most tasks we study.”)
“While these findings are promising, much more work on CoT is needed before it can be used as a technique to reliably produce faithful explanations, especially in larger models.”
I’ve had similar reactions when reading other posts and papers by Anthropic. For example, in Core Views on AI Safety, Anthropic presents (in my opinion) a much rosier picture of Constitutional AI than is warranted. It also almost exclusively cites its own work and doesn’t really acknowledge other important segments of the alignment community.
But Akash, aren’t you just being naive? It’s fairly common for scientists to present slightly-overly-rosy pictures of their research. And it’s extremely common for profit-driven companies to highlight their own work in their blog posts.
I’m holding Anthropic to a higher standard. I think this is reasonable for any company developing technology that poses a serious existential risk. I also think this is especially important in Anthropic’s case. Anthropic has noted that their theory of change heavily involves: (a) understanding how difficult alignment is, (b) updating their beliefs in response to evidence, and (c) communicating alignment difficulty to the broader world.
To the extent that we trust Anthropic not to fool themselves or others, this plan has a higher chance of working. To the extent that Anthropic falls prey to traditional scientific or corporate cultural pressures (and it has incentives to see its own work as more promising than it actually is), this plan is weakened. As noted elsewhere, epistemic culture also plays an important role in many alignment plans (see “Carefully Bootstrapped Alignment is organizationally hard”).
(I’ll conclude by noting that I commend the authors of both papers, and I hope this comment doesn’t come off as overly harsh toward them. The post prompted me to write about a phenomenon I’ve been noticing for a while, across various Anthropic comms, conversations with folks who work at Anthropic, Slack threads, and their governance plans. So this comment is not meant to pick on these particular authors; rather, the post felt like a relevant opportunity to illustrate a feeling I’ve often had.)