Some excellent points (and I enjoyed the neat self-referentialism).
Headline take is I agree with you that CoT unfaithfulness—as Turpin and Lanham have operationalised it—is unlikely to pose a problem for the alignment of LLM-based systems.
I think this for the same reasons you state:
1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;
and
2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1′s CoTs. The way that these CoTs work seems make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal “Hmmmm...” that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: “But wait… The user implied that option A was probably correct”. This is partially an empirical question—since we can’t see the o1 CoTs, I’d pipedream love to see OpenAI do and publish research on whether this is true.
This suggests to me that o1′s training might already have succeeded at giving us what we’d want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).
The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning (“shooting outside the eighteen is not a common phrase in soccer”) in order to support a particular decision.
You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes—this is why I’d be interested in seeing whether something like o1 is capable of front-to-back undertaking an entire complex and malicious action whilst managing to obscure the actual nature of that malicious action (such that an automated LLM judge fails to detect that the action is in fact malicious).
Some excellent points (and I enjoyed the neat self-referentialism).
Headline take is I agree with you that CoT unfaithfulness—as Turpin and Lanham have operationalised it—is unlikely to pose a problem for the alignment of LLM-based systems.
I think this for the same reasons you state:
1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;
and
2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1′s CoTs.
The way that these CoTs work seems make the model systematically hyper-attentive to different facets of its context window in a way that is verbalised in detail (and is typically enabled by a literal “Hmmmm...” that draws its attention to an unconsidered detail). It strikes me that this would totally rule out the chance that the model still pays attention to bias without saying it: “But wait… The user implied that option A was probably correct”. This is partially an empirical question—since we can’t see the o1 CoTs, I’d pipedream love to see OpenAI do and publish research on whether this is true.
This suggests to me that o1′s training might already have succeeded at giving us what we’d want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).
The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning (“shooting outside the eighteen is not a common phrase in soccer”) in order to support a particular decision.
You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes—this is why I’d be interested in seeing whether something like o1 is capable of front-to-back undertaking an entire complex and malicious action whilst managing to obscure the actual nature of that malicious action (such that an automated LLM judge fails to detect that the action is in fact malicious).