Re. 1, I think outcomes based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?
Can you say more? I don’t think I see why that would be.
Re 2-3, Agree unless we use models that are incapable of deceptive reasoning without CoT (due to number of parameters or training data).
Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).
Can you say more? I don’t think I see why that would be.
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
The simplest reason I can think of for why CoT improves performance yet doesn’t allow predictability is that the improvement is mostly a result of extra computation and the content of the CoT does not matter very much, since the LLM still doesn’t “understand” the Cot it produces the same way we do.
If you are using outcomes-based RL with a discount factor (γ in π∗=argmaxπE[∑∞k=0γkR(s,a,s′)]) or some other penalty for long responses, there is optimisation pressure towards using the abstractions in your reasoning process that most efficiently get you from the input query to the correct response.
NAH implies that the universe lends itself to natural abstractions, and therefore most sufficiently intelligent systems will think in terms of those abstractions. If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards, allowing these systems to interpret each other’s abstractions.
I naively expect o1′s CoT to be more faithful. It’s a shame that OpenAI won’t let researchers access o1 CoT; otherwise, we could have tested it (although the results would be somewhat confounded if they used process supervision as well).
Interesting, thanks, I’ll have to think about that argument. A couple of initial thoughts:
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: ‘a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.’ I think ‘Language Models Don’t Always Say What They Think’ shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model’s choice than without the CoT, but it’s nonetheless clearly not faithful.
If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards
I haven’t spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don’t consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models’ ontologies).
It does seem totally plausible to me that o1′s CoT is pretty faithful! I’m just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, where they find that models which behave in manipulative or deceptive ways act ‘as if they are always responding in the best interest of the users, even in hidden scratchpads’.
I’m a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like “Chain of Thought: The answer is A. Response: The answer is A”? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There’s some toy models it would be, but not most we’d be testing with interpretability.)
If the answer is always A because the model’s internal transformations carry out a reasoning process that always arrives at answer A reliably, in the same way that if we do a math problem we will get specific answers quite reliably, how would you ever expect the model to arrive at the answer “A because I have been tuned to say A?” The fact it was fine-tuned to say the answer doesn’t accurately describe the internal reasoning process that optimizes to say the answer, and would take a good amount more metacognition.
Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.
Can you say more? I don’t think I see why that would be.
Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
The simplest reason I can think of for why CoT improves performance yet doesn’t allow predictability is that the improvement is mostly a result of extra computation and the content of the CoT does not matter very much, since the LLM still doesn’t “understand” the Cot it produces the same way we do.
If you are using outcomes-based RL with a discount factor (γ in π∗=argmaxπE[∑∞k=0γkR(s,a,s′)]) or some other penalty for long responses, there is optimisation pressure towards using the abstractions in your reasoning process that most efficiently get you from the input query to the correct response.
NAH implies that the universe lends itself to natural abstractions, and therefore most sufficiently intelligent systems will think in terms of those abstractions. If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards, allowing these systems to interpret each other’s abstractions.
I naively expect o1′s CoT to be more faithful. It’s a shame that OpenAI won’t let researchers access o1 CoT; otherwise, we could have tested it (although the results would be somewhat confounded if they used process supervision as well).
Interesting, thanks, I’ll have to think about that argument. A couple of initial thoughts:
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: ‘a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.’ I think ‘Language Models Don’t Always Say What They Think’ shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model’s choice than without the CoT, but it’s nonetheless clearly not faithful.
I haven’t spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don’t consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models’ ontologies).
It does seem totally plausible to me that o1′s CoT is pretty faithful! I’m just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, where they find that models which behave in manipulative or deceptive ways act ‘as if they are always responding in the best interest of the users, even in hidden scratchpads’.
I’m a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like “Chain of Thought: The answer is A. Response: The answer is A”? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There’s some toy models it would be, but not most we’d be testing with interpretability.)
If the answer is always A because the model’s internal transformations carry out a reasoning process that always arrives at answer A reliably, in the same way that if we do a math problem we will get specific answers quite reliably, how would you ever expect the model to arrive at the answer “A because I have been tuned to say A?” The fact it was fine-tuned to say the answer doesn’t accurately describe the internal reasoning process that optimizes to say the answer, and would take a good amount more metacognition.
Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.