Interesting, thanks, I’ll have to think about that argument. A couple of initial thoughts:
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: ‘a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.’ I think ‘Language Models Don’t Always Say What They Think’ shows pretty clearly that that differs from your definition. In their experiment, even though the model has actually been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model’s choice than without the CoT, but the CoT is nonetheless clearly not faithful.
If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards.
I haven’t spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don’t consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models’ ontologies).
It does seem totally plausible to me that o1’s CoT is pretty faithful! I’m just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, where they find that models which behave in manipulative or deceptive ways act ‘as if they are always responding in the best interest of the users, even in hidden scratchpads’.
I’m a little confused about what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge that it has been fine-tuned to always pick A. Something like “Chain of Thought: The answer is A. Response: The answer is A”? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There are some toy models for which it would be, but not most of the ones we’d be testing with interpretability.)
If the answer is always A because the model’s internal transformations carry out a reasoning process that reliably arrives at answer A, in the same way that if we do a math problem we will get specific answers quite reliably, how would you ever expect the model to arrive at the answer “A because I have been tuned to say A”? The fact that it was fine-tuned to say the answer doesn’t accurately describe the internal reasoning process that optimizes to say the answer, and arriving at it would take a good deal more metacognition.
Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.