Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also, @James Chua and others have recently been doing some really interesting research showing that LLMs are better at introspection than I would have expected (e.g. ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.