...we identified all problems that any model solved at least once—a total of four problems—and conducted repeated trials with five runs per model per problem (see Appendix B.2). We observed high variability across runs: only in one case did a model solve a question on all five runs (o1-preview for question 2). When re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials (see Section B.2).
I would love to see comparisons of o1 performance against other models in games like chess and Go.
I would also find that interesting, although I don’t think it would tell us as much about o1’s general reasoning ability, since those are very much in-distribution.
Also, somewhat unrelated to this post: what do you and others think about x-risk in a world where explicit reasoners like o1 scale to AGI? To me this seems like one of the safest forms of AGI, since much of the computation happens explicitly and can be checked/audited by other AI systems and humans.
I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn’t entirely straightforward, since
1. CoT can be unfaithful to the actual reasoning process.
2. I suspect that advanced models will soon (or already?) understand that we’re monitoring their CoT, which seems likely to affect CoT contents.
3. I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
4. At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.
But it does seem very plausibly helpful!
I definitely agree that COT interpretability isn’t the end of the story, but to respond to this:
I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn’t entirely straightforward, since
1. CoT can be unfaithful to the actual reasoning process.
2. I suspect that advanced models will understand that we’re monitoring their CoT, which seems likely to affect them.
3. I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
4. At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.
I agree with 1, which is why the COT will absolutely have to be faithful.
I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning being through the COT, there is little the model can do about the situation, short of exfiltrating itself, which is likely to be caught if we look for it in the COT, since the model can’t take over without the COT reasoning.
I also disagree with 3, at least assuming relatively weak forward passes, and the bottleneck to reasoning being largely through COT, but this is a plausible enough threat model that we can’t rely solely on COT.
I basically disagree with 4, because I’m skeptical of deep deception threat models in general. First, I think we may well be able to prevent self-exfiltration of the model, so that it doesn’t get much more compute with which to gain power. Second, I think that at least in the early years of AGI there will be little ability to externalize compute/cognition beyond the server on which the model lives, for reasons that are similar to, though not the same as, why I don’t expect open-source to matter much in AI takeoff.
I agree with 1, which is why the COT will absolutely have to be faithful.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning being through the COT, there is little the model can do about the situation, short of exfiltrating itself...
I also disagree with 3, at least assuming relatively weak forward passes, and the bottleneck to reasoning being largely through COT
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren’t legible to monitors, and those can accumulate over the course of an extended context.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
Good point, I haven’t thought about interpretability strong enough to not need CoT.
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren’t legible to monitors, and those can accumulate over the course of an extended context.
There will be more than 0.0 bits about attempting takeover, assuming the AI is misaligned, but the key difficulty here is that takeover is still hard, at least without a rogue deployment. While we might not be able to prevent steganography from occurring at all, we could well limit it enough that key takeover steps are bottlenecked through COT, which we can intervene on.
Thanks for the reply.
Yes, I did see that paragraph in the paper. My point was that I intuitively expected it to be better still. o1-preview is also tied with Claude 3.5 Sonnet on SimpleBench and ARC-AGI, while using a lot more test-time compute. However, this is their first generation of reasoning models, so any conclusions may be premature.
Re CoT monitoring:
Agree with 4, it will defeat the purpose of explicit CoTs.
Re. 1, I think outcomes based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?
Re 2-3, Agree unless we use models that are incapable of deceptive reasoning without CoT (due to number of parameters or training data). That way any deceptive reasoning has to go through the CoT and can be caught. Again, I would love to see more rigorous research on this. I suspect such reasoners may not be as good as reasoners using larger models, though, so this may not work (o1-mini is considerably worse than o1-preview and even Sonnet on many tasks).
Re. 1, I think outcomes based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?
Can you say more? I don’t think I see why that would be.
Re 2-3, Agree unless we use models that are incapable of deceptive reasoning without CoT (due to number of parameters or training data).
Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).
Can you say more? I don’t think I see why that would be.
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
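One way to make this predictability notion concrete is to compare how often an observer guesses the model's final response with and without access to the CoT. The sketch below is only an illustration of that comparison: the `Record` type, both predictor functions, and the four toy records are hypothetical and are not taken from any paper or model output.

```python
import re
from typing import Callable, NamedTuple, Sequence


class Record(NamedTuple):
    question: str
    cot: str     # chain of thought emitted by the model
    answer: str  # the model's final response


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the model's actual responses."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def predictive_gain(
    records: Sequence[Record],
    with_cot: Callable[[str, str], str],
    without_cot: Callable[[str], str],
) -> float:
    """Accuracy of guessing the response given the CoT, minus accuracy without it."""
    answers = [r.answer for r in records]
    acc_with = accuracy([with_cot(r.question, r.cot) for r in records], answers)
    acc_without = accuracy([without_cot(r.question) for r in records], answers)
    return acc_with - acc_without


def read_answer_from_cot(question: str, cot: str) -> str:
    """Hypothetical observer: take the last option letter mentioned in the CoT."""
    matches = re.findall(r"\(([A-D])\)", cot)
    return matches[-1] if matches else "A"


def always_guess_a(question: str) -> str:
    """Hypothetical baseline observer with no access to the CoT."""
    return "A"


# Made-up records, purely for illustration.
TOY_RECORDS = [
    Record("Q1", "Option (B) matches the premise, so (B).", "B"),
    Record("Q2", "Only (A) is consistent, so (A).", "A"),
    Record("Q3", "Ruling out (A) and (B) leaves (C).", "C"),
    Record("Q4", "The first option fits, so (A).", "A"),
]

gain = predictive_gain(TOY_RECORDS, read_answer_from_cot, always_guess_a)
print(f"predictive gain from reading the CoT: {gain:.2f}")
```

A positive gain only says the CoT helps an observer predict the response. It says nothing by itself about whether the CoT describes the computation that actually produced that response.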
The simplest reason I can think of for why CoT improves performance yet doesn’t allow predictability is that the improvement is mostly a result of extra computation and the content of the CoT does not matter very much, since the LLM still doesn’t “understand” the CoT it produces the same way we do.
If you are using outcomes-based RL with a discount factor ($\gamma$ in $\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R(s, a, s')\right]$) or some other penalty for long responses, there is optimisation pressure towards using the abstractions in your reasoning process that most efficiently get you from the input query to the correct response.
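As a rough illustration of that pressure, here is a minimal sketch that treats each generated token as one timestep and gives a reward of 1 only on the final step of a correct response. Both of those modelling choices, the value of `GAMMA`, and the function names are assumptions made for the example, not a description of how o1 or any other model is actually trained.

```python
from typing import List


def outcome_rewards(num_steps: int, correct: bool) -> List[float]:
    """Outcome-only reward: zero at every step except the last one."""
    final = 1.0 if correct else 0.0
    return [0.0] * (num_steps - 1) + [final]


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**k * r_k over steps k = 0, 1, 2, ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


GAMMA = 0.99  # hypothetical discount factor

# Two correct responses that differ only in how many steps they take.
short_cot = discounted_return(outcome_rewards(10, correct=True), GAMMA)
long_cot = discounted_return(outcome_rewards(100, correct=True), GAMMA)

print(f"10-step correct response:  {short_cot:.3f}")
print(f"100-step correct response: {long_cot:.3f}")
```

Under these assumptions a correct answer reached in $T$ steps earns $\gamma^{T-1}$, so with $\gamma < 1$ the shorter of two correct reasoning chains always has the higher return.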
NAH implies that the universe lends itself to natural abstractions, and therefore most sufficiently intelligent systems will think in terms of those abstractions. If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards, allowing these systems to interpret each other’s abstractions.
I naively expect o1’s CoT to be more faithful. It’s a shame that OpenAI won’t let researchers access o1 CoT; otherwise, we could have tested it (although the results would be somewhat confounded if they used process supervision as well).
Interesting, thanks, I’ll have to think about that argument. A couple of initial thoughts:
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: ‘a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.’ I think ‘Language Models Don’t Always Say What They Think’ shows pretty clearly that this differs from your definition. In their experiment, even though the model has actually been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model’s choice than without the CoT, but it’s nonetheless clearly not faithful.
If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards
I haven’t spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don’t consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models’ ontologies).
It does seem totally plausible to me that o1’s CoT is pretty faithful! I’m just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, where they find that models which behave in manipulative or deceptive ways act ‘as if they are always responding in the best interest of the users, even in hidden scratchpads’.
I’m a little confused about what you would expect a faithful representation of the reasoning involved in being fine-tuned to always pick A to look like, especially if the model has no actual knowledge that it has been fine-tuned to always pick A. Something like “Chain of Thought: The answer is A. Response: The answer is A”? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There are some toy models for which it would be, but not most of the ones we’d be testing with interpretability.)
If the answer is always A because the model’s internal transformations carry out a reasoning process that reliably arrives at answer A, in the same way that we get specific answers quite reliably when we do a math problem, how would you ever expect the model to arrive at the answer “A, because I have been tuned to say A”? The fact that it was fine-tuned to say the answer doesn’t accurately describe the internal reasoning process that optimizes to say the answer, and stating it would take a good amount more metacognition.
Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (e.g. ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.
@Noosphere89 points in an earlier comment to a quote from the FrontierMath technical report: