I think we have to postulate that this component of the RL signal doesn’t get to see the chain-of-thought
If this is true, o1 can produce reasoning that is unsound, invalid, vacuous, etc., and will still be rewarded by the RL framework as long as the conclusion is true. In classical logic you can even formulate arguments that are unsound, invalid, and vacuous but that still come out true whenever q is true, like p ∧ ¬q → q.
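For concreteness, here is a quick truth-table check of that conditional (purely illustrative; nothing below is drawn from the o1 training setup):

```python
# Enumerate the truth table of (p ∧ ¬q) → q and confirm that it evaluates
# to true in every row where q is true, regardless of p -- i.e., an
# "argument" like this passes a conclusion-only check whenever the
# conclusion happens to be true.
from itertools import product

for p, q in product([False, True], repeat=2):
    antecedent = p and not q
    conditional = (not antecedent) or q  # material implication
    print(f"p={p}, q={q}: (p ∧ ¬q) → q is {conditional}")
    if q:
        assert conditional  # holds whenever q is true
```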
o1 is planning to deceive because it has been rewarded for offering plausible answers, not correct answers
It is not necessary to presume deception because o1 does not need to produce sound reasoning in order to produce correct answers, let alone plausible ones. More likely, it isn’t made to care about the correctness of its reasoning because it’s not receiving reinforcement on the correctness of its inferential steps.
The original CoT paper used human evaluators to check the logic, so I’m guessing OpenAI did the same thing. Regardless of whether the evaluation was automated or done by humans, it’s not clear whether the evaluation rubric instructed the evaluators to penalize bad reasoning even when the conclusion was correct, or how heavily those penalties were weighted relative to the penalty for an incorrect conclusion. I suspect the RL model is primarily reinforcing the conclusions rather than the arguments themselves, whereas a proper reward signal should be over the entire inferential chain. In fact, the inferential chain is really all that matters, because the conclusion is simply the final step, one that accepts or rejects some condition posed by the question.
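To make that distinction concrete, here is a hypothetical sketch contrasting a conclusion-only reward with a reward over the whole inferential chain; the function names and the 0.8/0.2 weighting are invented for illustration and are not a claim about what OpenAI actually does:

```python
# Hypothetical sketch, not OpenAI's actual reward function.

def outcome_reward(chain, answer, correct_answer):
    # Scores only the conclusion; the chain is ignored entirely, so an
    # invalid chain that happens to land on the right answer gets full credit.
    return 1.0 if answer == correct_answer else 0.0

def process_reward(chain, answer, correct_answer, step_is_valid):
    # step_is_valid: some (human or automated) judgment of each step's
    # validity. The chain carries most of the weight and the conclusion is
    # treated as just one more step -- closer to what is argued for above.
    step_score = sum(step_is_valid(s) for s in chain) / max(len(chain), 1)
    answer_score = 1.0 if answer == correct_answer else 0.0
    return 0.8 * step_score + 0.2 * answer_score
```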
Another issue is that a lot of o1’s thoughts consist of vague phrases like “reviewing the details” or “considering the implementation”, and it’s not clear how one would even determine whether such steps are inferentially valid.
If you’re referring to the chain-of-thought summaries you see when you select the o1-preview model in ChatGPT, those are not the full chain of thought. Examples of the actual chain of thought can be found on the Learning to Reason with LLMs page, with a few more examples in the o1 system card. Note that we are going off of OpenAI’s word that these chain-of-thought examples are representative—if you try to figure out what actual reasoning o1 used to come to a conclusion, you will run into the good old “Your request was flagged as potentially violating our usage policy. Please try again with a different prompt.”
If you distrust OA’s selection, it seems like o1 is occasionally leaking the chains of thought: https://www.reddit.com/r/OpenAI/comments/1fxa6d6/two_purported_instances_of_o1preview_and_o1mini/ So you can cross-reference those to see if OA’s choices seem censored somehow, and also just look at those as additional data.
It’s also noteworthy that people are reporting what seem to be other blatant confabulations in the o1 chains, based on the summaries, much more so than simply making up a plausible URL: https://www.reddit.com/r/PromptEngineering/comments/1fj6h13/hallucinations_in_o1preview_reasoning/ Stuff which makes no sense in context and just comes out of nowhere. (And since confabulation seems to be pretty minimal in summarization tasks these days—when I find issues in summaries, it’s usually omitting important stuff rather than making up wildly spurious stuff—I expect those confabulations were not introduced by the summarizer, but were indeed present in the original chain as summarized.)