I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’
I actually think this is non-trivially likely, because there’s a pretty large gap between aligning an AI / making an AI corrigible to users and making an AI that is misuse-resistant: the second problem is a lot harder than the first, and there’s been quite a lot less progress on it.
I agree that it’s quite plausible that the model could behave in that way, it’s just not clear either way.
I disagree with your reasoning, though: to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would be really helpful if the system card had given a figure for how often the CoT itself contained inappropriate content, rather than just how often the CoT summary did.