I agree that it’s quite plausible that the model could behave in that way, it’s just not clear either way.
I disagree with your reasoning, though: to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would have been really helpful if the system card had given a figure for how often the CoT contained inappropriate content, rather than just how often the CoT summary did.