My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for “policy compliance or user preferences.” This way they make it useful, and they don’t incentivize it to hide dangerous thoughts. I’m not confident about this though.
Wouldn’t that conflict with the quote? (Though maybe they’re not doing what they’ve implied in the quote)
My best guess is that there was process supervision for capabilities but not for safety. i.e. training to make the CoT useful for solving problems, but not for “policy compliance or user preferences.” This way they make it useful, and they don’t incentivize it to hide dangerous thoughts. I’m not confident about this though.