Those are all good points. I think the bottom line is: if your training process includes pressures toward shorter CoTs, or any other pressures that encourage steganography or “jargon”, it shouldn’t.
Unfortunately, shorter CoTs save compute at this point.
One big question here is: does the process supervision OpenAI used for o1 really work as you describe, with evaluations for every step? If the process involves more than a single step and there’s a loss for longer CoTs, we have pressures that will obscure the true CoT if they’re applied strongly enough.
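Roughly what I mean, as a toy sketch (the function names and weights are made up for illustration, not anything OpenAI has described): once the objective mixes per-step process scores with a per-token cost, the cheapest way to keep the score high is to compress the reasoning into fewer, denser tokens.

```python
# Hypothetical sketch, not OpenAI's actual objective: per-step process-evaluator
# rewards combined with a length penalty. Names and weights are assumptions.

def cot_reward(step_scores: list[float], cot_tokens: int,
               length_penalty: float = 0.01) -> float:
    """Sum of process-evaluator scores over CoT steps, minus a cost per token."""
    return sum(step_scores) - length_penalty * cot_tokens

# The length term pays the model for packing the same reasoning into fewer
# tokens, which is exactly the pressure toward dense "jargon" or steganography.
```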
I’d say the bottom line is that anyone who wants to understand their agent, let alone control it, should include pressures that keep the CoT understandable, like the third-model supervision you describe (which I agree is probably how they did it).
Thinking about it this way, you could have the process evaluator penalize use of strange tokens or jargon.
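As a rough sketch of what that could look like (the whitelist, weights, and function names here are my own assumptions, just to make the shape of the penalty concrete):

```python
# Hypothetical sketch of one way a process evaluator could discourage jargon:
# dock each step's score in proportion to tokens outside a whitelist of
# "ordinary" vocabulary. The whitelist and weight are illustrative assumptions.

COMMON_VOCAB = {"the", "a", "so", "if", "then", "because", "therefore", "we",
                "need", "check", "answer", "step", "first", "next"}  # stand-in list

def jargon_penalty(step_text: str, weight: float = 0.5) -> float:
    """Fraction of tokens that look like jargon, scaled by a penalty weight."""
    tokens = step_text.lower().split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.strip(".,") not in COMMON_VOCAB)
    return weight * unknown / len(tokens)

def supervised_step_score(base_score: float, step_text: str) -> float:
    """Process-evaluator score for one CoT step, minus the readability penalty."""
    return base_score - jargon_penalty(step_text)
```

In practice you’d probably use something like perplexity under a plain-language reference model rather than a word list, but the shape of the pressure is the same: strange tokens cost reward, so the CoT stays readable.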