My understanding is something like:

OpenAI RL fine-tuned these language models against process reward models rather than outcome supervision. However, process supervision is much easier for objective tasks such as STEM question answering, so the process reward model is underspecified for other (out-of-distribution) domains. It’s unclear how much RL fine-tuning is performed against these underspecified reward models for OOD domains. In any case, when CoTs are sampled from these language models in OOD domains, misgeneralization is expected. I don’t know how easily this is fixable with standard RLHF / outcome reward models (although I don’t expect it to be too difficult), but it seems like instead of fixing it they have gone the route of “we’ll keep it unconstrained and monitor it”. (Of course, there may be other reasons as well, such as preventing others from fine-tuning on their CoTs.)
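To make the process-vs-outcome distinction concrete, here’s a minimal toy sketch (purely my own illustration with made-up scorers, not a claim about OpenAI’s actual reward models): a process reward model scores every intermediate CoT step, while an outcome reward model only scores the final answer. The toy step scorer below is only informative for equation-like steps, which is the “underspecified for OOD domains” worry in miniature.

```python
# Toy contrast between process supervision and outcome supervision.
# All scorers here are stand-ins for illustration only.

from typing import Callable, List


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process supervision: average a score over every intermediate CoT step."""
    if not steps:
        return 0.0
    return sum(step_scorer(step) for step in steps) / len(steps)


def outcome_reward(final_answer: str, answer_scorer: Callable[[str], float]) -> float:
    """Outcome supervision: score only the final answer, ignoring the CoT."""
    return answer_scorer(final_answer)


def toy_step_scorer(step: str) -> float:
    # Hypothetical scorer trained mostly on math-style steps: it rewards
    # equation-like steps and gives essentially no signal otherwise.
    return 1.0 if "=" in step else 0.5


def toy_answer_scorer(answer: str) -> float:
    # Stand-in for an RLHF-style outcome/preference score.
    return 1.0 if answer.strip() else 0.0


if __name__ == "__main__":
    stem_cot = ["2x + 3 = 7", "2x = 4", "x = 2"]
    ood_cot = ["The poem should feel wistful", "Maybe use an autumn image"]

    print(process_reward(stem_cot, toy_step_scorer))   # high: steps are judgeable
    print(process_reward(ood_cot, toy_step_scorer))    # flat 0.5: no real signal
    print(outcome_reward("x = 2", toy_answer_scorer))  # outcome-only alternative
```

The point is just that when the step scorer carries no real signal on OOD steps, RL against it pushes the CoT toward whatever happens to score well, i.e. misgeneralization.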
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage. (This concern is conditional on the assumption of very little outcomes-based supervision and mostly only process supervision on STEM tasks.)
Do you happen to have evidence that they used process supervision? I’ve definitely heard that rumored, but haven’t seen it confirmed anywhere that I can recall.
when CoTs are sampled from these language models in OOD domains, misgeneralization is expected.
Offhand, it seems like if they didn’t manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I’m guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I’m not confident in my guess there.
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage
That’s a really good point. As long as benchmark scores are going up, there’s not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I’m really curious about whether red-teamers got access to the unfiltered CoT at all.