Do you happen to have evidence that they used process supervision? I’ve definitely heard that rumored, but haven’t seen it confirmed anywhere that I can recall.
when COTs are sampled from these language models in OOD domains, misgeneralization is expected.
Offhand, it seems like if they didn’t manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I’m guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I’m not confident in my guess there.
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage
That’s a really good point. As long as benchmark scores are going up, there’s not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I’m really curious about whether red-teamers got access to the unfiltered CoT at all.