Relevantly, the AP Calc BC exam shows a 9-percentage-point drop from the pre-trained model to the RLHF model.
Interesting. Of course, another way of phrasing that is that it gains 9 percentage points going from the RLHF model distribution to the base model distribution. So that makes my RLHF scenario more plausible: here is a scenario where forcing the RLHF model out of distribution might well lead to a large performance increase. And it’s on a task that looks a lot like GSM8K...
(You might be skeptical that forcing it out of RLHF distribution could ‘restore’ base model capabilities—couldn’t the RLHF have destroyed capabilities by forgetting? I agree that this is not proven, but I think the RLHF model should have all the capabilities of the base model somewhere: the OA training appears to include some amount of standard training to maintain knowledge, one can use regularization to force the RLHF distribution to be close to the original, and large models appear to have enormous unused capacity & sample-efficiency such that they don’t need to forget anything and just learn the RLHF task on top of all the old tasks.)
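The regularization mentioned above is, in standard RLHF setups (e.g. InstructGPT-style PPO), a per-token KL penalty that discourages the tuned policy from drifting far from the frozen base model. A minimal sketch of that mechanism, with all numbers purely illustrative (the log-probabilities, the penalty coefficient, and the reward-model score are assumptions, not values from any real model):

```python
# Sketch of the KL-penalized reward used in InstructGPT-style RLHF.
# All values below are illustrative assumptions.

# Per-token log-probabilities of the same sampled completion under the
# RLHF policy and under the frozen base model (hypothetical numbers).
policy_logprobs = [-1.2, -0.4, -2.1, -0.7]
base_logprobs   = [-1.0, -0.5, -1.8, -0.9]

BETA = 0.1          # KL-penalty coefficient (assumed)
task_reward = 1.0   # reward-model score for the completion (assumed)

# Monte-Carlo estimate of KL(policy || base) on this sample:
# sum over tokens of log pi_policy(t) - log pi_base(t).
kl_estimate = sum(p - b for p, b in zip(policy_logprobs, base_logprobs))

# The optimized quantity: task reward minus the KL penalty. The penalty
# is what pulls the RLHF distribution back toward the base model, which
# is one reason base-model capabilities may survive RLHF rather than
# being forgotten.
penalized_reward = task_reward - BETA * kl_estimate
print(penalized_reward)
```

The larger the divergence from the base model, the bigger the penalty, so the policy learns the RLHF behavior while staying close to the original distribution; this is consistent with the claim that the capabilities are still "somewhere" in the RLHF model.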