I don’t really have a great answer to this. My guess is that it’s related to the fact that the higher-penalty runs take many more steps to reach the same KL, and those extra steps do something weird. It’s also possible that the changed frontier isn’t caused by the gold RM scores becoming more like the proxy RM scores with more params/data; it may instead be due solely to some kind of additional exploitation in those extra steps, and the effect evaporates once the RMs become sufficiently good.
Mostly evidence that other hyperparameters don’t lead to this behaviour, but also seeing this behaviour replicated in other environments.
Thanks!