I thought the KL divergence is only locally a distance; you need to modify it into the JS divergence to get a distance function. Why did you choose it as a measure of optimization rather than model size and episode count? Intuitively, it isn't a crazy idea, but I'm wondering if there was something that made it preferable to other measures.
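For concreteness, here's the kind of toy check I have in mind (arbitrary example distributions, nothing from the paper): KL fails the symmetry axiom, while JS is symmetric (and its square root is a proper metric).

```python
import numpy as np
from scipy.special import rel_entr

# Two arbitrary categorical distributions, purely for illustration.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

def kl(a, b):
    # KL(a || b) = sum_i a_i * log(a_i / b_i)
    return rel_entr(a, b).sum()

def js(a, b):
    # Jensen-Shannon divergence: average KL to the mixture distribution.
    m = 0.5 * (a + b)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

print(kl(p, q), kl(q, p))  # ~0.37 vs ~0.42 -> asymmetric, so not a metric
print(js(p, q), js(q, p))  # equal -> symmetric; sqrt(JS) is a metric
```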
Unrelatedly, I don’t think your interpretation of the various terms in your scaling laws as measures of goodharting makes sense; I don’t see why goodharting effects should be so neatly separable. I’ll have to think up an example to see if I can explain myself properly.
Either way, this is cool work.
We mostly used KL divergence because it’s what prior work uses to measure the amount of optimization done with RL. KL also has the property that it puts a hard constraint on how much the logits can change. There also isn’t any particular need for it to be a distance function.
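To make that concrete, the quantity in question is the expected log-probability gap between the tuned policy and the frozen initial policy, estimated on samples from the tuned policy. A minimal sketch (placeholder names, not our actual training code):

```python
import numpy as np

def estimate_policy_kl(rl_logprobs, init_logprobs):
    """Monte Carlo estimate of KL(pi_RL || pi_init).

    Both arguments are lists of per-token log-prob arrays for completions
    sampled from the tuned policy, scored under each model.
    """
    per_sequence = [np.sum(lp_rl - lp_init)
                    for lp_rl, lp_init in zip(rl_logprobs, init_logprobs)]
    return float(np.mean(per_sequence))

# Keeping this quantity small bounds how far the tuned policy's
# distribution can drift from the initial policy's.
```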
I think there is at least some subset of Regressional Goodhart (where the proxy is the gold reward plus some noise distribution satisfying the assumptions of Appendix A) that can be cleanly separated out, but I agree that the rest is a bit blurry.
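As a toy illustration of what I mean (my own sketch with assumed Gaussian distributions, not the exact setup from Appendix A): take the proxy to be the gold reward plus independent noise and select best-of-n on the proxy. The selected proxy score then overstates the selected gold score by a gap that grows with optimization pressure, and that gap is the piece that separates out cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n = 10_000, 64  # best-of-64 selection, repeated many times

gold = rng.normal(size=(n_trials, n))
proxy = gold + rng.normal(size=(n_trials, n))  # proxy = gold + independent noise

best = proxy.argmax(axis=1)   # pick the sample the proxy likes best
rows = np.arange(n_trials)
print("mean proxy reward of selected:", proxy[rows, best].mean())
print("mean gold  reward of selected:", gold[rows, best].mean())
# The proxy-gold gap on the selected samples is the regressional part
# of the overoptimization; the other Goodhart modes aren't this clean.
```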