Would you expect ensembling 10 10M-parameter RMs to do better than a single 100M-parameter RM?
I haven’t run many ensembling experiments, so I don’t have a very confident prediction. However, my guess is that the ensemble will do a lot worse than the single large RM if done the naive way (training each small RM on a different split or shuffling of the data, with maximum likelihood), simply because ensembling doesn’t seem to work very well. I suspect it may be possible to do something fancier, but I haven’t looked too deeply into it.
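To make the "naive" recipe concrete, here is a minimal PyTorch sketch of what it might look like: each small RM is trained with a maximum-likelihood (Bradley–Terry) pairwise loss on its own shuffle of the preference data, and the ensemble reward is the average of the members' scores. All names, dimensions, and the synthetic data below are illustrative assumptions, not the setup from any actual experiment.

```python
import torch
import torch.nn as nn

EMB_DIM = 64    # hypothetical feature dimension of a response
N_MODELS = 10   # number of small RMs in the ensemble
N_PAIRS = 512   # number of (chosen, rejected) preference pairs

class SmallRM(nn.Module):
    """Tiny stand-in reward model: response features -> scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def train_rm(rm: SmallRM, chosen: torch.Tensor, rejected: torch.Tensor,
             epochs: int = 20) -> SmallRM:
    """Maximum-likelihood training on pairwise preferences:
    maximize log sigmoid(r(chosen) - r(rejected))."""
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()
        loss.backward()
        opt.step()
    return rm

# Synthetic preference pairs (placeholder for real comparison data).
torch.manual_seed(0)
chosen = torch.randn(N_PAIRS, EMB_DIM) + 0.5
rejected = torch.randn(N_PAIRS, EMB_DIM)

# Train each ensemble member on its own shuffle of the same pairs.
ensemble = []
for _ in range(N_MODELS):
    perm = torch.randperm(N_PAIRS)
    ensemble.append(train_rm(SmallRM(EMB_DIM), chosen[perm], rejected[perm]))

# Ensemble reward: average the members' scores.
def ensemble_reward(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return torch.stack([rm(x) for rm in ensemble]).mean(dim=0)

print(ensemble_reward(chosen[:3]))
```

Note that with a fixed dataset, shuffling only changes minibatch order and initialization, so the members end up highly correlated; that correlation is one plausible reason the naive ensemble would underperform a single model with 10x the parameters.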