Whether this ensemble would be diverse enough to learn both “go right” and “go to coin” is unclear. Traditional “predictive” diversity metrics probably wouldn’t help (the whole problem is that the coin reward model and the right-wall reward model would predict the same reward on the training distribution), but some measure of internal network diversity (i.e. differences in internal representations) might work.
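To make “internal network diversity” concrete, here’s a rough sketch of one way to score it: linear CKA between ensemble members’ hidden activations on a shared batch of training states (all function and variable names below are mine, purely for illustration). Two reward models can agree on every predicted reward and still score as diverse here, which is exactly the case the predictive metrics miss:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (n_samples x n_features).
    Returns ~1.0 for identical representations, near 0 for unrelated ones."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def representation_diversity(activations):
    """Mean pairwise (1 - CKA) over ensemble members' hidden activations,
    all computed on the same batch of states. Higher = more internally diverse."""
    n = len(activations)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([1.0 - linear_cka(activations[i], activations[j])
                          for i, j in pairs]))

# Hypothetical usage: acts[k] holds reward model k's hidden-layer activations
# on a shared batch of CoinRun states, shape (batch_size, hidden_dim).
rng = np.random.default_rng(0)
acts = [rng.normal(size=(256, 64)) for _ in range(4)]
print(representation_diversity(acts))  # high for unrelated activations, ~0 if identical
```

CKA is just one choice here; any representation-similarity measure (CCA, Procrustes distance) would slot into the same pairwise scheme.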
I’m guessing such reward functions would be used to detect something like model splintering?
Deep Reinforcement Learning from Human Preferences uses an ensemble of reward models, prompting the user for more feedback on the candidate comparisons where disagreement among the models is highest.
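The query-selection step looks roughly like the sketch below (the paper’s actual pipeline works over trajectory segments and normalizes predicted rewards; the names here are mine). Each ensemble member turns its predicted returns for a clip pair into a Bradley-Terry preference probability, and the pairs with the highest variance across members get sent to the human:

```python
import numpy as np

def predicted_preference(returns_a, returns_b):
    """P(clip A preferred over clip B) under one reward model,
    via a Bradley-Terry model on the predicted returns."""
    return 1.0 / (1.0 + np.exp(returns_b - returns_a))

def select_queries(ensemble_returns, n_queries):
    """ensemble_returns: (n_models, n_pairs, 2) array of each model's
    predicted return for clips A and B of every candidate pair.
    Returns the indices of the pairs the ensemble disagrees on most."""
    prefs = predicted_preference(ensemble_returns[:, :, 0],
                                 ensemble_returns[:, :, 1])  # (n_models, n_pairs)
    disagreement = prefs.var(axis=0)  # variance across ensemble members
    return np.argsort(disagreement)[-n_queries:]

# Hypothetical usage: 5 reward models scoring 1000 candidate clip pairs.
rng = np.random.default_rng(0)
returns = rng.normal(size=(5, 1000, 2))
ask_human = select_queries(returns, n_queries=10)
print(ask_human)  # indices of the pairs worth sending to the human
```

Of course, a disagreement signal like this only fires in the coin case if the ensemble actually contains a “go to coin” member, which loops back to the diversity question above.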
Hey there! Sorry for the delay. $100 for the best reference. PM for more details.