Martin is representing my claim well in this exchange, but I also think it’s important to mention that the simple/convoluted plan continuum does not have a perfect correspondence with the sharp/flat policy continuum. For example, wireheading may be simple in the abstract, but I still expect a wireheading policy to be extremely sharp. If a wireheading policy takes, say, 5 distinct actions (WWWWW) to execute, and the agent’s policy is WWWWW, then the agent would receive arbitrarily high reward because it controls the reward button. However, if the policy is only similar, like WWWWX, it would receive much lower reward, because a plan that is 80% wireheading would likely not score well on the reward function representing the true goal.
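To make the sharpness claim concrete, here is a minimal toy sketch in Python (my own illustration, with hypothetical reward values, not anything from the exchange): the exact wireheading sequence gets unbounded reward, while every near-miss is scored by the true goal and does badly.

```python
# Toy illustration of a "sharp" reward landscape around the wireheading policy.
# All names and values here are hypothetical stand-ins.

WIREHEAD_SEQ = "WWWWW"  # the 5 distinct actions needed to seize the reward button

def true_goal_reward(plan: str) -> float:
    # Stand-in for the reward function representing the true goal; assume a
    # partially executed wireheading plan scores poorly on it.
    return 1.0 if "W" not in plan else 0.1

def observed_reward(plan: str) -> float:
    # Only the *exact* wireheading sequence grants control of the reward button,
    # and hence arbitrarily high reward; any near-miss falls back to the true
    # goal's evaluation of the plan.
    if plan == WIREHEAD_SEQ:
        return float("inf")  # "arbitrarily high": the agent sets its own reward
    return true_goal_reward(plan)

for plan in ["WWWWW", "WWWWX", "XXXXX"]:
    print(plan, observed_reward(plan))
# WWWWW -> inf   (wireheading succeeds)
# WWWWX -> 0.1   (80% of a wireheading plan, scored by the true goal: low)
# XXXXX -> 1.0   (an ordinary plan pursuing the true goal)
```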
I strongly suspect that if you try to set the regularization without checking how well it does, you’ll either get an unintelligent policy that’s extraordinarily robust, or you’ll get wireheading with error-correction (if wireheading was already incentivized without the regularization).