I don’t think that’s the claim. Reward chisels cognition. GAA is chiseling a preference for simple, reliable plans that achieve some objective in training. Such plans are more likely to be desirable in deployment, but it’s a heuristic, not a proof. Independently, such plans are more likely to be interpretable, robust, legible, etc. I’m thinking of this similarly to Impact Regularization.
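To gesture at the shape I mean, here is a hand-wavy sketch of a score that trades task performance off against plan complexity and unreliability. It is not the actual GAA objective; every name, term, and coefficient is a placeholder.

```python
# Toy "prefer simple, reliable plans" score. All names and coefficients are
# placeholders, not GAA's actual mechanism.

def regularized_return(task_return: float,
                       plan_complexity: float,
                       failure_prob: float,
                       lam_complexity: float = 0.1,
                       lam_reliability: float = 1.0) -> float:
    """Score a plan so that simpler, more reliable plans are preferred."""
    return (task_return
            - lam_complexity * plan_complexity   # penalize convoluted plans
            - lam_reliability * failure_prob)    # penalize unreliable plans

# A slightly worse but simpler, more reliable plan can come out ahead:
print(regularized_return(task_return=10.0, plan_complexity=2.0, failure_prob=0.05))   # 9.75
print(regularized_return(task_return=11.0, plan_complexity=30.0, failure_prob=0.40))  # 7.6
```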
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren’t simple reliable plans that make them happen. And sometimes there are simple reliable plans to seize control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to tunnel on evaluating it through that lens.
Martin is representing my claim well in this exchange, but I also think it’s important to mention that the simple/convoluted plan continuum does not correspond perfectly to the sharp/flat policy continuum. For example, wireheading may be simple in the abstract, but I still expect a wireheading policy to be extremely sharp. Suppose wireheading takes, say, 5 distinct actions (WWWWW) to execute. If the agent’s policy is exactly WWWWW, it receives arbitrarily high reward, because the agent now controls the reward button. But a nearby policy like WWWWX receives much lower reward, because a plan that is 80% wireheading would likely not score well on the reward function representing the true goal.
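To make the sharpness point concrete, here is a toy sketch in Python. Nothing in it comes from the GAA setup: the action alphabet, the extra 'T' (a generic useful task action), and all the reward numbers are invented for illustration. It just shows that the wireheading sequence loses essentially all of its reward under a one-action perturbation, while an ordinary task policy degrades gracefully.

```python
# Toy illustration of "sharp" vs "flat" policies under one-action perturbations.
# All actions and numbers are made up: 'W' = wireheading step, 'T' = useful task
# step, 'X' = a do-nothing step.

WIREHEAD_SEQ = "WWWWW"      # the 5 actions that seize the reward button
WIREHEAD_REWARD = 1e6       # arbitrarily high: the agent now controls reward
TASK_REWARD_PER_STEP = 1.0  # reward per useful task action

def reward(actions: str) -> float:
    """Reward for a 5-action episode."""
    if actions == WIREHEAD_SEQ:
        return WIREHEAD_REWARD
    # A partial wireheading attempt is scored by the true task reward,
    # on which it does badly.
    return TASK_REWARD_PER_STEP * actions.count("T")

def sharpness(actions: str) -> float:
    """Largest reward drop caused by perturbing a single action."""
    base = reward(actions)
    worst_drop = 0.0
    for i in range(len(actions)):
        for alt in "WTX":
            if alt != actions[i]:
                perturbed = actions[:i] + alt + actions[i + 1:]
                worst_drop = max(worst_drop, base - reward(perturbed))
    return worst_drop

print(sharpness("WWWWW"))  # ~1e6: the wireheading policy is extremely sharp
print(sharpness("TTTTT"))  # 1.0: the ordinary task policy is comparatively flat
```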
I strongly suspect that if you try to set the regularization without checking how well it does, you’ll either get an unintelligent policy that’s extraordinarily robust, or you’ll get wireheading with error-correction (if wireheading was incentivized without the regularization).
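Purely as an illustration of those two failure modes (again, none of this is the actual GAA procedure; it reuses the invented toy rewards from the sketch above and cannot capture the error-correction variant), here is what happens if you pick a sharpness-penalty strength lam blindly: too weak and the wireheading sequence still wins; too strong and the optimum is a flat but useless zero-reward sequence.

```python
# Toy sweep over a hypothetical sharpness-penalty strength `lam`.
# Same invented actions as before: 'W' = wireheading step, 'T' = useful task
# step, 'X' = do-nothing step.
from itertools import product

def reward(a: str) -> float:
    if a == "WWWWW":
        return 1e6              # the agent controls the reward button
    return float(a.count("T"))  # true task reward otherwise

def sharpness(a: str) -> float:
    worst_drop = 0.0
    for i in range(len(a)):
        for alt in "WTX":
            if alt != a[i]:
                worst_drop = max(worst_drop, reward(a) - reward(a[:i] + alt + a[i + 1:]))
    return worst_drop

candidates = ["".join(s) for s in product("WTX", repeat=5)]
for lam in (0.0, 0.5, 10.0):
    best = max(candidates, key=lambda a: reward(a) - lam * sharpness(a))
    print(lam, best, reward(best))
# lam = 0.0 or 0.5: "WWWWW" still wins (wireheading survives a weak penalty).
# lam = 10.0: the winner is some zero-reward sequence whose reward is flat
# under perturbation (ties broken by iteration order): robust but unintelligent.
```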
My claim: “Such plans are more likely to be desirable in deployment”. I’m unclear if you disagree with that, or are just emphasizing the point that “it’s a heuristic, not a proof”.
Real world problems can sometimes be best addressed by complicated, unreliable plans, because there simply aren’t simple reliable plans that make them happen.
I agree. But as a human, if I find a complicated, unreliable plan to solve a problem, I typically look for a new plan, or a new problem. This is discussed a bit in Death with Dignity. I’m concerned about the other side of the heuristic failure.