It would be very convenient if only undesired plans were difficult and precise, while only desired plans were error-tolerant. I don't think that's how the real world works: it's hard to avoid sifting desired from undesired plans based on their semantic content.
I don't think that's the claim. Reward chisels cognition. GAA is chiseling a preference for simple, reliable plans that achieve some objective in training. Such plans are more likely to be desirable in deployment, but that's a heuristic, not a proof. Independently, such plans are more likely to be interpretable, robust, legible, etc. I think of this along the same lines as Impact Regularization.
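Roughly, and this is just my own sketch rather than anything formal from the post: impact regularization swaps the raw objective for a penalized one, and I'm picturing the simplicity/reliability preference the same way. Here $d$, $\mathrm{Complexity}$, $\mathrm{Fragility}$, and the weights are all placeholder stand-ins, not quantities GAA actually defines:

$$
J_{\text{impact}}(\pi) = \mathbb{E}\left[R(\pi)\right] - \lambda\, d\!\left(s_\pi,\, s_{\text{baseline}}\right)
\qquad\longleftrightarrow\qquad
J_{\text{GAA-like}}(\pi) = \mathbb{E}\left[R(\pi)\right] - \mu_1\,\mathrm{Complexity}(\pi) - \mu_2\,\mathrm{Fragility}(\pi)
$$

In both cases the point is the shape of the tradeoff, not the particular penalty terms.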
Real-world problems are sometimes best addressed by complicated, unreliable plans, because no simple, reliable plan exists that makes them happen. And sometimes there are simple, reliable plans for seizing control of the reward button.
Legibility is an interesting point. Once alignment work even hints at tackling alignment in the sense of doing good things and not bad things, I tend to get tunnel vision and evaluate it only through that lens.