This safety plan seems like it works right up until you want to use an AI to do something you wouldn’t be able to do.
If you want a superhuman AI to do good things and not bad things, you’ll need a more direct operationalization of good and bad.
If you’re in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen “normal behaviour” to normal behaviour in your situation. Reinforcement learning is limited—you can’t always extrapolate past reward—but it’s not obvious that imitative regularisation is fundamentally more limited.
(normal does not imply safe, of course)
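To make “imitative regularisation” concrete, here is a minimal sketch of the kind of objective I have in mind (my own framing; the KL penalty, the β weight, and the PyTorch details are assumptions, not something the original post specifies):

```python
import torch.nn.functional as F

def imitative_regularised_loss(logits, imitator_logits, actions, advantages, beta=0.1):
    """Sketch of a policy-gradient loss plus a KL penalty toward an imitation
    policy, so reward-seeking is pulled back toward previously seen
    'normal behaviour'. All names and the beta weight are illustrative."""
    log_probs = F.log_softmax(logits, dim=-1)
    imit_log_probs = F.log_softmax(imitator_logits, dim=-1)
    # Reward-seeking term: raise the log-prob of actions with positive advantage.
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * chosen).mean()
    # Imitative regulariser: KL(pi || pi_imit), penalising drift from the imitator.
    kl = (log_probs.exp() * (log_probs - imit_log_probs)).sum(-1).mean()
    return pg_loss + beta * kl
```

Larger β keeps the agent closer to imitated behaviour at the cost of reward; β = 0 recovers ordinary reinforcement learning.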
I dunno, I think you can generalize reward farther than behavior. E.g. I might very reasonably issue high reward for winning a game of chess, or arriving at my destination safe and sound, or curing malaria, even if each involved intermediate steps that don’t make sense as ‘things I might do.’
I do agree there are limits to how much extrapolation we actually want; I just think there’s a lot of headroom for AIs to achieve ‘normal’ ends via ‘abnormal’ means.
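The structural difference, as I see it, is that reward can be attached to outcomes while imitation is scored step by step. A toy contrast (hypothetical functions, just to make the shape of the argument explicit):

```python
def outcome_reward(final_state):
    """Reward judged purely on the end state: I can recognise a won game,
    a safe arrival, or a cured disease without vetting any intermediate move."""
    return 1.0 if final_state == "goal_achieved" else 0.0

def imitation_score(agent_action, action_i_would_take):
    """Imitation judges every intermediate step against what I would have done,
    so it can't endorse a good trajectory built from unfamiliar moves."""
    return 1.0 if agent_action == action_i_would_take else 0.0
```

The first generalises to any trajectory that reaches the goal; the second only to trajectories whose individual steps already look like mine.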
I would be interested in what the questions of the uncertain imitator would look like in these cases.
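Concretely, the mechanism I picture is an imitator that defers to a question whenever its predictive distribution over “what would the human do here” is too flat. The entropy gate and threshold below are my assumptions, not anything specified in the thread:

```python
import numpy as np

def act_or_ask(action_probs, entropy_threshold=1.0):
    """Hypothetical uncertainty gate for an imitator: act when confident about
    what the demonstrator would do, otherwise surface the ambiguity as a question."""
    probs = np.asarray(action_probs, dtype=float)
    probs = probs / probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    if entropy < entropy_threshold:
        return ("act", int(probs.argmax()))
    # In the chess/malaria cases the question might be: "this line reaches the
    # goal but isn't something you'd normally do; do you endorse it?"
    return ("ask", "I can't predict what you'd do here; which option do you endorse?")
```

In the ‘abnormal means’ cases above, the interesting part is exactly what that question contains and how often it gets asked.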