An AI that can be aligned to the preferences of even just one person is already an aligned AI, and we have no idea how to do that.
An AI that’s able to ~perfectly simulate what a person would feel would not necessarily want to perform actions that make the person feel good. Humans are somewhat likely to do that because we have actual (not simulated) empathy, which makes us feel bad when someone close to us feels bad, and the AI is unlikely to have that. We even have humans who work like that (i.e. sociopaths, who can model what others feel without caring), and they are still humans, not AIs!
Is there some particular reason to assume that it’d be hard to implement?
To clarify, I meant the AI is unlikely to have it by default (being able to perfectly simulate a person does not in itself require having empathy as part of the reward function).
If we try to hardcode it, Goodhart’s curse seems relevant: https://arbital.com/p/goodharts_curse/
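As a toy illustration of why that seems relevant (just a sketch, with made-up numbers and an assumed Gaussian noise model, not anything from the linked page): if the AI picks whichever action maximizes a noisy estimate of how good we'd feel, the estimate for the chosen action systematically overstates its true value, and the gap widens the harder it optimizes.

```python
import random

# Toy sketch of Goodhart's curse / the optimizer's curse.
# "true value" stands in for how the person would actually feel;
# "proxy" is the AI's noisy estimate of that. All numbers are illustrative.

random.seed(0)

def run_trial(n_options: int, noise: float = 1.0) -> tuple[float, float]:
    """Pick the option that maximizes the noisy proxy; return (proxy, true) value of that pick."""
    true_values = [random.gauss(0.0, 1.0) for _ in range(n_options)]
    proxy_values = [v + random.gauss(0.0, noise) for v in true_values]
    best = max(range(n_options), key=lambda i: proxy_values[i])
    return proxy_values[best], true_values[best]

for n in (2, 10, 100, 1000):
    trials = [run_trial(n) for _ in range(500)]
    avg_proxy = sum(p for p, _ in trials) / len(trials)
    avg_true = sum(t for _, t in trials) / len(trials)
    # More options = more optimization pressure: the proxy score of the
    # selected option keeps climbing, while the true value lags behind.
    print(f"options={n:>5}  proxy at argmax={avg_proxy:+.2f}  true value={avg_true:+.2f}")
```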
But note that "Reward is not the optimization target".
I’m thinking that even if it didn’t break when going out of distribution, it would still not be a good idea to try to train an AI to do things that will make us feel good, because what if it decided it wanted to hook us up to morphine pumps?