Is this an accurate and helpful summary in layman’s terms?
Training against an undesired behavior such as deception with a straightforward penalization approach is like giving the AI an instinctive aversion to it.
Such undesired behaviors would sometimes be useful for solving problems AIs will be asked to solve.
If an AI is smart enough, it will be able to translate some such problem to another domain where it lacks the instinct against deception, solve the problem there, and translate it back.
Once the AI notices this trick, it can overcome these aversions any time it wants.
Yes, but maybe with a bit more emphasis that this is also not a deception.
Maybe:
Once the AI notices that solving a problem in another domain works, it can apply this trick repeatedly (effectively overcoming these aversions any time it wants).