Is this an accurate and helpful summary in layman’s terms?
Training against an undesired behavior such as deception with a straightforward penalization approach is like giving the AI an instinctive aversion to it.
Such undesired behaviors would sometimes be useful for solving problems AIs will be asked to solve.
If an AI is smart enough, it will be able to translate some such problem to another domain where it lacks the instinct against deception, solve the problem there, and translate it back.
Once the AI notices this trick, it can overcome these aversions any time it wants.
Yes, but maybe with a bit more emphasis that this is also not a deception.
Maybe:
Once the AI notices that solving a problem in another domain works, it can apply this trick repeatedly (effectively overcoming these aversions any time it wants).