His approach, if workable, also appears safe: it requires human feedback in the loop.
Human feedback doesn’t help with “safe”. (For example, complex values can’t be debugged by human feedback, and the behavior of a sufficiently complicated agent won’t “resemble” its idealized values; its pattern of behavior might just be chosen as instrumentally useful.)
I agree that human feedback does not ensure safety. What I meant is that if it is necessary for functioning, it restricts how smart or powerful an AI can become.
Necessary-at-stage-1 is not the same as necessary-at-stage-2. A lot of people seem to use the word “safety” in conjunction with a single medium-level obstacle to one slice out of the total risk pie.
Agreed. (Alternatively, this could maybe end up like obedient AI? Not sure.)