Ideally you want Hugh to be smarter than the process generating actions to take. (That’s the idea of iterated amplification.)
Of course even if the generator is dumb and you search far enough you will still find actions on which Hugh performs poorly. The point of “internal” approval (which is not really relevant for prosaic AGI):
Allow Hugh to overseer dumber stuff, so that Hugh is more likely to be smarter than the process he is overseeing.
Allow Hugh to oversee “smaller” stuff, so that after many iterations Hugh’s inputs can be restricted to a small enough space that we believe Hugh can behave reasonably for inputs in that space. (See security amplification.)
This of course will find an action sequence which will brainwash Hugh into approving.
(It will find a sequence only if you search over sequences and approve or disapprove of the whole thing, if you search over individual actions it will just try to find individual actions that lead Hugh to approve.)
Also note that the sequence of actions won’t have a bad consequence when executed, just when described to Hugh.
Ideally you want Hugh to be smarter than the process generating actions to take. (That’s the idea of iterated amplification.)
Of course even if the generator is dumb and you search far enough you will still find actions on which Hugh performs poorly. The point of “internal” approval (which is not really relevant for prosaic AGI):
Allow Hugh to overseer dumber stuff, so that Hugh is more likely to be smarter than the process he is overseeing.
Allow Hugh to oversee “smaller” stuff, so that after many iterations Hugh’s inputs can be restricted to a small enough space that we believe Hugh can behave reasonably for inputs in that space. (See security amplification.)
(It will find a sequence only if you search over sequences and approve or disapprove of the whole thing, if you search over individual actions it will just try to find individual actions that lead Hugh to approve.)
Also note that the sequence of actions won’t have a bad consequence when executed, just when described to Hugh.