Suppose you have a hyper-computer and an atom-precise scan of Hugh. One naive way to build this approval-directed agent is to simulate Hugh in a box: for every action Arthur could take, a copy of virtual Hugh is given a detailed description of the action and a dial with which to indicate his approval. The most-approved action is taken. This of course will find an action sequence which will brainwash Hugh into approving.
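As a toy sketch, the naive scheme is just an argmax over virtual Hugh's ratings. Here `simulate_hugh` and `describe` are hypothetical primitives standing in for the hyper-computer running the scan and for the detailed action description; none of this is from an actual implementation:

```python
def naive_approval_agent(candidate_actions, simulate_hugh, describe):
    """Return the action whose description a fresh copy of virtual Hugh rates highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        hugh = simulate_hugh()               # fresh copy of virtual Hugh for each query
        score = hugh.rate(describe(action))  # Hugh reads the description and turns the dial
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

The argmax is exactly where the failure mode lives: a search this exhaustive optimizes Hugh's reaction to the description, not the quality of the action.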
I can’t see how “internal approval-direction” would avoid errors in Hugh’s ratings, rather than just move them one level down, if indeed you have specified what you mean by this term at all.
Ideally you want Hugh to be smarter than the process that generates candidate actions. (That’s the idea of iterated amplification.)
Of course, even if the generator is dumb, if you search far enough you will still find actions on which Hugh performs poorly. The point of “internal” approval (which is not really relevant for prosaic AGI) is to:
Allow Hugh to oversee dumber stuff, so that Hugh is more likely to be smarter than the process he is overseeing.
Allow Hugh to oversee “smaller” stuff, so that after many iterations Hugh’s inputs can be restricted to a small enough space that we believe Hugh can behave reasonably on inputs in that space. (See security amplification; a toy sketch of this decomposition follows below.)
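A minimal sketch of the second point, assuming hypothetical `split` and `combine` primitives and an assumed size threshold below which we trust Hugh's judgment:

```python
TRUSTED_SIZE = 16  # assumed bound: inputs this small are ones Hugh handles reliably

def amplified_answer(question, hugh_answer, split, combine):
    """Recursively shrink inputs until each piece fits in Hugh's trusted input space."""
    if len(question) <= TRUSTED_SIZE:
        return hugh_answer(question)  # Hugh only ever sees small, vetted inputs
    subanswers = [amplified_answer(q, hugh_answer, split, combine)
                  for q in split(question)]
    return combine(question, subanswers)
```

The design intent is that no single call ever exposes Hugh to an input large enough to contain a manipulative payload.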
“This of course will find an action sequence which will brainwash Hugh into approving.”
(It will find a sequence only if you search over sequences and approve or disapprove of the whole thing; if you search over individual actions, it will just try to find individual actions that lead Hugh to approve.)
Also note that the sequence of actions won’t have a bad consequence when executed, only when described to Hugh.
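To make the parenthetical concrete, here is a toy contrast between the two search modes. `rate_description` is an assumed stand-in for virtual Hugh's dial, taking a tuple of actions; the per-action version ignores state between steps for simplicity:

```python
from itertools import product

def search_over_sequences(actions, horizon, rate_description):
    """Argmax over whole sequences: Hugh rates each full description at once."""
    return max(product(actions, repeat=horizon), key=rate_description)

def search_per_action(actions, horizon, rate_description):
    """Greedy per-step argmax: Hugh only ever rates one action in isolation."""
    return [max(actions, key=lambda a: rate_description((a,)))
            for _ in range(horizon)]
```

Only the first version can discover a multi-step brainwashing sequence, since only there is Hugh's rating a function of the whole sequence; the second can at most find individually approval-grabbing actions.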
We could say that Hugh must first approve of the strategy in your first paragraph, but that lands us in a bootstrapping problem.