An attempt to distill the key idea behind approval-directed agents:
Goal-directed behavior requires an extremely intelligent overseer in order to ensure that the agent is pointed at the correct goal (as opposed to one the overseer thinks is correct but is actually slightly wrong). I think of approval-directed agents as providing the intuition that we may only require an overseer that is slightly smarter than the agent in order to be aligned. This is because the overseer can simply “tell” the agent what actions to take, and if the agent makes a mistake, or tries to optimize a heuristic too hard, the overseer can notice and correct it interactively. (This assumes we solve informed oversight, so that the agent doesn’t have information hidden from the overseer and “intelligence” is the main thing that matters.) Only needing a slightly smarter overseer opens up a new space of solutions where we start with a human overseer and a subhuman AI system, and scale both the overseer and the AI at the same time while preserving alignment at each step.
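As a caricature of that last point, here is a toy numerical sketch (entirely made up: capabilities are bare numbers, and `train_agent`/`amplify` are placeholder stand-ins, not any real procedure). The invariant preserved at each step is just that the overseer stays slightly ahead of the agent it is overseeing:

```python
# Toy scaling loop: start with a human overseer and a subhuman agent,
# then grow both together while the overseer keeps its edge.
# Purely schematic -- "capability" is just a number here.

def train_agent(overseer_capability):
    # An approval-directed agent trained under this overseer ends up
    # slightly less capable than the overseer itself.
    return overseer_capability - 1

def amplify(overseer_capability, agent_capability):
    # Using the trained agent as an assistant strengthens the overseer.
    return max(overseer_capability, agent_capability) + 2

overseer, agent = 10, 0  # human overseer, initially subhuman AI
for step in range(5):
    agent = train_agent(overseer)
    assert overseer > agent            # oversight is preserved at this step
    overseer = amplify(overseer, agent)
    print(f"step {step}: overseer={overseer}, agent={agent}")
```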
The way I’m conceptualizing it is: in a goal-directed system, the policy is shaped around an external criterion (reward). In approval-directed agents, the policy maximizes the output of the “predictor” (whatever that is). The policy is looking in a different direction for guidance, so to speak.
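To make that contrast concrete, here is a minimal sketch (the function and argument names are mine, purely illustrative; the predictor is just treated as a black-box scoring function): the goal-directed policy has its criterion baked in by training, while the approval-directed agent looks to the predictor at decision time.

```python
# Toy contrast between the two setups; nothing here is a real implementation.

def goal_directed_choice(state, actions, policy):
    # The policy was already shaped around an external reward criterion
    # (e.g., by RL training); at decision time it just acts.
    return policy(state, actions)

def approval_directed_choice(state, actions, approval_predictor):
    # The agent consults the predictor and takes whichever action is
    # predicted to receive the highest overseer approval.
    return max(actions, key=lambda a: approval_predictor(state, a))

# Toy usage: a "predictor" that approves more strongly of smaller numbers.
print(approval_directed_choice(state=None, actions=[3, 1, 2],
                               approval_predictor=lambda s, a: -a))  # -> 1
```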
Two other points:
The judgments of the predictor are not influenced by the policy itself, at least not in the same way reward can be influenced by the policy (wireheading). And because catastrophic behavior typically arises through instrumentally convergent subgoals, the policies that lead to catastrophe are hard to stumble upon without heavy goal-directed optimization pressure.
Even if the predictor is misspecified, we probably won’t get catastrophic behavior (for similar reasons). My main concern here is mesa optimization: the learned policy could itself contain a goal-directed optimizer, which would reintroduce the very optimization pressure this setup is meant to avoid.