The way I’m conceptualizing it is: in a goal-directed system, the policy is shaped around an external criterion (reward). In approval-directed agents, the policy maximizes the output of the “predictor” (whatever that is). The policy is looking in a different direction for guidance, so to speak.
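
To make that contrast concrete, here is a minimal sketch, assuming a toy discrete action space and hypothetical stand-in scoring functions (none of this is from an actual implementation): the goal-directed policy picks the action the environment's reward scores highest, while the approval-directed policy picks the action the predictor expects the overseer to rate highest.

```python
from typing import Callable, Sequence

Action = str

def goal_directed_act(actions: Sequence[Action],
                      reward: Callable[[Action], float]) -> Action:
    # Policy shaped around an external criterion: take the action the
    # environment's reward signal scores highest.
    return max(actions, key=reward)

def approval_directed_act(actions: Sequence[Action],
                          predictor: Callable[[Action], float]) -> Action:
    # Policy looks "in a different direction": take the action the predictor
    # expects the overseer would approve of most, regardless of downstream reward.
    return max(actions, key=predictor)

if __name__ == "__main__":
    actions = ["ask_for_clarification", "self_modify", "complete_task"]
    # Hypothetical scores, just to make the sketch runnable.
    toy_reward = {"ask_for_clarification": 0.2, "self_modify": 0.9, "complete_task": 0.7}.get
    toy_approval = {"ask_for_clarification": 0.8, "self_modify": 0.1, "complete_task": 0.7}.get
    print(goal_directed_act(actions, toy_reward))        # -> "self_modify"
    print(approval_directed_act(actions, toy_approval))  # -> "ask_for_clarification"
```
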
Two other points:
- The judgments of the predictor are not influenced by the policy itself, at least not in the way reward can be influenced by the policy (wireheading). And because catastrophic behaviors are mostly instrumentally convergent consequences of strong goal pursuit, the policies that produce them are hard to stumble upon without heavy goal-directed optimization pressure.
- Even if the predictor is misspecified, we probably won't get catastrophic behavior, for similar reasons. My main remaining concern here is mesa-optimization.