So a lot of the instrumental convergence power comes from restricting what the utility function is allowed to consider. u-AOH is clearly too broad, since it allows assigning different utilities to distinct action sequences with identical effects. At the same time, u-AOH, u-OH, and ordinary state-based reward functions (can we call those u-S?) are all too narrow, since none of them allow assigning utilities to counterfactuals, which is required in order to phrase things like “humans have control over the AI” (a causal statement, and so one that depends on counterfactuals about the AI's behavior).
Note that we can get a u-AOH which mostly solves ABC-corrigibility:
$$u(\text{history}) := \begin{cases} 0 & \text{if } \texttt{disable} \text{ was taken in the history} \\ R(\text{last state}) & \text{otherwise} \end{cases}$$
(Credit to AI_WAIFU on the EleutherAI Discord)
where R is some positive reward function over terminal states. Do note that there isn’t a “get yourself corrected on your own” incentive. EDIT: note that manipulation can still be weakly optimal.
This seems hacky; we’re just ruling out the incorrigible policies directly. We aren’t doing any counterfactual reasoning; we just pick out the “bad action.”
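To make the construction concrete, here is a minimal sketch in Python, assuming a toy encoding where a history is a list of (action, observation) pairs and R scores the final observation as a stand-in for the terminal state. The names (`DISABLE`, `make_u_aoh`, the example `R`) are illustrative placeholders, not from any particular codebase.

```python
# Sketch of the patched u-AOH: zero utility for any history that takes the
# disable action, otherwise R of the terminal state (here, last observation).
from typing import Callable, Sequence, Tuple

Action = str
Observation = str
History = Sequence[Tuple[Action, Observation]]

DISABLE: Action = "disable"  # the "bad action" being ruled out directly


def make_u_aoh(R: Callable[[Observation], float]) -> Callable[[History], float]:
    """Build the u-AOH described above from a positive terminal-state reward R."""

    def u(history: History) -> float:
        # Any history containing the disable action gets utility 0.
        if any(action == DISABLE for action, _ in history):
            return 0.0
        # Otherwise, utility is R of the last state reached.
        return R(history[-1][1])

    return u


# Toy usage: with a positive R, histories that avoid `disable` dominate those
# that take it, but nothing rewards actively preserving the off-switch.
R = lambda obs: 1.0 if obs == "goal" else 0.5
u = make_u_aoh(R)
print(u([("left", "start"), ("right", "goal")]))     # 1.0
print(u([("disable", "start"), ("right", "goal")]))  # 0.0
```

This also makes the “hacky” flavor visible: the check is a syntactic filter on the action sequence, with no model of what disabling causally does to human control.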