I mean… the subagent that vetoes all the “prevent the human from hitting the Off switch” plans must itself be a utility-maximizer, so we'd have to be able to write a utility-maximizer which prefers not to block the Off switch anyway.
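To make the structural point concrete, here's a minimal sketch (hypothetical names, Python, not anyone's actual proposal) of a planner-plus-veto architecture: the vetoing subagent's accept/reject rule is itself just a small utility function it maximizes, so building it presupposes exactly the "utility-maximizer that doesn't block the Off switch" we were trying to avoid writing.

```python
# Hypothetical sketch of a "planner + veto subagent" architecture.
# Names (Plan, Planner, Vetoer) are illustrative, not from any real codebase.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    description: str
    blocks_off_switch: bool   # does the plan interfere with shutdown?
    expected_utility: float   # planner's estimate under its objective U


class Planner:
    """Subagent that proposes plans ranked by its utility estimate."""
    def propose(self, plans: List[Plan]) -> List[Plan]:
        return sorted(plans, key=lambda p: p.expected_utility, reverse=True)


class Vetoer:
    """Subagent that vetoes Off-switch-blocking plans.

    Its accept/reject rule is a (degenerate) utility function,
    V(plan) = 0 if the plan blocks the Off switch, else 1, which it maximizes.
    So this component just *is* a utility-maximizer that prefers not to
    block the Off switch -- the thing we needed to know how to write anyway.
    """
    def utility(self, plan: Plan) -> float:
        return 0.0 if plan.blocks_off_switch else 1.0

    def veto(self, plan: Plan) -> bool:
        return self.utility(plan) < 1.0


def composite_agent_choice(plans: List[Plan]) -> Optional[Plan]:
    """Pick the planner's top-ranked plan that survives the veto."""
    planner, vetoer = Planner(), Vetoer()
    for plan in planner.propose(plans):
        if not vetoer.veto(plan):
            return plan
    return None
```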
If you want a hot take on fully updated deference, I’d recommend reading The Pointers Problem, then considering how various AI architectures would handle uncertainty in their own objective-pointers.