Submission for the low-bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, i.e. that it isn’t pseudo-aligned. If it can’t, it has no influence on the world. If it can, it’ll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.