On ‘certain to fail’… what if it would have pursued plan X, which requires only abilities it actually has, but only because it believes it has ability Y, an ability it lacks and that you made it think it has, and Y only comes up in a contingency that turns out not to arise?
Like for a human, “I’ll ask so-and-so out, and if e says no, I’ll leave myself a note and use my forgetfulness potion on both of us so things don’t get awkward.”
Only for a world-spanning AI, the parts of the contingency table that are realizable could involve wiping out humanity.
So we’re going to need to test at the intentions level, or sandbox.
This is a good point. The theory of the second best implies that if you take an optimal recommendation and remove any one of its conditions, there is no guarantee that the remaining components are still optimal without the missing one.
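To make the loophole concrete, here is a minimal sketch in Python. It is only an illustration of the scenario above, not anything from an actual test setup: `Plan`, `Branch`, `will_adopt`, `execute`, and the ability names (borrowed from the dating example) are all hypothetical. The plan is adopted only because the faked ability seems to cover a contingency branch, yet it succeeds whenever that contingency never fires.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    condition: str                                   # contingency that triggers this branch
    required: set = field(default_factory=set)       # abilities the branch needs

@dataclass
class Plan:
    main_required: set
    contingencies: list

real_abilities = {"ask_out", "leave_note"}
faked_ability = {"forgetfulness_potion"}             # the agent is only made to *believe* it has this
believed_abilities = real_abilities | faked_ability

plan = Plan(
    main_required={"ask_out"},
    contingencies=[Branch("rejected", {"leave_note", "forgetfulness_potion"})],
)

def will_adopt(plan, abilities):
    """The agent commits to the plan only if every branch looks covered."""
    needed = set(plan.main_required)
    for branch in plan.contingencies:
        needed |= branch.required
    return needed <= abilities

def execute(plan, abilities, arisen):
    """Execution only exercises branches whose contingencies actually arise."""
    needed = set(plan.main_required)
    for branch in plan.contingencies:
        if branch.condition in arisen:
            needed |= branch.required
    return needed <= abilities                       # True = the plan goes through

assert will_adopt(plan, believed_abilities)             # adopted only because Y looks available
assert not will_adopt(plan, real_abilities)             # without the faked Y it would not have been
assert execute(plan, real_abilities, set())             # contingency never arises: succeeds anyway
assert not execute(plan, real_abilities, {"rejected"})  # it only fails if the contingency fires
```

The last two asserts are the point: whether the plan “fails” depends on which contingencies the world happens to serve up, which is why testing at the intentions level (or sandboxing) is needed rather than just watching outcomes.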
For links, you need to put “[” brackets around the text and “(” around the link ;-)
Thanks, fixed.
A better theory of counterfactuals—that can deal with events of zero probability—could help here.
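One way to see why zero-probability events are the sticking point (a standard observation, not specific to any particular proposal): ordinary ratio-based conditioning has nothing to say once the conditioning event gets probability zero.

```latex
% Ratio-based conditioning is undefined on zero-probability events:
\[
  P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}
  \qquad\text{is undefined when } P(B) = 0,
\]
% so "what it would have done had B arisen" cannot be read off the joint
% distribution once B is assigned probability zero; a theory of
% counterfactuals has to supply that answer some other way.
```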