I think that assuming “if my approach fails, it fails in a convenient way” is not looked on favorably by Mr. Murphy’s line of reasoning, absent some rigorous guarantees.
I take your point that it would be better to have a reliable shutdown switch that’s entirely separate from your alignment scheme, so that if the scheme fails completely you have a backup. I don’t think that’s possible for a full ASI, in agreement with MIRI’s conclusion. It could be that Wentworth’s proposal would work; I’m not sure. At the least, the inclusion of two negotiating internal agents would seem to impose a high alignment tax. So I’m offering an alternative approach.
I’m not assuming my approach can fail in only that one way; I’m saying it does cover that one failure mode, which seems to cover part of what you were asking for above:
You can say “shit, this superintelligence is not, actually, doing what I mean, and probably is going to kill me”, shut it down, and try again, entering the realm of iterative design.
If the ASI has decided to just not do what I mean and to do something else instead, then no, it won’t shut down. Alignment has failed for technical reasons, not theoretical ones. That seems possible for any alignment scheme. But if it’s misunderstood what I mean, or I didn’t think through the consequences of what I mean well enough, then I get to iteratively design my request.
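To make that distinction concrete, here’s a minimal sketch of the iterative-design loop I’m describing. It’s purely illustrative; every name in it is a hypothetical placeholder, not any real system’s API. The point is the branch structure: as long as the system stays shutdown-compliant, a misunderstood or under-specified request just sends you around the loop again, while a system that refuses shutdown exits iterative design entirely.

```python
# A minimal, purely illustrative sketch of the iterative-design loop described
# above. Every name here (Outcome, run_system, refine_request, ...) is a
# hypothetical placeholder, not any real system's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    does_what_i_mean: bool    # did the run do what the requester actually meant?
    shutdown_compliant: bool  # does the system still honor the shutdown switch?

def run_system(request: str) -> Outcome:
    """Stand-in for actually running the system on a request."""
    return Outcome(does_what_i_mean=False, shutdown_compliant=True)

def refine_request(request: str, outcome: Outcome) -> str:
    """Stand-in for revising a misunderstood or under-specified request."""
    return request + " (clarified)"

def iterative_design(request: str, max_rounds: int = 10) -> Optional[Outcome]:
    for _ in range(max_rounds):
        outcome = run_system(request)
        if outcome.does_what_i_mean:
            return outcome  # alignment held on this round
        if not outcome.shutdown_compliant:
            # The system decided not to comply: a technical alignment failure,
            # and iteration is no longer available from here.
            raise RuntimeError("shutdown refused; outside iterative design")
        # A mere misunderstanding or under-specified request: shut down,
        # revise, and try again.
        request = refine_request(request, outcome)
    return None  # gave up without a satisfactory outcome
```

The shutdown-compliance check is the backup you were asking for; only the misunderstanding branch keeps you inside iterative design.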