I guess I didn’t make clear that I was talking about proof-checking rather than proof-finding. And, of course, we ask the designer to find the proof—if it can’t provide one, then we (and it) have no reason to trust the design.
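To make the distinction concrete, here's a minimal sketch of what the checking side looks like, in Python, with a made-up proof format and a single inference rule chosen purely for illustration: verifying a supplied derivation is one mechanical pass over its lines, while producing that derivation is a search problem, and the search is exactly the part we hand to the designer.

```python
# Toy illustration of proof-checking vs. proof-finding (hypothetical format).
# A "proof" is a list of formulas; every line must be an axiom or must follow
# from earlier lines by modus ponens. Implications are written ("->", p, q).

def by_modus_ponens(conclusion, earlier):
    # True if some earlier line is ("->", p, conclusion) with p also earlier.
    return any(
        isinstance(f, tuple) and f[0] == "->" and f[2] == conclusion and f[1] in earlier
        for f in earlier
    )

def check_proof(axioms, proof, goal):
    accepted = []
    for line in proof:
        if line in axioms or by_modus_ponens(line, accepted):
            accepted.append(line)
        else:
            return False                     # unjustified step: reject
    return bool(proof) and proof[-1] == goal

# The designer does the hard part (finding the proof); we only run the check.
axioms = {"A", ("->", "A", "B"), ("->", "B", "C")}
proof = ["A", ("->", "A", "B"), "B", ("->", "B", "C"), "C"]
print(check_proof(axioms, proof, "C"))       # True
print(check_proof(axioms, ["C"], "C"))       # False: no justification supplied
```

The check is cheap and mechanical in the length of the supplied derivation; the expensive part is the search for that derivation, which is why it makes sense to demand the proof from the much smarter designer and keep only the verification step for ourselves.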
The proof-checking approach still assumes, in the first place, that the AI will be motivated to design a successor that preserves its own goal system. If it wants to do this, or can be made to want it just by being told to, and you have a very good reason to believe that, then you've already solved the problem. We're just not sure whether that comes automatically. There are intuitive arguments that it does, like the one about Gandhi and the murder-pill, but I'm convinced the stakes are high enough that we should prove it before we push the On button on anything that is, or could become, smarter than us.

The danger is that while you're waiting for it to deliver a new program and a proof of correctness to verify, it might instead unbox itself and go off to act on Friendly intentions with an unstable self-modification mechanism, and we end up with a very powerful optimization process whose goal system only stabilizes after it has become worthless. Even an AI with no goals other than truthfully answering questions is still dangerous: you can ask it to design a provably stable reflective decision theory, and perhaps it will try, but if it doesn't already have a Friendly motivational mechanism, it may go about finding the answer in less-than-agreeable ways.

Again, as per the Omohundro paper, we can expect recursive self-improvement to be essentially automatic (whether or not the AI has access to its own code). We don't know whether value preservation is automatic, and we know that Friendliness is definitely not. So creating a non-provably-stable or non-Friendly AI and trying to have it solve these problems is putting the cart before the horse, and the risk of it backfiring is too great.
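For what it's worth, the Gandhi-and-the-murder-pill intuition can be put as a toy expected-utility calculation; the numbers and the two-outcome world below are invented purely for illustration. An agent that scores candidate self-modifications with its current utility function will reject one that redirects its future behavior toward outcomes it currently disvalues.

```python
# Toy version of the Gandhi / murder-pill argument (all values hypothetical).
# The agent scores each candidate successor with its CURRENT utility function,
# applied to the outcomes that successor is forecast to bring about.

current_utility = {"people_saved": +1.0, "people_killed": -100.0}

successors = {
    "keep_current_values": {"people_saved": 10, "people_killed": 0},
    "take_murder_pill":    {"people_saved": 0,  "people_killed": 10},
}

def score(outcome):
    # Expected utility of an outcome under the agent's current values.
    return sum(current_utility[k] * v for k, v in outcome.items())

best = max(successors, key=lambda name: score(successors[name]))
print(best)  # keep_current_values: the pill loses under the current values
```

The sketch only shows why goals would be preserved if the agent evaluates modifications with its current values in this way; whether a real self-improving system reliably does so is exactly the part I think needs a proof before anyone pushes the On button.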
Your final sentence is a slogan, not an argument.
It was neither; it was intended only as a summary of the conclusion of the points I was arguing in the preceding paragraphs.