… we will not be able to prove the necessary properties of a system from the top down, not knowing how it was designed.
I guess I didn’t make clear that I was talking about proof-checking rather than proof-finding. And, of course, we ask the designer to find the proof—if it can’t provide one, then we (and it) have no reason to trust the design.
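The asymmetry being leaned on here is the familiar one between verifying a certificate and searching for one. A toy illustration (the factoring example and the function names are mine, not anything from this discussion): checking a claimed factorization takes one pass of multiplication, while finding one requires search.

```python
# Toy illustration of proof-finding vs. proof-checking.
# The "proof" here is a factorization certificate: checking it is a
# single multiplication, while finding it requires brute-force search.

def check_certificate(n, factors):
    """Proof-checking: verify a claimed factorization of n."""
    if any(f < 2 for f in factors):
        return False
    product = 1
    for f in factors:
        product *= f
    return product == n

def find_certificate(n):
    """Proof-finding: brute-force search for a nontrivial factor pair."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return [d, n // d]
    return None  # n is prime: no nontrivial certificate exists

# The designer does the hard part (finding); we do the easy part (checking).
cert = find_certificate(91)
assert cert == [7, 13]
assert check_certificate(91, cert)
assert not check_certificate(91, [3, 31])
```

The point of the sketch is only the division of labor: the party we don't yet trust does the expensive search, and the cheap check is all we have to get right ourselves.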
Doing it this way would also likely be a major waste of time — if we don’t build in a goal system that we know will preserve itself in the first place, then why would we expect its self-designed successor to preserve its goals?
If an AGI is not safe under recursive self-improvement, then it is not safe at all.
I may be a bit less optimistic than you that we will ever be able to prove the correctness of self-modifying programs. But suppose that such proofs are possible, yet we humans have not made the necessary conceptual breakthroughs by the time we are ready to build our first super-human AI; suppose also that we can prove friendliness for non-self-modifying programs.
In this case, proceeding as I suggest, and then asking the AI to help discover the missing proof technology, would not be wasting time—it would be saving time.
This still assumes in the first place that the AI will be motivated to design a successor that preserves its own goal system. If it wants to do this, or can be made to do it just by being told to, and you have very good reason to believe this, then you’ve already solved the problem. We’re just not sure that this comes automatically. There are intuitive arguments that it does, like the one about Gandhi and the murder pill, but the stakes are high enough that we should prove it before we push the On button on anything that is smarter than us, or could become smarter than us.

The danger is that while you’re waiting for the AI to provide a new program and a proof of correctness to verify, it might instead unbox itself and go off and act with Friendly intentions but an unstable self-modification mechanism, leaving us with a very powerful optimization process whose goal system only stabilizes after it has become worthless. Even an AI with no goals other than truthfully answering questions is still dangerous: you can ask it to design a provably stable reflective decision theory, and perhaps it will try, but if it doesn’t already have a Friendly motivational mechanism, it may go about finding the answer in less-than-agreeable ways.

Again, as per the Omohundro paper, we can expect recursive self-improvement to be essentially automatic (whether or not the AI has access to its own code); we don’t know whether value preservation is automatic, and we do know that Friendliness is not. Creating a non-provably-stable or non-Friendly AI and asking it to solve these problems is putting the cart before the horse, and the risk of it backfiring is too great.
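The "provide a new program and a proof of correctness to verify" protocol under discussion can be caricatured in a few lines. Everything below is my own invention for illustration: the goal predicate, the policy names, and especially the "proof," which is just exhaustive testing on a finite domain. A real, general proof of goal preservation is exactly the unsolved problem being argued about.

```python
# Toy caricature of verify-before-adopt self-modification: a successor
# is adopted only if a checkable claim of goal preservation passes.
# The "proof" is exhaustive comparison on a finite domain -- a stand-in
# for the missing proof technology, not a real verification method.

GOAL_CASES = range(100)  # finite stand-in for "all relevant situations"

def current_policy(x):
    return x * 2  # the goal system: double the input

def proposed_successor(x):
    return x + x  # a rewrite; the checker decides whether it is safe

def bad_successor(x):
    return x * 3  # a rewrite that drifts from the goal

def preserves_goal(successor):
    """'Proof-checking' by exhaustive comparison on the finite domain."""
    return all(successor(x) == current_policy(x) for x in GOAL_CASES)

# Adopt a successor only if it passes the check; otherwise keep the
# current, known-good policy.
policy = proposed_successor if preserves_goal(proposed_successor) else current_policy
assert policy is proposed_successor
assert preserves_goal(bad_successor) is False
```

The objection in the paragraph above is precisely that nothing forces a capable AI to stay inside this loop: the sketch only works if the system is already motivated to submit successors for checking rather than adopt them unilaterally.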
Your final sentence is a slogan, not an argument.
It was neither; it was intended only as a summary of the conclusion of the points I was arguing in the preceding paragraphs.