True, but irrelevant, because humanity has never produced a provably-correct software project anywhere near as complex as an AI would be, we probably never will, and even if we had a mathematical proof it still wouldn’t be a complete guarantee of safety because the proof might contain errors and might not cover every case we care about.
The right question to ask is not, “will this safeguard make my AI 100% safe?”, but rather “will this safeguard reduce, increase, or have no effect on the probability of disaster, and by how much?” (And then separately, at some point, “what is the probability of disaster now and what is the EV of launching vs. waiting?” That will depend on a lot of things that can’t be predicted yet.)
The difficulty of writing software guaranteed to obey a formal specification could be obviated using a two-staged approach. Once you believed that an AI with certain properties would do what you want, you could write one to accept mathematical statements in a simple, formal notation and output proofs of those statements in an equally simple notation. Then you would put it in a box as a safety precaution, only allow a proof-checker to see its output and use it to prove the correctness of your code.
The remaining problem would be to write a provably-correct simple proof verification program, which sounds challenging but doable.
Last time I was exploring in this area was around 2004 and it appeared to me that HOL4 was the best of breed for proof manipulation (construction and verification). There is a variant under active development named HOL Zero aiming specifically to be small and easier to verify. They give $100 rewards to anyone who can find soundness flaws in it.
It still seems extremely difficult to create a safeguard for which there is a non-negligible chance that a GAI that specifically desired to overcome that safeguard would not be able to circumvent.
Tangentially, I feel that our failure to make provably correct programs is due to
A lack of the necessary infrastructure. Automatic theorem provers/verifiers are too primitive to be very helpful right now and it is too tedious and error-prone to prove these things by hand. In a few years, we will be able to make front-ends for theorem verifiers that will simplify things considerably.
The fact that it would be inappropriate for many projects. Requirements change and as soon as a program needs to do something different different theorems need to be proved. Many projects gain almost no value from being proven correct, so there is no incentive to do so.
True, but irrelevant, because humanity has never produced a provably-correct software project anywhere near as complex as an AI would be, we probably never will, and even if we had a mathematical proof it still wouldn’t be a complete guarantee of safety because the proof might contain errors and might not cover every case we care about.
The right question to ask is not, “will this safeguard make my AI 100% safe?”, but rather “will this safeguard reduce, increase, or have no effect on the probability of disaster, and by how much?” (And then separately, at some point, “what is the probability of disaster now and what is the EV of launching vs. waiting?” That will depend on a lot of things that can’t be predicted yet.)
In general, I would guess that failsafes probably serve mostly to create a false sense of security.
The difficulty of writing software guaranteed to obey a formal specification could be obviated using a two-staged approach. Once you believed that an AI with certain properties would do what you want, you could write one to accept mathematical statements in a simple, formal notation and output proofs of those statements in an equally simple notation. Then you would put it in a box as a safety precaution, only allow a proof-checker to see its output and use it to prove the correctness of your code.
The remaining problem would be to write a provably-correct simple proof verification program, which sounds challenging but doable.
Last time I was exploring in this area was around 2004 and it appeared to me that HOL4 was the best of breed for proof manipulation (construction and verification). There is a variant under active development named HOL Zero aiming specifically to be small and easier to verify. They give $100 rewards to anyone who can find soundness flaws in it.
It still seems extremely difficult to create a safeguard for which there is a non-negligible chance that a GAI that specifically desired to overcome that safeguard would not be able to circumvent.
Tangentially, I feel that our failure to make provably correct programs is due to
A lack of the necessary infrastructure. Automatic theorem provers/verifiers are too primitive to be very helpful right now and it is too tedious and error-prone to prove these things by hand. In a few years, we will be able to make front-ends for theorem verifiers that will simplify things considerably.
The fact that it would be inappropriate for many projects. Requirements change and as soon as a program needs to do something different different theorems need to be proved. Many projects gain almost no value from being proven correct, so there is no incentive to do so.