I’m not sure deontological rules can work like that. I’m remembering an Asimov story featuring robots who can’t kill but can allow harm to come to humans. They end up putting humans into a deadly situation at Time A, since they know they will still be able to save them and so the prohibition doesn’t apply, and then at Time B they don’t bother to save them after all.
The difference with Asimov’s rules, as far as I know, is that the first rule (or the zeroth rule that underlies it) is in fact the utility-maximising drive rather than a failsafe protection.
To add to this, Eliezer also said that there is no reason to think that provably perfect safeguards are any easier to construct than an AI that just provably wants what you want (CEV or whatever) instead. It’s super intelligent—once it wants different things than you, you’ve already lost.
True, but irrelevant. Humanity has never produced a provably-correct software project anywhere near as complex as an AI would be, and we probably never will. Even if we had a mathematical proof, it still wouldn’t be a complete guarantee of safety: the proof might contain errors, and it might not cover every case we care about.
The right question to ask is not, “will this safeguard make my AI 100% safe?”, but rather “will this safeguard reduce, increase, or have no effect on the probability of disaster, and by how much?” (And then separately, at some point, “what is the probability of disaster now and what is the EV of launching vs. waiting?” That will depend on a lot of things that can’t be predicted yet.)
In general, I would guess that failsafes probably serve mostly to create a false sense of security.
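To make the launch-vs.-wait question above concrete, here is a minimal sketch of the expected-value comparison it implies, assuming a toy model with a single competing project; every probability and utility below is invented purely for illustration.

```python
# Toy sketch of the launch-vs.-wait comparison.  Every number here is
# hypothetical; the point is only the structure of the calculation.

u_success, u_disaster = 1.0, -10.0  # arbitrary utility scale

def ev(p_disaster):
    """Expected value of a launch with the given disaster probability."""
    return (1 - p_disaster) * u_success + p_disaster * u_disaster

p_now = 0.20        # assumed disaster probability if we launch today
p_later = 0.12      # assumed probability after further safety work
p_preempted = 0.10  # assumed chance that waiting lets a less careful project launch first
p_other = 0.50      # assumed disaster probability for that other project

ev_launch = ev(p_now)
ev_wait = (1 - p_preempted) * ev(p_later) + p_preempted * ev(p_other)
print(f"launch now: {ev_launch:+.2f}   wait: {ev_wait:+.2f}")
```

Under these made-up numbers waiting comes out ahead, but modest changes to any of the inputs flip the conclusion, which is the point: the answer depends on quantities that can’t be predicted yet.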
The difficulty of writing software guaranteed to obey a formal specification could be obviated with a two-stage approach. Once you believed that an AI with certain properties would do what you want, you could write one that accepts mathematical statements in a simple, formal notation and outputs proofs of those statements in an equally simple notation. Then you would put it in a box as a safety precaution, allow only a proof-checker to see its output, and use it to prove the correctness of your code.
The remaining problem would be to write a simple, provably-correct proof-verification program, which sounds challenging but doable.
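To give a sense of how small such a kernel can be, here is a toy sketch of my own (not the design of any real system): a checker for Hilbert-style proofs in the implicational fragment of propositional logic. A real kernel would cover higher-order logic, but the shape is the same: a few axiom schemas, one inference rule, and a line-by-line check. In the two-stage scheme above, this checker would be the only component that ever reads the boxed prover’s output.

```python
# Toy proof-checking kernel (an illustrative sketch, not a real system).
# A formula is an atom (any string) or an implication ('->', A, B).
# A proof is a list of formulas; each line must be an instance of axiom
# schema K or S, or follow from two earlier lines by modus ponens.

def is_imp(f):
    """True if f has the shape ('->', A, B)."""
    return isinstance(f, tuple) and len(f) == 3 and f[0] == '->'

def is_axiom(f):
    """True if f instantiates K: A -> (B -> A)
    or S: (A -> (B -> C)) -> ((A -> B) -> (A -> C))."""
    if not is_imp(f):
        return False
    left, right = f[1], f[2]
    if is_imp(right) and right[2] == left:          # K
        return True
    if (is_imp(left) and is_imp(left[2]) and        # S
            is_imp(right) and is_imp(right[1]) and is_imp(right[2])):
        a, b, c = left[1], left[2][1], left[2][2]
        return (right[1][1] == a and right[1][2] == b and
                right[2][1] == a and right[2][2] == c)
    return False

def check_proof(goal, proof):
    """Accept only if every line is an axiom or follows from two earlier
    lines by modus ponens, and the final line is the goal."""
    derived = []
    for step in proof:
        ok = is_axiom(step) or any(
            p == ('->', q, step) for p in derived for q in derived)
        if not ok:
            return False
        derived.append(step)
    return bool(derived) and derived[-1] == goal

# Usage: the standard five-line proof of p -> p.
imp = lambda a, b: ('->', a, b)
p, pp = 'p', ('->', 'p', 'p')
proof = [
    imp(imp(p, imp(pp, p)), imp(imp(p, pp), pp)),  # S instance
    imp(p, imp(pp, p)),                            # K instance
    imp(imp(p, pp), pp),                           # modus ponens from lines 2, 1
    imp(p, pp),                                    # K instance
    pp,                                            # modus ponens from lines 4, 3
]
assert check_proof(pp, proof)
```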
The last time I explored this area was around 2004, and it appeared to me that HOL4 was the best of breed for proof manipulation (construction and verification). There is a variant under active development, named HOL Zero, aiming specifically to be small and easier to verify; they offer $100 rewards to anyone who can find soundness flaws in it.
It still seems extremely difficult to create a safeguard that has a non-negligible chance of stopping a GAI that specifically desires to overcome it.
Tangentially, I feel that our failure to make provably correct programs is due to:
1. A lack of the necessary infrastructure. Automatic theorem provers/verifiers are too primitive to be very helpful right now, and it is too tedious and error-prone to prove these things by hand. In a few years, we will be able to make front-ends for theorem verifiers that will simplify things considerably.
2. The fact that it would be inappropriate for many projects. Requirements change, and as soon as a program needs to do something different, different theorems need to be proved. Many projects gain almost no value from being proven correct, so there is no incentive to do so.
I’m remembering an Asimov story featuring robots who can’t kill but can allow harm to come to humans. They end up putting humans into a deadly situation at Time A, since they know they will still be able to save them and so the prohibition doesn’t apply, and then at Time B they don’t bother to save them after all.
You are thinking of the short story Little Lost Robot, in which a character (robo-psychologist Susan Calvin) speculates that a robot built with a weakened First Law (omitting “or through inaction, allow a human to come to harm”) might harm a human in exactly that way, though it never actually occurs in the story.