Why do you want to build something that is broken? Why not just build nothing?
Because broken != totally nonfunctional.
If we have an AI which we believe to be friendly, but cannot verify to be so, we add the fuse I described, then start it. As long as the AI does not try to kill humanity or try to understand the red wire too well, it should operate pretty much like an unmodified AI.
From time to time, however, it will conclude the wrong things. For example, it might waste significant resources on producing red wires in order to conduct various experiments on them. Thus the modified AI is not optimal in our universe, and it contains one known bug. Hence I think it is justified to call it broken.
The problem with this idea is that it prevents us from creating an AI that is (even in principle) able to find and fix bugs in itself.
Given the size of the problem, I wouldn’t trust humans to produce a bug-free program (plus the hardware!) even after decades of code audits. So I’d very much like the AI to be capable of noticing that it has a bug. And I’m pretty sure that an AI that can figure out that it has a silly belief caused by a flipped bit somewhere will figure out why that red wire “can” transmute at a distance. If we ever manage to make a hyper-intelligent machine with this kind of “superstition”, I shudder to think what its opinion might be of the fact that the humans who built it apparently created the red wire in an attempt to manipulate it, thus (hyper-ironically!) sealing their own fate; it will certainly be able to deduce all that from historical observations, recordings, etc.
It cannot fix bugs in its priors. As for any other part of the system, e.g. sensor drivers, the AI can fix the hell out of itself. Anything which can be fixed is not a true prior, though. If we allow the AI to change its prior completely, then it is effectively acting upon a prior which does not include any probability-1 entries.
There is no reason to fix the red wire belief if you are certain that it is true. All the evidence is against it, but the red wire does magic with probability 1, hence something must be wrong with the evidence (e.g. sensor errors).
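To spell that out (a standard Bayes’-rule calculation, nothing specific to this proposal): write H for “the red wire does magic” and suppose the implanted prior is P(H) = 1. Then for any evidence E with P(E | H) > 0,

\[
P(H \mid E)
= \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}
= \frac{P(E \mid H) \cdot 1}{P(E \mid H) \cdot 1 + P(E \mid \neg H) \cdot 0}
= 1 .
\]

No stream of observations can ever move the belief; the only consistent response to contrary data is to doubt the data itself.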
Isn’t being able to fix bugs in your priors a large part of the point of Bayesianism?
I take it the AI can update its priors; it just can’t hack them. It can update all it wants from 1.
You have to be really, really sure that no priors other than the ones you implant as safeguards can reach 1 (e.g., via a rounding bug), that the AI will never need to stop using the Bayesian algorithms you wrote and “port” its priors to some other reasoning method, and that nothing ever gives it a reason to hack its priors using something other than simple Bayesian updating (e.g., if it suspects previous bugs, or discovers more efficient reasoning methods). Remember Eliezer’s “dystopia”, with the AI that knew its creator was wrong but couldn’t help being evil because of its constraints?
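To make the rounding worry concrete, here is a toy sketch (illustrative Python of my own, assuming a naive float64 implementation of the update, not anyone’s actual AI code): after a modest run of one-sided evidence, the posterior rounds to exactly 1.0, and from then on no amount of counter-evidence can ever move it.

# Toy illustration: one Bayes update for a binary hypothesis H, in plain float64.
def update(prior, likelihood_h, likelihood_not_h):
    numerator = likelihood_h * prior
    denominator = numerator + likelihood_not_h * (1.0 - prior)
    return numerator / denominator

p = 0.5
steps = 0
# Feed in repeated evidence favouring H until rounding pushes p to exactly 1.0.
while p < 1.0:
    p = update(p, likelihood_h=0.99, likelihood_not_h=0.01)
    steps += 1
print(f"posterior rounded to exactly 1.0 after {steps} updates")

# From here on, even overwhelming evidence against H changes nothing:
for _ in range(1000):
    p = update(p, likelihood_h=0.0001, likelihood_not_h=0.9999)
print(f"posterior after 1000 strong counter-updates: {p}")  # still 1.0

An accidental probability-1 belief behaves exactly like a deliberately implanted one, which is why the rounding case matters.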
But other than that, you’re right.