You cannot plug in such safeties without the AI detecting them. Humans were able to map out many of their own design flaws and put them into nice little books (which then get used against rivals). An AI would be able to figure that out too.
I do not consider the AI detecting the obvious flaw in its prior to be a problem. Certainly it is advantageous to have such a prior in a universe where the red wire would eliminate humanity. And the probability that the AI is in such a universe is 1 according to the AI. So its prior is just right.
No evidence whatsoever can possibly convince the AI that it is in a universe where the red wire is just a red wire.
We are not telling the AI a lie, we are building an AI that is broken.
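A minimal sketch of the arithmetic behind the probability-1 claim, with made-up likelihoods (the function and numbers are purely illustrative, not anything from the proposed design):

```python
# Bayes' rule for a single hypothesis H ("the red wire works magic") and
# one piece of evidence E:
#   P(H|E) = P(E|H) * P(H) / (P(E|H) * P(H) + P(E|~H) * P(~H))

def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) after observing evidence E."""
    numerator = p_e_given_h * prior_h
    denominator = numerator + p_e_given_not_h * (1.0 - prior_h)
    return numerator / denominator

# Evidence that is a million times more likely if the wire is ordinary:
print(posterior(1.0, 1e-6, 1.0))       # 1.0 -- a prior of exactly 1 never moves
print(posterior(0.999999, 1e-6, 1.0))  # ~0.5 -- an extreme but uncertain belief still moves
```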
I think you mix up the goals of the AI (paperclipping, for instance) and the model of reality it develops. I assume that it serves any goal of an AI best to have a highly realistic model of the world, and that it would detect any kind of tampering with that. Now I have no idea what happens if you hardcode a part of its view of nature, but I can imagine it will not be pleasant. It is crippling to limit thought in that way, and maybe you prevent it from discovering something important.
I think a big problem of FAI is that valuing humans and/or human values (however defined) may fall under superstition, even if it seems more attractive to us and less arbitrary than a red wire/thermite setup.
If an FAI must value people, and is programmed to not be able to think near a line of thought which would lead it to not valuing people, is it significantly crippled? Relative to what we want, there’s no obvious problem, but would it be so weakened that it would lose out to UFAIs?
What line of thought could lead an FAI not to value people, that it would have to avoid? What does it mean for a value system to be superstitious? An agent’s goal system can’t be ‘incorrect’. (see also: Ghosts in the Machine, the metaethics sequence)
Why do you want to build something that is broken? Why not just build nothing?
Because broken != totally nonfunctional.
If we have an AI which we believe to be friendly, but cannot verify to be so, we add the fuse I described, then start it. As long as the AI does not try to kill humanity or try to understand the red wire too well, it should operate pretty much like an unmodified AI.
From time to time, however, it will conclude the wrong things. For example, it might waste significant resources on the production of red wires in order to conduct various experiments on them. Thus the modified AI is not optimal in our universe, and it contains one known bug. Hence I think it is justified to call it broken.
The problem with this idea is that it prevents us from creating an AI that is (even in principle) able to find and fix bugs in itself.
Given the size of the problem, I wouldn’t trust humans to produce a bug-free program (plus the hardware!) even after decades of code audits. So I’d very much like the AI to be capable of noticing that it has a bug. And I’m pretty sure that an AI that can figure out that it has a silly belief caused by a flipped bit somewhere will figure out why that red wire “can” transmute at a distance. If we even manage to make a hyper-intelligent machine with this kind of “superstition”, I shudder to think what its opinion might be of the fact that the humans who built it apparently created the red wire in an attempt to manipulate it, thus (hyper-ironically!) sealing their fate — it will certainly be able to deduce all that from historical observations, recordings, etc.
It cannot fix bugs in its priors. As for any other part of the system, e.g. sensor drivers, the AI can fix the hell out of itself. Anything which can be fixed is not a true prior, though. If we allow the AI to change its prior completely, then it is effectively acting upon a prior which does not include any probability-1 entries.
There is no reason to fix the red wire belief if you are certain that it is true. All the evidence is against it, but the red wire does magic with probability 1, hence something must be wrong with the evidence (e.g. sensor errors).
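A small sketch of where the probability mass goes instead, again with invented numbers: since the magic hypothesis is pinned at 1, a disconfirming experiment can only lower the AI’s confidence in auxiliary hypotheses such as “my sensors are reliable”.

```python
# With P("the red wire works magic") pinned at exactly 1, a disconfirming
# experiment cannot touch that belief; the update lands on auxiliary
# hypotheses instead. S = "my sensors/records are reliable".

def p_sensors_ok(prior_s, p_obs_given_s, p_obs_given_not_s):
    """P(S | observation), computed with the magic hypothesis held certain."""
    numerator = p_obs_given_s * prior_s
    return numerator / (numerator + p_obs_given_not_s * (1.0 - prior_s))

# Observation: the red wire behaves like an ordinary wire in a test where
# real magic (plus working sensors) would predict something dramatic.
print(p_sensors_ok(prior_s=0.99, p_obs_given_s=1e-6, p_obs_given_not_s=0.5))
# ~0.0002 -- the AI concludes its sensors are broken, not that the wire is ordinary.
```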
Isn’t being able to fix bugs in your priors a large part of the point of Bayesianism?
I take it the AI can update its priors; it just can’t hack them. And it can update all it wants from 1 — it will stay at 1.
You have to be really, really sure that no priors other than the ones you implant as safeguards can reach 1 (e.g., via a rounding bug), that the AI will never need to stop using the Bayesian algorithms you wrote and “port” its priors to some other reasoning method, and that nothing ever gives it a reason to hack its priors using something other than simple Bayesianism (e.g., if it suspects previous bugs, or discovers more efficient reasoning methods). Remember Eliezer’s “dystopia”, with the AI that knew its creator was wrong but couldn’t help being evil because of its constraints?
But other than that, you’re right.
http://lesswrong.com/lw/uw/entangled_truths_contagious_lies/
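A toy illustration of the rounding worry, assuming only that beliefs are stored as ordinary double-precision probabilities (nothing here reflects any actual AI design): a steady stream of one-sided evidence eventually rounds a non-safeguard belief to exactly 1.0, after which, as above, no evidence can ever move it; a log-odds representation does not saturate this way.

```python
import math

# Beliefs stored as plain float64 probabilities: a steady stream of 10:1
# evidence for H eventually rounds P(H) to exactly 1.0.

def update(p, likelihood_ratio):
    """One Bayes update on probability p (breaks down once p == 1.0)."""
    odds = (p / (1.0 - p)) * likelihood_ratio
    return odds / (1.0 + odds)

p = 0.5
for step in range(60):
    p = update(p, 10.0)
    if p == 1.0:
        print("P(H) rounded to exactly 1.0 after", step + 1, "updates")
        break

# A log-odds representation never saturates: later evidence against H can
# always subtract from the running total, whereas a stored 1.0 is stuck.
log_odds = 0.0
for _ in range(60):
    log_odds += math.log(10.0)
print("log-odds after 60 updates:", log_odds)
```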