What counts as ‘resources’? Do we think that ‘hardware’ and ‘software’ are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?
What is “taking over the world”, if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others have pointed out, that an AI which “misunderstands” your original instructions will demonstrate this sooner rather than later. For instance, if you create a resource “honeypot” outside the AI which is trivial to take, a misaligned AI would naturally take that first, and then you know there’s a problem. It is not going to figure out that you don’t want the honeypot taken before it takes it.
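For concreteness, the honeypot idea boils down to a very small audit loop. Here is a minimal sketch in Python; the resource names and the `agent_owns` callable are hypothetical stand-ins for whatever audit mechanism actually tracks what the AI has acquired:

```python
# Hypothetical honeypot check: expose resources the AI has no legitimate
# reason to grab, and treat any acquisition of them as evidence that it
# has misunderstood its instructions. All names here are illustrative.

HONEYPOT_RESOURCES = ["decoy_compute_cluster", "decoy_credentials"]

def tripped_honeypots(agent_owns) -> list[str]:
    """agent_owns: callable(resource) -> bool, supplied by the audit
    mechanism that reports what the AI currently controls."""
    return [r for r in HONEYPOT_RESOURCES if agent_owns(r)]

# Toy audit log standing in for the real mechanism:
acquired = {"decoy_compute_cluster"}
print(tripped_honeypots(lambda r: r in acquired))
# -> ['decoy_compute_cluster'], i.e. the AI took the bait; investigate now,
#    while the stakes are still trivial.
```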
Hm? That seems to only penalize it for self-deception, not for deceiving others.
When I say “predict”, I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party.
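To pin down what “taking a utility hit” means here, a minimal sketch, assuming a hypothetical `score_deviation` function standing in for the third party’s judgment of how far the published account strayed from what actually happened:

```python
# Sketch of the prediction penalty: the AI publishes an account of what
# will happen next; a third party later scores how far reality deviated
# from that account, and the deviation is subtracted from the AI's
# utility. `score_deviation` is a hypothetical stand-in, not a real API.

def score_deviation(published_account: str, actual_outcome: str) -> float:
    """Third-party deviation judgment: 0.0 (spot on) to 1.0 (nothing matched)."""
    # Toy stand-in; a real system would use human judges or an agreed metric.
    return 0.0 if published_account == actual_outcome else 1.0

def adjusted_utility(base_utility: float, published_account: str,
                     actual_outcome: str, penalty_weight: float = 10.0) -> float:
    """Utility after the hit for an inaccurate published prediction."""
    deviation = score_deviation(published_account, actual_outcome)
    return base_utility - penalty_weight * deviation

print(adjusted_utility(5.0, "the patient recovers", "the patient recovers"))   # 5.0
print(adjusted_utility(5.0, "the patient recovers", "the patient got worse"))  # -5.0
```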
You’re talking about an Oracle AI. This is one useful avenue to explore, but it’s almost certainly not as easy as you suggest:
The first part of what you copy-pasted seems to say that “it’s nontrivial to implement”. No shit, but I never said otherwise. Then there are a bunch of “what if” scenarios that I think are not particularly likely and kind of contrived:
Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.
Because asking for understandable plans means you can’t ask for plans you don’t understand? And you’re saying that refusing to give a plan counts as success and not failure? Sounds like a strange setup that would be corrected almost immediately.
And if the preference function was just over the human’s ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome.
If the AI has the right idea about “human understanding”, I would think it would also have the right idea about what we mean by “good”. Also, why would you implement such a function before asking the AI to evaluate examples of “good” and provide its own?
And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
Is making humans happy so hard that it’s actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter?
And if you ask it to tell you whether “taking happy pills” is an outcome most humans would approve of, what is it going to answer? If it’s going to do this for happiness, won’t it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becoming a wirehead, without anyone ever picking up on the trend, is actually less effort than just giving humans what they really want? To me this is like driving a whole extra hour to get to a store that sells an item you want fifty cents cheaper.
I’m not saying these things are not possible. I’m saying that they are contrived: they are constructed for the express purpose of being failure modes, but there’s no reason to think they would actually happen, especially given that they seem to be more complicated than the desired behavior.
Now, here’s the thing: you want to develop FAI. In order to develop FAI, you will need tools, and the best tool is Tool AI. Consider a bootstrapping scheme: in order for commands written in English to be properly followed, you first build an AI for the express purpose of modelling human language semantics. You can check that this AI is on the same page as you are by discussing with it and asking questions such as “is doing X in line with the objective ‘Y’?”; it doesn’t even need to be self-modifying at all. The resulting AI can then be turned into a utility-function computer: you give the first AI an English statement and build a second AI that maximizes the utility the first AI assigns.
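Roughly, the division of labour I have in mind looks like the sketch below. `SemanticModel` and `UtilityOptimizer` are hypothetical names, purely to show the shape of the scheme: the first AI answers “how well does this outcome fit this English objective?”, and the second AI treats that answer as its utility.

```python
# Sketch of the two-stage bootstrap: AI #1 models human language semantics;
# AI #2 is a plain optimizer whose utility is whatever AI #1 assigns.
# Hypothetical classes, only illustrating the division of labour.

class SemanticModel:
    """First AI: non-self-modifying model of human language semantics."""

    def alignment_score(self, objective: str, outcome: str) -> float:
        """How well `outcome` satisfies the English `objective`, in [0, 1]."""
        raise NotImplementedError  # the hard part being bootstrapped

    def discuss(self, question: str) -> str:
        """Used during vetting: 'is doing X in line with the objective Y?'"""
        raise NotImplementedError


class UtilityOptimizer:
    """Second AI: maximizes whatever utility the first AI reports."""

    def __init__(self, semantics: SemanticModel, objective: str):
        self.semantics = semantics
        self.objective = objective

    def utility(self, outcome: str) -> float:
        # The optimizer never interprets English itself; it defers entirely
        # to the vetted semantic model for its utility measurement.
        return self.semantics.alignment_score(self.objective, outcome)
```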
And let’s be frank here: how else do you figure friendly AI could be made? The human brain is a complex, organically grown, possibly inconsistent mess; you are not going, from human wits alone, to build some kind of formal proof of friendliness, even a probabilistic one. More likely than not, there is no such thing: concepts such as life, consciousness, happiness or sentience are ill-defined, and you can’t even demonstrate the friendliness of a single human being, or of a group of human beings, let alone of humanity as a whole, which is itself a poorly defined thing.
However, massive amounts of information about our internal thought processes leak through our language. You need AI to sift through it and model these processes, their average and their variance. You need AI to extract this information, fill in the holes, and produce probability clouds about intent that match whatever borderline incoherent porridge of ideas our brains implement as the end result of billions of years of evolutionary fumbling. In a sense, I guess this would be X in your seed AI: an AI which has already demonstrated, to our satisfaction, that it understands what we mean, and which directly takes charge of a second AI’s utility measurement. I don’t really see any alternatives: if you want FAI, start by focusing on AI that can extract meaning from sentences. Reliable semantic extraction is virtually a prerequisite for FAI: if you can’t do the former, forget about the latter.
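In code terms, the “probability clouds about intent” would be a distribution over candidate readings rather than a single parse. A minimal sketch with a hypothetical interface, just to pin down the shape of the output I mean; the example readings and weights are made up:

```python
# Sketch of semantic extraction as a distribution over intents: the model
# returns candidate readings of a sentence with weights, so downstream
# components can see both the most likely reading and how much ambiguity
# remains. Hypothetical interface, hard-coded toy data.

from dataclasses import dataclass

@dataclass
class IntentHypothesis:
    reading: str        # one candidate interpretation of the sentence
    probability: float  # how much of the "cloud" this reading covers

def extract_intent(sentence: str) -> list[IntentHypothesis]:
    """Stand-in for a semantic model learned from human language data."""
    if sentence == "make people happy":
        return [
            IntentHypothesis("improve people's lives as they themselves judge it", 0.9),
            IntentHypothesis("maximize a chemical happiness signal", 0.1),
        ]
    return [IntentHypothesis("unknown", 1.0)]

# The spread of the cloud is itself information: a non-negligible weight on
# the wirehead reading is exactly the thing to surface and discuss with the
# AI before handing the objective to an optimizer.
for h in extract_intent("make people happy"):
    print(f"{h.probability:.0%}  {h.reading}")
```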