Why would that be more of a problem for an AI than a human?
? The point is that having a simulated brain and saying “do what this brain approves of” does not make the AI safe, as defining the circumstance in which the approval is acceptable is a hard problem.
This is a problem for us controlling an AI, not a problem for the AI.
I still don’t get it. We assume acceptability by default. We don’t constantly stop and ask “Was that extracted under torture?”
I do not understand your question. It was suggested that an AI run a simulated brain, and ask the brain for approval for doing its action. My point was that “ask the brain for approval” is a complicated thing to define, and puts no real limits on what the AI can do unless we define it properly.
OK. You are assuming the superintelligent AI will pose the question in a dumb way?
No, I am assuming the superintelligent AI will pose the question in the way it will get the answer it prefers to get.
Oh, you’re assuming it’s malicious. In order to prove...?
No, not assuming it’s malicious.
I’m assuming that it has some sort of programming along the lines of “optimise X, subject to the constraint that uploaded brain B must approve your decisions.”
Then it will use the most twisted definition of “approve” that it can find, in order to best optimise X.
Then programme it with:
Prime directive—interpret all directives according to your makers’ intentions.
Secondary directive—do nothing that goes against the uploaded brain.
Tertiary objective—optimise X.
And how do you propose to code the prime directive? (with that, you have no need for the other ones; the uploaded brain is completely pointless)
The prime directive is the tertiary directive for a specific X.
That’s not a coding approach for the prime directive.
You have already assumed you can build an AI that optimises X. I am not assuming anything different.
In fact any AI that self-improves is going to have to have some sort of goal of getting things right, whether instrumental or terminal. Terminal is much safer, to the extent that it might even solve the whole friendliness problem.
No, you are assuming that we can build an AI that optimises a specific thing, “interpret all directives according to your makers’ intentions”. I’m assuming that we can build an AI that can optimise something, which is very different.
An AI that can self-improve considerably does already interpret a vast number of directives according to its makers’ intentions, since self-improvement is an intentional feature.
Being able to predict a program’s behavior is a prerequisite if you want the program to work well, since unpredictable behavior tends to be chaotic and detrimental to the overall performance. In other words, if you have an AI that does not work according to its makers’ intentions, then you have an AI that does not work, or which is not very powerful.
Goedel machines already specify self-improvement in formal mathematical form. If you can specify human morality in a similar formal manner, I’ll be a lot more relaxed.
Also, I don’t assume self-improvement. Some models of powerful intelligence don’t require it.
So you’re saying the orthogonality thesis is false?
???
Orthogonality thesis: an AI that optimises X can be built, in theory, for almost any X.
My assumption in this argument: an AI that optimises X can be built, for some X.
What we need: a way of building, in practice, the specific X we want.
In fact, let’s be generous: you have an AI that can optimise any X you give it. All you need to do now is specify that X to get the result we want. And no, “interpret all directives according to your makers’ intentions” is not a specification.
But it’s an instruction humans are capable of following within the limits of their ability.
If I were building a non-AI system to do X, then I would have to specify X. But AIs are learning systems.
If you are going to admit that there is a difference between theoretical possibility and practical likelihood in the OT, then most of the UFAI argument goes out of the window, since the Lovecraftian Horrors that so densely populate mindspace are only theoretical possibilities.
Because they desire to do so. If for some reason the human has no desire to follow those instructions, then they will “follow” them formally but twist them beyond recognition. Same goes for AI, except that they will not default to desiring to follow them, as many humans would.
What an AI does depends on how it is built. You keep arguing that one particular architectural choice, with an arbitrary top level goal and only instrumental rationality, is dangerous. But that choice is not necessary or inevitable.
(Almost) any top level goal that does not specify human safety.
Self-modifying AIs will tend towards instrumental rationality according to Omohundro’s arguments.
Good. How do you propose to avoid that happening? You seem extraordinarily confident that these as-yet-undesigned machines, developed and calibrated in a training environment only, by programmers who don’t take AI risk seriously, and put potentially into positions of extreme power where I wouldn’t trust any actual human—will end up capturing almost all of human morality.
That confidence, I’d surmise, often goes hand in hand with an implicit or explicit belief in objective morality.
If you don’t think people should believe in it, argue against it, and not just a strawman version.
I’ve argued against both convergent goal fidelity regarding the intended (versus the actually programmed in) goals and against objective morality at length, and multiple times. I can dig up a few comments, if you’d like. I don’t know what strawman version you’re referring to, though: the accuracy/inaccuracy of my assertion doesn’t affect the veracity of your claim.
The usual strawmen are The Tablet and Written into the Laws of the Universe.
There is no reason to suppose they will not tend to epistemic rationality, which includes instrumental rationality.
You have no evidence that AI researchers aren’t taking AI risk seriously enough, given what they are in fact doing. They may not be taking your arguments seriously, and that may well be because your arguments are not relevant to their research. A number of them have said as much on this site.
Even aside from the relevance issue, the MIRI argument constantly assumes that superintelligent AIs will have inexplicable deficits. Superintelligent but dumb doesn’t make logical sense.
And you’ve redefined “anything but perfectly morally in tune with humanity” as “dumb”. I’m waiting for an argument as to why that is so.
There’s an argument that an SAI will figure out the correct morality, and there’s an argument that it won’t misinterpret directives. They are different arguments, and the second is much stronger.
I now see your point. I still don’t see how you plan to code an “interpret these things properly” piece of the AI. I think working through a specific design would be useful.
I also think you should work your argument into a Less Wrong post (and send me a message when you’ve done that, in case I miss it), as 12 or so levels deep into a comment thread is not a place most people will ever see.
Not really. Given the first, we can instruct “only do things that [some human or human group with nice values] would approve of” and we’ve got an acceptable morality.
By “interpret these things correctly”, do you mean linguistic competence, or a goal?
The linguistic competence is already assumed in any AI that can talk its way out of a box (i.e. not AIXI-like), without provision of a design by MIRI.
An AIXI can’t even conceptualise that it’s in a box, so it doesn’t matter if it gets its goals wrong. It can be rendered safe by boxing.
Which combination of assumptions is the problem?
I’m not so sure about that… AIXI can learn certain ways of behaving as if it were part of the universe, even with the Cartesian dualism in its code: http://lesswrong.com/lw/8rl/would_aixi_protect_itself/
A goal. If the AI becomes superintelligent, then it will develop linguistic competence as needed. But I see no way of coding it so that that competence is reflected in its motivation (and it’s not from lack of searching for ways of doing that).
So is it safe to run AIXI approximations in boxes today?
By code it, do you mean “code, train, or evolve it”?
Note that we don’t know much about coding higher-level goals in general.
Note that “get things right except where X is concerned” is more complex than “get things right”. Humans do the former because of bias. The less anthropic nature of an AI might be to our advantage.
IMHO, yes. The computational complexity of AIXItl is such that it can’t be used for anything significant on modern hardware.