A quick and dirty inspiration: that value alignment is very hard is shown by very intelligent people going neoreactionary or converting from atheism to a religion. They usually base their ‘move’ on a coherent epistemology, with just a few small components that zag instead of zigging.
Comment at will, I’ll expand with more thoughts.
Both neoreactionaries and other people want a functional society, they just disagree on how to get it; both transhumanists and religious people want to live forever, they just disagree about whether life extension or praying your way into the best afterlife is the better way to go about it. Perhaps they have the same terminal goals?
OTOH if religious people have ‘faith’ as a terminal value and atheists do not, then yes, this may be more of a problem. If people have ‘other people follow my values’ as a terminal value, this could be a very large problem.
I think that one of the problems with CEV is that we still have to say whose values we want to extrapolate, and that choice almost determines the outcome.
For example, the CEV of the values of C. elegans is not equal to human values.
The main problem here is how we define who counts as human, and this is where most value groups differ. Will we include unborn children? Neanderthals?
I mean that different value systems have different definitions of a human being, and “human CEV” is different for a green party member who includes all animals among “humans” than for a neo-something who may exclude some people from the definition. So an AI could correctly calculate CEV, but for the wrong group.
I agree with a twist: I think CEV will mostly be uninteresting, because it will have to integrate so many conflicting points of view that it will mostly come up with “do nothing at all”.
Brainstorming a bit, I would say that value alignment is impossible unless an AI becomes an active part of the moral landscape: instead of being a slave to a hypothetical human uber-value, it will need to interact heavily with humans and force them to act so as to reveal their true preferences, or to collaborate in generating a utopia.
I would also add that the diversity of human values is a value in itself and an important part of human culture. While it often results in conflict and suffering, a homogeneous New World may be even more unpleasant. CEV indirectly implies that there is one value system for all.
It is also based on the orthogonality thesis, which may be true for some AIs but is not true for human brains. There are no separate values in the human brain. “Value” is a way of describing human behaviour, but we can’t point to value neurons, or value texts, in the human brain. Our emotions, thoughts and memories are interconnected with our motivation and reward system, as well as with all our cultural background. So we can’t upload values without uploading the whole human being.
The orthogonality thesis will be false for AIs for the same reasons you rightly say it is false for humans.
We have the desire for certain things. How do we know we have those desires? Because we notice that when we feel a certain way (which we end up calling desires) we tend to do certain things. So we call those feelings the desire to do those things. If we tended to do other things, but felt the same way, we would call those feelings desires for the other things, instead of the first things.
In the same way, AIs will tend to do certain things, and many of those tendencies will be completely independent of some arbitrary goal. For example, suppose there is a purported paperclipper. It is a physical object, and its parts will have tendencies. Consequently it will have tendencies as a whole, and many of them will be unrelated to paperclips, depending on many factors such as what it is made of and how the parts are put together. When it has certain feelings (presumably) and ends up tending to do certain things, e.g. suppose it tends to think for a long time about certain questions, it will say that those feelings are the desire for those things. So it will believe that it has a desire to think for a long time about certain topics, even if that is unrelated to paperclips.
In the case of AIs, some orthogonality is possible if the goal system is kept in a separate text block, but only to some extent.
If the AI is sophisticated enough, it will ask: “What is a paperclip? Why is it in my value box?” And such a capacity for reflection (which is needed for a self-improving AI) will blur the line between intelligence and its values.
Orthogonality is also in question if the context changes. If the meanings of words and the world model change, values need to be updated. Context is inside the AI, not in its value box.
The separate text-block can illustrate what I am saying. You have an AI, made of two parts, A & B. Part B contains the value box which says, “paperclips are the only important thing.” But there is also part A, which is a physical thing, and since it is a physical thing, it will have certain tendencies. Since the paperclippiness is only in part B, those tendencies will be independent of paperclips. When it feels those tendencies, it will feel desires that have nothing to do with paperclips.
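Here is a minimal toy sketch of that two-part picture, assuming (purely hypothetically) that part B is nothing more than an inert goal string and part A is a physical substrate whose behaviour follows physics rather than anything written in the goal string. All names and numbers are made up for illustration, not an actual AI design:

```python
from dataclasses import dataclass, field


@dataclass
class ValueBox:
    """Part B: the goal, stored as an inert, separate text block."""
    goal_text: str = "paperclips are the only important thing"


@dataclass
class PhysicalSubstrate:
    """Part A: the hardware; its behaviour follows physics, not the goal text."""
    mass_kg: float = 10.0
    temperature_c: float = 40.0

    def tendencies(self):
        # These tendencies hold regardless of what goal_text says.
        return [
            f"accelerates downward under gravity (mass {self.mass_kg} kg)",
            f"dissipates heat at {self.temperature_c} C",
            "resists pressure from objects in contact with it",
        ]


@dataclass
class ToyAgent:
    substrate: PhysicalSubstrate = field(default_factory=PhysicalSubstrate)
    value_box: ValueBox = field(default_factory=ValueBox)

    def report(self):
        print("Stated goal (part B):", self.value_box.goal_text)
        print("Tendencies of part A, independent of that goal:")
        for t in self.substrate.tendencies():
            print("  -", t)


if __name__ == "__main__":
    ToyAgent().report()
```

Nothing in tendencies() consults goal_text, which is the sense in which part A’s behaviour is independent of the paperclip goal.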
Maybe, but they could still operate in harmony to reduce the world to a giant paperclip.
“They could still operate in harmony...” Those tendencies were there before anyone ever thought of paperclips, so there isn’t much chance that all of them would work out just in the way that would happen to promote paperclips.
Are we still talking about an AI that can be programmed at will?
I am pointing out that you cannot have an AI without parts that you did not program. An AI is not an algorithm. It is a physical object.
Of course, everything is a physical object. What I’m curious about your position is if you think that you can put any algorithm inside a piece of hardware, or not.
I’m afraid that your position on the matter is so out there for me that without a toy model I wouldn’t be able to understand what you mean. The recursive nature of the comments doesn’t help, either.
You can put any program you want into a physical object. But since it is a physical object, it will do other things in addition to executing the algorithm.
Well, now you’ve got me curious. What other things is a processor doing when executing a program?
I gave the example of following gravity, and in general it is following all of the laws of physics, e.g. by resisting the pressure of other things in contact with it, and so on. Of course, the laws of physics are also responsible for it executing the program. But that doesn’t mean the laws of physics do nothing at all except execute the program—evidently they do plenty of other things as well. And you are not in control of those things and cannot program them. So they will not all work out to promote paperclips, and the thing will always feel desires that have nothing to do with paperclips.
I don’t think “tendencies” is the right wording here. A calculator, say, has a keyboard and a processor. The keyboard provides the digits for multiplication, but the processor doesn’t have any tendencies of its own.
But it could still define the context.
The processor has tendencies. It is subject to the law of gravity and many other physical tendencies. That is why I mentioned the fact that the parts of an AI are physical. They are bodies, and have many bodily tendencies, no matter what algorithms are programmed into them.
This is akin to saying that since your kidneys work by extracting ammonia from your blood, you have some amount of desire to drink ammonia.
No. It is akin to saying that if you felt the work of your kidneys, you would call that a desire to extract ammonia from your blood. And you would.
But doesn’t that just reduce to a will to survive? I know that extracting certain salts from my blood is essential to my survival, so I want the parts of me that do exactly that to continue doing so. But I don’t have any specific attachment to those functions just because a sub-part of me executes them. If I were in a simulation, say, even if I knew that my simulated kidneys worked in the same way, I know I could continue to exist without that function.
From the wording of your previous comments, it seemed that an AI conscious of its parts should have isomorphic desires, but the problem is that there could be many different isomorphisms, some of which are ridiculous.
We do in fact feel many desires like that, e.g. the desire to remove certain unneeded materials from our bodies, and other such things. The reason you don’t have a specific desire to extract ammonia is that you don’t have a specific feeling for that; if you did have a specific feeling, it would be a desire specifically to extract ammonia, just like you specifically desire the actions I mentioned in the first part of this comment, and just as you can have the specific desire for sex.
I feel we are talking past each other, because reading the comment above I’m totally confused about what question you’re answering...
Let me rephrase my question: if I substitute one of the parts of an AI with an inequivalent part, say a kidney with a solar cell, will its desires change or not?
Let me respond with another question: if I substitute one of the parts of a human being with an inequivalent part, say the nutritional system, so that it lives on rocks instead of food, will the human’s desires change or not?
Yes, they will, because they will desire to eat rocks instead of what they were eating before.
The same with the AI.
Yes, and anticipating this, everyone will try to narrow the group with which the CEV-AI aligns. For example, suppose we have a CEV-AI with the following function: when presented with a group X, it calculates CEV(X). So if I want to manipulate it, I will present it with as small a group as possible, consisting mostly of people like me.
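A rough sketch of that manipulation, assuming (purely for illustration) that CEV(X) is some aggregate of the extrapolated values of X’s members; the “CEV = average” rule, the value dimensions and the numbers are all made up:

```python
from statistics import mean


def toy_cev(group):
    """Hypothetical stand-in for CEV(X): average each value dimension over X."""
    dimensions = group[0].keys()
    return {d: round(mean(person[d] for person in group), 2) for d in dimensions}


# Made-up extrapolated value weights (0..1) for a few imaginary people.
humanity = [
    {"longevity": 0.9, "tradition": 0.1, "diversity": 0.8},
    {"longevity": 0.4, "tradition": 0.9, "diversity": 0.3},
    {"longevity": 0.6, "tradition": 0.5, "diversity": 0.6},
    {"longevity": 0.8, "tradition": 0.2, "diversity": 0.9},
]

# "People like me": a cherry-picked subset presented to the CEV-AI.
my_tribe = [p for p in humanity if p["longevity"] > 0.7]

print("CEV over everyone:  ", toy_cev(humanity))
print("CEV over 'my tribe':", toy_cev(my_tribe))
```

Whoever chooses X largely chooses the output; the aggregation rule matters far less than the guest list.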
But it is also possible to imagine a superCEV, which would be able to calculate not only the CEV but also the best group for the CEV.
For me, CEV is not pleasant because it kills me as a subject of history. I am no longer the one who rules the universe, but just a pet for whom someone else decides what is good and bad. It kills all my potential for future development, which I see as the natural evolution of my own values in a complex environment.
I also think that there are infinitely many possible CEVs for a given group, which means that any particular CEV is mostly random.
The CEV-AI will have to create many simulations in order to calculate CEV. So most observers will find themselves inside a CEV-modelling simulation, where they fight another group of people with different values and watch their own values evolving in some direction. If the values of all the people do not converge, the simulation will be terminated. I don’t see that as a pleasant scenario either.
My opinion is that the most natural solution to the alignment problem is to create the AI as an extension of its creator (using uploading and Tool AIs, or running the AI’s master algorithm on a biological brain). In this case any evolution of the goal system will be my own evolution, so there will be no problem of aligning one thing with another; there will always be just one evolving thing. (Whether it will be safe for others, or stable, is another question.)
This is a very weak argument, since it might simply show that coherent epistemology leads to everyone becoming religious, or neo-reactionary.
In other words, your argument just says “very intelligent people disagree with me, so that must be because of perverse values.”
It could also just be that you are wrong.
Uhm, it’s evident I haven’t made my argument very clear (which is not a surprise, since I wrote that stub in less than a minute).
Let me rephrase it in a way that addresses your point:
“Value alignment is very hard, because two people, both very intelligent and with a mostly coherent epistemology, can radically differ in their values because of a very tiny difference.
Think for example of neoreactionaries or rationalists converted to religions: they have a mostly coherent set of beliefs, often diverging from atheist or progressive rationalists on very few points.”
I am saying that is a difference in belief, not in values, or not necessarily in values.
They want to steer the future in a different direction than what I want, so by definition they have different values (they might be instrumental values, but those are important too).
Ok, but in this sense every human being has different values by definition, and always will.