Different classes of satisfactory initial definitions may fall into different self-consistent attractors for optimal definitions of volition. Or they may all converge to essentially the same endpoint. A CEV might survey the “space” of initial dynamics and self-consistent final dynamics, looking to see if one alternative obviously stands out as best; extrapolating the opinions humane philosophers might have of that space. But if there are multiple, self-consistent, satisficing endpoints, each of them optimal under their own criterion—okay. Whatever. As long as we end up in a Nice Place to Live. And yes, the programmers’ choices may have a huge impact on the ultimate destiny of the human species. Or a bird, chirping in the programmers’ window. Or a science fiction novel, or a few lines spoken by a character in an anime, or a webcomic. Life is chaotic, small things have large effects. So it goes.
I have a question: It seems to me that Friendliness is a function of more than just an AI. To determine whether an AI is Friendly, it would seem necessary to answer the question: Friendly to whom? If that question is unanswered, then “Friendly” seems like an unsaturated function like “2+”. In the LW context, the answer to that question is probably something along the lines of “humanity”. However, wouldn’t a mathematical definition of “humanity” be too complex to let us prove that some particular AI is Friendly to humanity? Even if the answer to “To whom?” is “Eliezer Yudkowsky”, even that seems like it would be a rather complicated proof to say the least.
Any proof will be like assuming that certain laws of aerodynamics and a certain range of conditions hold, and then proving that a given plane design will fly. That, of course, runs into trouble because we don’t know the equivalent of aerodynamics either.
That would seem to be the best possible solution, but I have never heard aeroplane engineers claim that their designs are “provably airworthy”. If you take the aeroplane design approach, then isn’t “provably Friendly” a somewhat misleading claim to make, especially when you’re talking about pushing conditions to the extreme that you yourself admit are beyond your powers of prediction? The aeroplane equivalent would be like designing a plane so powerful that its flight changes the atmospheric conditions of the entire planet, but then the plane uses a complicated assembly of gyroscopes or something to continue flying in a straight line. However, if you yourself cannot predict which specific changes the flight of the plane will make, then how can you claim that you can prove that particular assembly of gyroscopes is sufficient to keep the plane on the preplanned path? On the other hand, if you can prove which specific changes the plane’s flight will make that are relevant to its flight, then you have a mathematical definition of the target atmosphere at a sufficient depth of resolution to design such an assembly. Does MIRI think it can come up with an equivalent mathematical model of humanity with respect to AI?
That’s the reason EY came up with the concept of CEV—Coherent Extrapolated Volition.
The SEP says that preferences cannot be aggregated without additional constraints on how the aggregation is to be done, and the end result changes depending on things like the order of aggregation, so these additional constraints take on the quality of arbitrariness. How does CEV get around that problem?
From the CEV paper:
Which you could sum up as “CEV doesn’t get around that problem, it treats it as irrelevant—the point isn’t to find a particular good solution that’s unique and totally non-arbitrary, it’s just to find even one of the good solutions. If arbitrary reasons shift us from Good World #4 to Good World #36, who cares as long as they both really are good worlds”.
The real difficulty is that when you combine two sets of preferences, each of which makes sense on its own, you can get a set of preferences that makes no sense whatsoever: http://plato.stanford.edu/entries/economics/#5.2 https://www.google.com/search?q=site%3Aplato.stanford.edu+social+choice&ie=utf-8&oe=utf-8
There is no easy way to resolve this problem. There is also no known method that takes such an inconsistent set of preferences as input and gives a consistent set of preferences as output such that either party who contributed an original set of preferences would recognize the output as furthering any of their original goals. Arbitrary decisions are required so often in cases where there isn’t unanimous agreement that, in practice, there would be a large component of arbitrariness every single time CEV tries to arrive at a uniform set of preferences by extrapolating the volitions of multiple agents into the future.
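For anyone who wants to see the aggregation problem concretely, here is a minimal sketch of the textbook Condorcet cycle. It is not taken from the SEP entries or from the CEV paper; the three agents, the options A, B, C, and the function names are all made up for illustration. Each ranking is perfectly consistent on its own, yet pairwise majority voting yields a cycle, and the "winner" depends only on the order in which the options are compared.

```python
from itertools import permutations

# Hypothetical rankings for three agents over options A, B, C (best first).
RANKINGS = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(x, y, rankings):
    """Return True if a strict majority of rankings places x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2


def sequential_winner(agenda, rankings):
    """Run pairwise majority contests in agenda order; the survivor wins."""
    winner = agenda[0]
    for challenger in agenda[1:]:
        if majority_prefers(challenger, winner, rankings):
            winner = challenger
    return winner


# Every individual ranking is transitive, but the majority relation is not:
# this list contains (A, B), (B, C) and (C, A), which form a cycle.
cycle = [(x, y) for x, y in permutations("ABC", 2)
         if majority_prefers(x, y, RANKINGS)]

# The outcome of sequential aggregation for every possible comparison order.
winners = {agenda: sequential_winner(agenda, RANKINGS)
           for agenda in permutations("ABC")}

print(cycle)
print(winners)
```

With these rankings, every option can be made to win by choosing the comparison order, which is the kind of arbitrariness I mean.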
This doesn’t mean the problem is unresolvable, just that it’s an AI problem in its own right, but given these problems, wouldn’t it be better to pick whichever Nice Place to Live is the safest to reach instead of bothering with CEV? I say this because I’m not sure Nice Place to Live can be defined in terms of CEV, as any CEV-approved output. Because of the preference aggregation problem, I’m not certain that a world that is provably CEV-abiding also provably avoids flagrant immorality. Two moral frameworks when aggregated by a non-smart algorithm might give rise to an immoral framework, so I’m not sure the essence of the problem is resolved just by CEV as explained in the paper.
Although what if we told each party to submit goals rather than non-goal preferences? If the AI has access to a model specifying which actions lead to which consequences, then it can search for those actions that maximize the number of goals fulfilled regardless of which party submitted them, or perhaps take a Rawlsian approach and maximize the number of goals fulfilled for whichever party would have the fewest goals fulfilled if that sequence of actions were taken, etc. That seems very imaginable to me. You can then have heuristics that constrain the search space and stuff. You can also have non-goal preferences in addition to goals if the parties have any of those.
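A toy version of the two search rules just described might look like the sketch below. Everything in it is an assumption for illustration: the parties, the goal labels, the three candidate plans, and the idea that the consequence model is a plain lookup table. A real system would need an actual world model and a far larger search space.

```python
# Hypothetical consequence model: each candidate plan maps to the goals it fulfils.
OUTCOMES = {
    "plan_1": {"a1", "a2", "a3", "a4"},
    "plan_2": {"a1", "a2", "b1"},
    "plan_3": {"b1", "b2"},
}

# Hypothetical goals submitted by each party.
GOALS = {
    "alice": {"a1", "a2", "a3", "a4"},
    "bob": {"b1", "b2"},
}


def fulfilled_per_party(plan):
    """Count how many of each party's goals the given plan fulfils."""
    return {party: len(goals & OUTCOMES[plan]) for party, goals in GOALS.items()}


def total_count_choice():
    """Pick the plan fulfilling the most goals, regardless of who submitted them."""
    return max(OUTCOMES, key=lambda p: sum(fulfilled_per_party(p).values()))


def rawlsian_choice():
    """Pick the plan that maximizes the count for the worst-off party."""
    return max(OUTCOMES, key=lambda p: min(fulfilled_per_party(p).values()))


print(total_count_choice())
print(rawlsian_choice())
```

In this made-up data the two rules disagree: the total-count rule picks the plan that fulfils all of one party's goals and none of the other's, while the Rawlsian rule picks the plan that gives each party something. Ties are broken by dictionary order here, which is itself an arbitrary choice of the kind discussed above.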
In that light, it seems to me that the problem was inferring goals from a set of preferences which were not purely non-goal preferences but were actually presented with some unspecified goals in mind. E.g., one party wanted chocolate, but said, “I want to go to the store” instead. If that was the source of the original problem, then we can see why we might need an AI to solve it, since it calls for some lightweight mind reading. Of course, a CEV-implementing AI would have to be a mind reader anyway, since we don’t really know what our goals ultimately are given everything we could know about reality.
This still does not guarantee basic morality, but parties should at least recognize some of their ultimate goals in the end result. They might still grumble about the result not being exactly what they wanted, but we can at least scold them for lacking a spirit of compromise.
All this presupposes that enough of our actions can be reduced to ultimate goals that can be discovered, and I don’t think this process guarantees we will be satisfied with the results. For example, this might erode personal freedom to an unpleasant degree. If we would choose to live in some world X if we were wiser and nicer than we are, then it doesn’t necessarily follow that X is a Nice Place to Live as we are now. Changing ourselves to reach that level of niceness and wisdom might require unacceptably extensive modifications to our actual selves.
My recent paper touches upon preference aggregation a bit in section 8, BTW, though it’s mostly focused on the question of figuring out a single individual’s values. (Not sure how relevant that is for your comments, but thought maybe a little.)
Thanks, I’ll look into it.
(And all my ranting still didn’t address the fundamental difficulty: There is no rational way to choose from among different projections of values held by multiple agents, projections such as Rawlsianism and utilitarianism.)
I think that’s on the list of MIRI open research problems.
Interesting. In that case, would you say an AI that provably implements CEV’s replacement is, for that reason, provably Friendly? That is, AIs implementing CEV’s replacement form an analytical subset of Friendly AIs? What is the current replacement for CEV anyway? Having some technical material would be even better. If it’s open to the public, then I’d like to understand how EY proposes to install a general framework similar to CEV at the “initial dynamic” stage that can predictably generate a provably Friendly AI without explicitly modeling the target of its Friendliness.
There isn’t really one as far as I know; “The Value Learning Problem” discusses some of the questions involved, but seems to mostly be at the point of defining the problem rather than trying to answer it. (This seems appropriate to me; trying to answer the problem at this point seems premature.)
Thanks. That makes sense to me.
I think that’s MIRI’s usage of the term friendly.
He’s not proposing a mechanism as far as I know. That’s another open problem.
See MIRI’s research for details.