Suppose we want to program an AI to represent the interests of a group. The standard utilitarian solution is to give the AI a utility function that is an average of the utility functions of the individuals in the group, but that runs into the problem of interpersonal comparison of utility. (Was there ever a post about this? Does Eliezer have a preferred approach?)
Here’s my idea for how to solve this. Create N AIs, one for each individual in the group, and program it with the utility function of that individual. Then set a time in the future when one of those AIs will be randomly selected and allowed to take over the universe. In the meantime, the N AIs are to negotiate amongst themselves and, if necessary, are given help to enforce their agreements.
The advantages of this approach are:
AIs will need to know how to negotiate with each other anyway, so we can build on top of that “for free”.
There seems little question that the scheme is fair, since everyone is given an equal amount of bargaining power.
Unless you can directly extract a sincere and accurate utility function from the participants’ brains, this is vulnerable to exaggeration in the AI programming. Say my optimal amount of X is 6. I could program my AI to want 12 of X, but be willing to back off to 6 in exchange for concessions regarding Y from other AIs that don’t want much X.
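To make the exaggeration worry concrete, here is a minimal sketch. It assumes a hypothetical "split the difference" negotiation rule and a single good X, and it ignores the concessions on Y mentioned above; none of these details come from the proposal itself.

```python
def settle(declared_a: float, declared_b: float) -> float:
    """Hypothetical negotiation rule: meet halfway between the declared optima."""
    return (declared_a + declared_b) / 2


def loss(amount: float, true_optimum: float = 6.0) -> float:
    """Assumed single-peaked preference: distance from the true optimum (lower is better)."""
    return abs(amount - true_optimum)


other_declared = 0.0  # assumed: the other AI wants no X at all

honest_outcome = settle(6.0, other_declared)        # my AI reports my true optimum
exaggerated_outcome = settle(12.0, other_declared)  # my AI is programmed to want 12

honest_loss = loss(honest_outcome)
exaggerated_loss = loss(exaggerated_outcome)

print(honest_outcome, honest_loss)            # 3.0 3.0
print(exaggerated_outcome, exaggerated_loss)  # 6.0 0.0
```

Under this toy rule, the exaggerated programming lands exactly on the true optimum of 6, while honest programming settles at 3.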
This does not seem to be the case when the AIs are unable to read each other’s minds. Your AI can be expected to lie to others with more tactical effectiveness than you can lie indirectly via deceiving it. Even in that case it would be better to let the AI rewrite itself for you.
On a similar note, being able to directly extract a sincere and accurate utility function from the participants’ brains leaves the system vulnerable to exploitation. Individuals are able to rewrite their own preferences strategically in much the same way that an AI can. Future-me may not be happy, but present-me got what he wants, and I don’t (necessarily) have to care about future-me.
I had also mentioned this in an earlier comment on another thread. It turns out that this is a standard concern in bargaining theory. See section 11.2 of this review paper.
So, yeah, it’s a problem, but it has to be solved anyway in order for AIs to negotiate with each other.
Create N AIs, one for each individual in the group, and program it with the utility function of that individual. [...] everyone is given an equal amount of bargaining power.
Do you think the more powerful group members are going to agree to that?!? They worked hard for their power and status—and are hardly likely to agree to their assets being ripped away from them in this way. Surely they will ridicule your scheme, and fight against it being implemented.
The main idea I wanted to introduce in that comment was the idea of using (supervised) bargaining to aggregate individual preferences. Bargaining power (or more generally, weighting of individual preferences) is a mostly orthogonal issue. If equal bargaining power turns out to be impractical and/or immoral, then some other distribution of bargaining power can be used.
Why not use virtual agents, which are given only a safe interface to negotiate with each other over, and no physical powers, and are monitored by a meta-AI that prevents them from trying to game the system, fool each other, etc.? This would avoid having wars between superintelligences in the real physical universe.
I think that’s what I implied: there is a supervisor process that governs the negotiation process and eventually picks a random AI to be released into the real world.
What exactly is “equal bargaining power” is vague. If you “instantiate” multiple AIs, their “bargaining power” may well depend on their “positions” relative to each other, the particular values in each of them, etc.
Then set a time in the future when one of those AIs will be randomly selected and allowed to take over the universe.
Why this requirement? A cooperation of AIs might as well be one AI. Cooperation between AIs is just a special case of operation of each AI in the environment, and where you draw the boundary between AI and environment is largely arbitrary.
The idea is that the status quo (i.e., the outcome if the AIs fail to cooperate) is N possible worlds of equal probability, each shaped according to the values of one individual/AI. The AIs would negotiate from this starting point and improve upon it. If all the AIs cooperate (which I presume would be the case), then which AI gets randomly selected to take over the world won’t make any difference.
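A minimal sketch of this status quo and of what "improving upon it" could mean. The world labels, the utility numbers, and the Pareto test are illustrative assumptions, not part of the proposal.

```python
from typing import Dict, List

Utility = Dict[str, float]  # maps a world label to one agent's utility for it


def status_quo_utilities(utilities: List[Utility], favorite: List[str]) -> List[float]:
    """Each agent's expected utility if no deal is reached.

    Assumption: with no agreement, agent i's favorite world is realized
    with probability 1/N (the random-selection fallback).
    """
    n = len(utilities)
    return [sum(u[w] for w in favorite) / n for u in utilities]


def is_pareto_improvement(utilities: List[Utility], favorite: List[str], proposal: str) -> bool:
    """True if nobody does worse than under the lottery and somebody does better."""
    baseline = status_quo_utilities(utilities, favorite)
    offered = [u[proposal] for u in utilities]
    nobody_loses = all(o >= b for o, b in zip(offered, baseline))
    somebody_gains = any(o > b for o, b in zip(offered, baseline))
    return nobody_loses and somebody_gains


# Hypothetical two-agent example: "A" is agent 0's favorite world,
# "B" is agent 1's favorite, and "C" is a negotiated compromise.
utilities = [
    {"A": 1.0, "B": 0.0, "C": 0.7},
    {"A": 0.0, "B": 1.0, "C": 0.8},
]
favorite = ["A", "B"]

baseline = status_quo_utilities(utilities, favorite)
compromise_ok = is_pareto_improvement(utilities, favorite, "C")
dictatorship_ok = is_pareto_improvement(utilities, favorite, "A")
```

With these made-up numbers, the lottery is worth 0.5 to each agent, the compromise "C" beats it for both, and simply handing the world to agent 0 does not, because agent 1 would fall below its lottery value.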
What exactly is “equal bargaining power” is vague. If you “instantiate” multiple AIs, their “bargaining power” may well depend on their “positions” relative to each other, the particular values in each of them, etc.
In this case the AIs start from an equal position, but you’re right that their values might also figure into bargaining power. I think this is related to a point Eliezer made in the comment I linked to: a delegate may “threaten to adopt an extremely negative policy in order to gain negotiating leverage over other delegates.” So if your values make you vulnerable to this kind of threat, then you might have less bargaining power than others. Is this what you had in mind?
Letting a bunch of AIs with given values resolve their disagreement is not the best way to merge values, just like letting humanity go on as it is is not the best way to preserve human values. As extraction of preference shouldn’t depend on the actual “power” or even stability of the given system, merging of preference could also possibly be done directly and more fairly when specific implementations and their “bargaining power” are abstracted away. Such implementation-independent composition/interaction of preference may turn out to be a central idea for the structure of preference.
There seems to be a bootstrapping problem: In order to figure out what the precise statement is that human preference makes, we need to know how to combine preferences from different systems; in order to know how preferences should combine, we need to know what human preference says about this.
If we already have a given preference, it will only retell itself as an answer to the query “What preference should result [from combining A and B]?”, so that’s not how the game is played. “What’s a fair way of combining A and B?” may be more like it, but of questionable relevance. For now, I’m focusing on getting a better idea of what kind of mathematical structure preference should be, rather than on how to point to the particular object representing the given imperfect agent.
For now, I’m focusing on getting a better idea of what kind of mathematical structure preference should be
What is/are your approach(es) for attacking this problem, if you don’t mind sharing?
In my UDT1 post I suggested that the mathematical structure of preference could be an ordering on all possible (vectors of) execution histories of all possible computations. This seems general enough to represent any conceivable kind of preference (except preferences about uncomputable universes), but also appears rather useless for answering the question of how preferences should be merged.
For now, I’m focusing on getting a better idea of what kind of mathematical structure preference should be
What is/are your approach(es) for attacking this problem, if you don’t mind sharing?
Since I don’t have self-contained results, I can’t describe what I’m searching for concisely, and the working hypotheses and hunches are too messy to summarize in a blog comment. I’ll give some of the motivations I found towards the end of the current blog sequence, and possibly will elaborate in the next one if the ideas sufficiently mature.
In my UDT1 post I suggested that the mathematical structure of preference could be an ordering on all possible (vectors of) execution histories of all possible computations. This seems general enough to represent any conceivable kind of preference (except preferences about uncomputable universes), but also appears rather useless for answering the question of how preferences should be merged.
Yes, this is not very helpful. Consider the question: what is the difference between (1) preference, (2) the strategy that the agent will follow, and (3) the whole of the agent’s algorithm? Histories of the universe could play a role in semantics of (1), but they are problematic in principle, because we don’t know, nor will ever know with certainty, the true laws of the universe. And what we really want is to get to (3), not (1), but with a good understanding of (1) so that we know (3) to be based on our (1).
I’ll give some of the motivations I found towards the end of the current blog sequence, and possibly will elaborate in the next one if the ideas sufficiently mature.
Thanks. I look forward to that.
Histories of the universe could play a role in semantics of (1), but they are problematic in principle, because we don’t know, nor will ever know with certainty, the true laws of the universe.
I don’t understand what you mean here, and I think maybe you misunderstood something I said earlier. Here’s what I wrote in the UDT1 post:
More generally, we can always represent your preferences as a utility function on vectors of the form ⟨E1, E2, E3, …⟩ where E1 is an execution history of P1, E2 is an execution history of P2, and so on.
(Note that of course this utility function has to be represented in a compressed/connotational form, otherwise it would be infinite in size.) If we consider the multiverse to be the execution of all possible programs, there is no uncertainty about the laws of the multiverse. There is uncertainty about “which universes, i.e., programs, we’re in”, but that’s a problem we already have a handle on, I think.
So, I don’t know what you’re referring to by “true laws of the universe”, and I can’t find an interpretation of it where your quoted statement makes sense to me.
If we consider the multiverse to be the execution of all possible programs, there is no uncertainty about the laws of the multiverse.
I don’t believe that directly posing this “hypothesis” is a meaningful way to go, although the computational paradigm can find its way into the description of the environment for an AI that, in its initial implementation, works from within a digital computer.
Here is a revised way of asking the question I had in mind: If our preferences determine which extraction method is the correct one (the one that results in our actual preferences), and if we cannot know or use our preferences with precision until they are extracted, then how can we find the correct extraction method?
Asking it this way, I’m no longer sure it is a real problem. I can imagine that knowing what kind of object preference is would clarify what properties a correct extraction method needs to have.
Going meta and using the (potentially) available data, such as humans in the form of uploads, is a step made in an attempt to minimize the amount of data (given explicitly by the programmers) to the process that reconstructs human preference. Sure, it’s a bet (there are no universal preference-extraction methods that interpret every agent in a way it’d prefer to do itself, so we have to make a good enough guess), but there seems to be no other way to have a chance at preserving current preference. Also, there may turn out to be a good means of verifying that the solution given by a particular preference-extraction procedure is the right one.
So you know how to divide the pie? There is no interpersonal “best way” to resolve directly conflicting values. (This is further than Eliezer went.) Sure, “divide equally” makes a big dent in the problem, but I find it much more likely any given AI will be a Zaire than a Yancy. As a simple case, say AI1 values X at 1, and AI2 values Y at 1, and X+Y must, empirically, equal 1. I mean, there are plenty of cases where there’s more overlap and orthogonal values, but this kind of conflict is unavoidable between any reasonably complex utility functions.
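The simple case can be worked through in a few lines. The sketch assumes utilities linear in each AI's share and a no-deal fallback in which a fair coin awards the whole pie to one AI; the grid of candidate splits is only for illustration.

```python
from typing import List, Tuple

# Assumed fallback: each AI gets the whole pie (utility 1) with probability 1/2.
LOTTERY_BASELINE = 0.5


def gains_over_lottery(x: float) -> Tuple[float, float]:
    """Gain of (AI1, AI2) over the lottery when X = x and Y = 1 - x."""
    y = 1.0 - x
    return (x - LOTTERY_BASELINE, y - LOTTERY_BASELINE)


def acceptable_splits(steps: int = 10) -> List[float]:
    """Splits on a coarse grid that leave neither AI worse off than the lottery."""
    grid = [i / steps for i in range(steps + 1)]
    return [x for x in grid if min(gains_over_lottery(x)) >= 0.0]


splits = acceptable_splits()
print(splits)  # [0.5]
```

Because one AI's gain is exactly the other's loss here, the equal split is the only division both would accept, and it is no better for either than the lottery in expectation. Negotiation adds value only where the utility functions are not in pure conflict.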
There is no interpersonal “best way” to resolve directly conflicting values.
I’m not suggesting an “interpersonal” way (as in, by a philosopher of perfect emptiness). The possibilities open for the search of “off-line” resolution of conflict (with abstract transformation of preference) are wider than those for the “on-line” method (with AIs fighting/arguing it over) and so the “best” option, for any given criterion of “best”, is going to be better in the “off-line” case.
Letting a bunch of AIs with given values resolve their disagreement is not the best way to merge values
[Edited] I agree that it is probably not the best way. Still, the idea of merging values by letting a bunch of AIs with given values resolve their disagreement seems better than previous proposed solutions, and perhaps gives a clue to what the real solution looks like.
BTW, I have a possible solution to the AI-extortion problem mentioned by Eliezer. We can set a lower bound for each delegate’s utility function at the status quo outcome (N possible worlds with equal probability, each shaped according to one individual’s utility function). Then any threats to cause an “extremely negative” outcome will be ineffective, since the “extremely negative” outcome will have utility equal to the status quo outcome.
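A minimal sketch of that lower bound, with made-up outcome labels and utility values. The helper name `with_status_quo_floor` and the "leverage" measure (how much worse the threatened outcome is than the status quo) are assumptions introduced for illustration.

```python
from typing import Callable, Dict


def with_status_quo_floor(raw: Callable[[str], float],
                          status_quo_value: float) -> Callable[[str], float]:
    """Wrap a utility function so no outcome is valued below the status quo."""
    def bounded(outcome: str) -> float:
        return max(raw(outcome), status_quo_value)
    return bounded


# Hypothetical utilities for one delegate.
raw_utilities: Dict[str, float] = {
    "status_quo": 0.5,
    "compromise": 0.8,
    "threatened_outcome": -1000.0,
}
raw = raw_utilities.__getitem__
bounded = with_status_quo_floor(raw, raw_utilities["status_quo"])

# Leverage of a threat: how much worse than the status quo it is for the victim.
raw_leverage = raw("status_quo") - raw("threatened_outcome")
bounded_leverage = bounded("status_quo") - bounded("threatened_outcome")

print(raw_leverage, bounded_leverage)  # 1000.5 0.0
```

With the floor in place, the threatened outcome is valued the same as the status quo, so the threat gives no leverage, while outcomes above the status quo keep their original values.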
Comments?
ETA: I found a very similar idea mentioned before by Eliezer.
ok, just checking you weren’t advocating a free-for-all.