In practice, it seems to me that what Alice is concerned with are the social (signaling) implications of Bob gaining knowledge of both the bonus and of the possibility of recommendation.
I am assuming that Alice, on reflection, decides that she wants to give Bob the higher bonus even if nobody else ever learned that she had the opportunity to recommend him for the project, the way I would not want to steal food from a starving person even if nobody ever found out about it.
The concern I’m replying to is that decision theory assumes your preferences can be described by a binary “is preferred to” relation, but humans might choose option A if the available options are A and B, and option B if the available options are A, B and C, so how do you model that as a binary relation? I actually don’t recall seeing this raised in the context of VNM utility theory, but I believe I’ve seen it in discussions of Arrow’s impossibility theorem, where the Independence of Irrelevant Alternatives axiom (confusingly, not the analog of VNM’s Independence axiom, even though that axiom sometimes goes by the same name) says that adding option C must not flip the decision from A to B.
I’m not particularly bothered, as far as decision theory is concerned, if an experiment can get humans to exhibit such behavior: some human behavior is patently self-defeating, and I don’t think we should require decision theory to explain all our biases as “rational”. But I do want a decision theory that won’t exclude the preferences we would actually want to adopt on reflection, so I either want it to support Alice’s preferences or I want to understand why Alice’s preferences are in fact irrational.
In fact, deciding that some kinds of preferences should be outlawed as irrational can be dangerous: [...]
Maybe I’m just being slow right now, but I can’t figure out what this has to do with the discussion preceding it.
It’s like this: Caring about the set of options you were able to choose between seems like a bad idea to me; I’m skeptical that preferences like Alice’s are what I would want to adopt, on reflection. I might be tempted to simply say that they’re obviously irrational, so it’s no problem if decision theory doesn’t cater to them. But caring about the algorithm your AI runs also seems like a bad idea, and by similar intuitions I might have been willing to accept a decision theory that would outlaw such preferences—which, as it turns out, would not be good.
The point is that we’re asking what it means to have a consistent direction in which you are trying to steer the future, and it doesn’t look like our AIs are on the same bearing.
I either don’t understand or disagree with this. In the situation you describe it sounds to me like the two AIs will make different decisions in practice for game-theoretic reasons, but I don’t see why one would suspect that they are trying to steer the future in different directions.
Let’s suppose that both AIs have the following preferences: Most importantly, they don’t want Earth blown up. However, if they are able to blow up Earth no later than one month from now, they would like to maximize the expected number of paperclips in the universe; if they aren’t able to, they want to maximize the expected number of staples. Now, if in two months a freak accident wipes out Alice’s AI, then the world ends up tiled with paperclips; if it wipes out Carol’s AI, the world ends up tiled with staples. (Unless they made a deal that if either was wiped out, the other would carry on its work, as any paperclip maximizer might do with any staples maximizer—though they might not have a reason to, since they’re not risk-averse.) This does not sound like steering the future in the same direction, to me.
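To make the structure of these preferences explicit, here is a minimal sketch (my own toy formalization in Python, with made-up names, not anything from the post): which quantity gets maximized depends on a historical fact about what the agent was able to do, not just on the final state of the world.

```python
from dataclasses import dataclass

@dataclass
class World:
    earth_blown_up: bool
    paperclips: int
    staples: int

def utility(world: World, could_blow_up_earth_by_month_one: bool) -> float:
    """Toy version of the preferences described above."""
    if world.earth_blown_up:
        return float("-inf")            # most important: Earth must not be blown up
    if could_blow_up_earth_by_month_one:
        return float(world.paperclips)  # had the option: count paperclips
    return float(world.staples)         # lacked the option: count staples
```

Two AIs can share this exact function and still end up tiling the universe with different things, because they plug in different values for the second argument.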
Could you expand on the game-theoretic reasons? My intuition is that from a game theoretic perspective, “steering the future in the same direction” should mean we’re talking about a partnership game, i.e., that both agents will get the same payoff for any strategy profile, and I do not see why this would lead to reasons to “make different decisions in practice”.
The concern I’m replying to is that decision theory assumes your preferences can be described by a binary “is preferred to” relation, but humans might choose option A if the available options are A and B, and option B if the available options are A, B and C, so how do you model that as a binary relation?
Oh. I still do not think the example you gave illustrates this concern. One interpretation of the situation is that Alice gains new knowledge in the scenario. The existence of a new project suited to Bob’s talents increases Alice’s assessment of Bob’s value. More generally, it’s reasonable for an agent’s preferences to change as its knowledge changes.
In response to this objection, I think you only need to assume that deciding between A and B and C is equivalent to deciding between A and (B and C) and also equivalent to deciding between (A and B) and C, together with the assumption that your agent is capable of consistently assigning preferences to “composite choices” like (A and B).
Caring about the set of options you were able to choose between seems like a bad idea to me; I’m skeptical that preferences like Alice’s are what I would want to adopt, on reflection. I might be tempted to simply say that they’re obviously irrational, so it’s no problem if decision theory doesn’t cater to them. But caring about the algorithm your AI runs also seems like a bad idea, and by similar intuitions I might have been willing to accept a decision theory that would outlaw such preferences—which, as it turns out, would not be good.
Are you claiming that these two situations are analogous or only claiming that they are two examples of caring about whether decision theory should allow certain kinds of preferences? That’s one of the things I was confused about (because I can’t see the analogy but your writing suggests that one exists). Also, where does your intuition that it is a bad idea to care about the algorithm your AI runs come from? It seems like an obviously good idea to care about the algorithm your AI runs to me.
Could you expand on the game-theoretic reasons? My intuition is that from a game theoretic perspective, “steering the future in the same direction” should mean we’re talking about a partnership game, i.e., that both agents will get the same payoff for any strategy profile, and I do not see why this would lead to reasons to “make different decisions in practice”.
I guess that depends on what “same” means. If you instantiate two AIs that are running identical algorithms but both AIs are explicitly trying to monopolize all of the resources on the planet, then they’re playing a zero-sum game but there’s a reasonable sense in which they are trying to steer the future in the “same” direction (namely that they are running identical algorithms).
If this isn’t a reasonable notion of sameness because the algorithm involves reference to thisAgent and the referent of this pointer changes depending on who’s instantiating the algorithm, then the preferences you’ve described are also not the same preferences because they also refer to thisAgent. If the preferences are modified to say “if an agent running thisAlgorithm has access to foo,” then as far as I can tell the two AIs you describe should behave as if they are the same agent.
Thanks for the feedback! It’s possible that I’m just misreading your words to match my picture of the world, but it sounds to me as if we’re not disagreeing too much; rather, I failed to get my point across in the post. Specifically:
If this isn’t a reasonable notion of sameness because the algorithm involves reference to thisAgent and the referent of this pointer changes depending on who’s instantiating the algorithm, then the preferences you’ve described are also not the same preferences because they also refer to thisAgent. If the preferences are modified to say “if an agent running thisAlgorithm has access to foo,” then as far as I can tell the two AIs you describe should behave as if they are the same agent.
I am saying that I think a “direction for steering the future” should not depend on a global thisAgent variable. To make the earlier example even more blatant, I don’t think it’s useful to call “If thisAgent = Alice’s AI, maximize paperclips; if thisAgent = Carol’s AI, maximize staples” a coherent direction; I’d call it a function that returns a coherent direction. Whether or not the concept I’m trying to define is the best meaning for “same direction” is of course only a definitional debate and not that interesting, but I think it’s a useful concept.
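As a toy illustration of the distinction (again just a sketch of my own, with hypothetical names): a coherent direction would be a single utility function over outcomes, while the thisAgent-dependent specification above is a function that hands each agent its own direction.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]                 # e.g. {"paperclips": 3, "staples": 0}
Direction = Callable[[Outcome], float]   # a coherent direction: one utility function over outcomes

def maximize_paperclips(outcome: Outcome) -> float:
    return float(outcome.get("paperclips", 0))

def maximize_staples(outcome: Outcome) -> float:
    return float(outcome.get("staples", 0))

def preference_spec(this_agent: str) -> Direction:
    """Not itself a direction: it returns a different direction for each agent."""
    if this_agent == "Alice's AI":
        return maximize_paperclips
    if this_agent == "Carol's AI":
        return maximize_staples
    raise ValueError(f"unknown agent: {this_agent}")
```

On the definition I’m proposing, maximize_paperclips and maximize_staples are each a coherent direction; preference_spec is not.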
I agree that the most obvious formalization of Alice’s preferences would depend on thisAgent. So I’m saying that there actually is a nontrivial restriction on her preferences: If she wants to keep something like her informal formulation, she will need to decide what they are supposed to mean in terms that do not refer to thisAgent. They may simply refer to “Alice”, but then the AI is influenced only by what Alice was able to do, not by what the AI was able to do, and Alice will have to decide whether that is what she wants.
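Concretely, the two readings might look like this (a sketch under assumptions I am making up purely for illustration, e.g. that Alice had the recommendation option and her AI, acting later, did not):

```python
# Hypothetical facts, purely for illustration.
had_recommendation_option = {"Alice": True, "Alice's AI": False}

def utility_indexical(gave_higher_bonus: bool, this_agent: str) -> float:
    """Indexical version: cares about what *thisAgent* was able to do."""
    if had_recommendation_option[this_agent] and not gave_higher_bonus:
        return -1.0   # bad: the bonus was withheld even though the option existed
    return 0.0

def utility_alice_referring(gave_higher_bonus: bool) -> float:
    """De-indexicalized version: cares only about what *Alice* was able to do."""
    if had_recommendation_option["Alice"] and not gave_higher_bonus:
        return -1.0
    return 0.0

# Under the first version, Alice and her AI can evaluate the same outcome differently;
# under the second, there is nothing agent-dependent left to disagree about.
assert utility_indexical(False, "Alice") != utility_indexical(False, "Alice's AI")
```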
Oh. I still do not think the example you gave illustrates this concern. One interpretation of the situation is that Alice gains new knowledge in the scenario. The existence of a new project suited to Bob’s talents increases Alice’s assessment of Bob’s value. More generally, it’s reasonable for an agent’s preferences to change as its knowledge changes.
But how could you come up with a pair of situations such that in situation (i), the agent can choose between options A and B, while in situation (ii), the agent can choose between A, B and C, and yet the agent has exactly the same information in situations (i) and (ii)? So under your rules, how could any example illustrate the concern?
I do agree that it’s reasonable for Alice to choose a different option because the knowledge she has is different—that’s my resolution to the problem.
In response to this objection, I think you only need to assume that deciding between A and B and C is equivalent to deciding between A and (B and C) and also equivalent to deciding between (A and B) and C, together with the assumption that your agent is capable of consistently assigning preferences to “composite choices” like (A and B).
Sorry, I do not understand—what do you mean by your composite choices? What does it mean to choose (A and B) when A and B are mutually exclusive options?
Are you claiming that these two situations are analogous or only claiming that they are two examples of caring about whether decision theory should allow certain kinds of preferences? That’s one of the things I was confused about (because I can’t see the analogy but your writing suggests that one exists).
I’m claiming they are both examples of preferences you might think you could outlaw as irrational, so you might think it’s ok to use a decision theory that doesn’t allow for such preferences. In one of the two cases, it’s clearly not ok, which suggests we shouldn’t be too quick to decide it’s ok in the other case.
Also, where does your intuition that it is a bad idea to care about the algorithm your AI runs come from? It seems like an obviously good idea to care about the algorithm your AI runs to me.
Could it be that it’s not clear enough that I’m talking about terminal values, not instrumental values?
Maybe it’s not right to say that it seems like a bad idea; it’s more that it would seem at first that people just don’t have terminal preferences about the algorithm being run (or at least not strong ones—you might derive enjoyment from an elegant algorithm, but that wouldn’t outweigh your desire to save lives, so your instrumental preference for a well-working algorithm would always dominate your terminal preference for enjoying an elegant algorithm, if the two came into conflict). So at first it might seem reasonable to design a decision theory where you are not allowed to care about the algorithm your AI is running—I find it at least conceivable that when trying to prove theorems about self-modifying AI, making such an assumption might simplify things, so this does seem like a conceivable failure mode to me.
I agree that the most obvious formalization of Alice’s preferences would depend on thisAgent. So I’m saying that there actually is a nontrivial restriction on her preferences: If she wants to keep something like her informal formulation, she will need to decide what they are supposed to mean in terms that do not refer to thisAgent.
Got it. I think.
But how could you come up with a pair of situations such that in situation (i), the agent can choose between options A and B, while in situation (ii), the agent can choose between A, B and C, and yet the agent has exactly the same information in situations (i) and (ii)?
In situation (i), Alice can choose between chocolate and vanilla ice cream. In situation (ii), Alice can choose between chocolate, vanilla, and strawberry ice cream. Having access to these options doesn’t change Alice’s knowledge about her preferences for ice cream flavors (under the assumption that access to flavors on a given day doesn’t reflect some kind of global shortage of a flavor). In general it might help to have Alice’s choices randomly determined, so that Alice’s knowledge of her choices doesn’t give her information about anything else.
Sorry, I do not understand—what do you mean by your composite choices? What does it mean to choose (A and B) when A and B are mutually exclusive options?
Sorry, I should probably have used “or” instead of “and.” If A and B are the primitive choices “chocolate ice cream” and “vanilla ice cream,” then the composite choice (A or B) is “the opportunity to choose between chocolate and vanilla ice cream.” The point is that once you allow a decision theory to assign preferences among composite choices, then composition of choices is associative, so preferences among an arbitrary number of primitive choices are determined by preferences among pairs of primitive choices.
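Here is a tiny sketch of what I mean, under the toy assumption that a composite choice is just the set of options it offers: “or”-composition is then associative, so a three-way choice can be written as nested two-way choices between (possibly composite) options.

```python
# Primitive choices, represented as one-element option sets.
A = frozenset({"chocolate"})
B = frozenset({"vanilla"})
C = frozenset({"strawberry"})

def compose(x: frozenset, y: frozenset) -> frozenset:
    """(x or y): the opportunity to pick anything offered by x or by y."""
    return x | y

# Associativity: (A or B) or C offers exactly the same options as A or (B or C),
# and both are just the three-way choice among chocolate, vanilla, and strawberry.
assert compose(compose(A, B), C) == compose(A, compose(B, C)) == frozenset(
    {"chocolate", "vanilla", "strawberry"}
)
```

So once the decision theory is allowed to rank composites like (B or C) against primitives like A, the three-way choice reduces to binary comparisons.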
Maybe it’s not right to say that it seems like a bad idea; it’s more that it would seem at first that people just don’t have terminal preferences about the algorithm being run (or at least not strong ones—you might derive enjoyment from an elegant algorithm, but that wouldn’t outweigh your desire to save lives, so your instrumental preference for a well-working algorithm would always dominate your terminal preference for enjoying an elegant algorithm, if the two came into conflict). So at first it might seem reasonable to design a decision theory where you are not allowed to care about the algorithm your AI is running—I find it at least conceivable that when trying to prove theorems about self-modifying AI, making such an assumption might simplify things, so this does seem like a conceivable failure mode to me.
Okay, but it still seems reasonable to have instrumental preferences about algorithms that AIs run, and I don’t see why decision theory is not allowed to talk about instrumental preferences. (Admittedly I don’t know very much about decision theory.)