If you allow selection over policies instead of individual decisions, you can be perfectly consistent.
I don’t see how selecting policies instead of actions removes the motivation for independence.
You just have to pick one that you like, and your choice is going to be arbitrary.
Ultimately, it isn’t the policy that you care about; it’s the outcome. So you should pick a policy because you prefer the probability distribution over outcomes you get from implementing it to the distributions you would get from implementing other policies. Since there are many decision problems your policy will be applied to, this constrains the choice of policy quite heavily. To get a policy that reliably picks the actions you judge correct in the situations where you can tell what the correct action is, it will have to make those decisions for the same reasons you judged them correct (or at least for something equivalent to, or approximating, those reasons). So no, the choice of policy is not at all arbitrary.
If you go with EU you are Pascal-muggable.
That is not true. EU maximizers with bounded utility functions reject Pascal’s mugging.
I don’t see how selecting policies instead of actions removes the motivation for independence.
There are two reasons to like independence. First of all, you might like it for philosophical/aesthetic reasons: “these things really should be independent, these really should be irrelevant”. Or you could like it because it prevents you from being money pumped.
When considering policies, money pumping is (almost) no longer an issue, because a policy that allows itself to be money-pumped is (almost) certainly inferior to one that doesn’t. So choosing policies removes one of the motivations for independence, to my mind the important one.
While it’s true that this does not tell you to pay each time to switch the outcomes around in a circle over and over again, it still falls prey to one step of a similar problem. Suppose there are 3 possible outcomes: A, B, and C, and there are 2 possible scenarios: X and Y. In scenario X, you get to choose between A and B. In scenario Y, you can attempt to choose between A and B, and you get what you picked with 50% probability, and outcome C otherwise. In each scenario, this is the only decision you will ever make. Suppose in scenario X, you prefer A over B, but in scenario Y, you prefer (B+C)/2 over (A+C)/2. But suppose you had to pay to pick A in scenario X, and you had to pay to pick (B+C)/2 in scenario Y, and you still make those choices. If Y is twice as likely as X a priori, then you are paying to get a probability distribution over outcomes that you could have gotten for free by picking B given X, and (A+C)/2 given Y. Since each scenario only involves you ever getting to make one decision, picking a policy is equivalent to picking a decision.
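For concreteness, here is a small numerical restatement of the example above. The probabilities come from the text; the payment is left implicit, since the point is only that the two policies induce the same distribution over A, B, and C (a minimal sketch, nothing more):

```python
from collections import Counter
from fractions import Fraction

P_X, P_Y = Fraction(1, 3), Fraction(2, 3)  # Y is twice as likely as X a priori

def outcome_distribution(choice_in_X, choice_in_Y):
    """Overall distribution over outcomes induced by a policy (one choice per scenario)."""
    dist = Counter()
    dist[choice_in_X] += P_X                   # scenario X: you get exactly what you picked
    dist[choice_in_Y] += P_Y * Fraction(1, 2)  # scenario Y: your pick, 50% of the time
    dist["C"] += P_Y * Fraction(1, 2)          # scenario Y: outcome C otherwise
    return dist

paying = outcome_distribution("A", "B")  # pay for A in X, pay for the B-gamble in Y
free   = outcome_distribution("B", "A")  # pick B in X and the A-gamble in Y, for free

print(paying == free)  # True: same distribution over outcomes, but one policy paid for it
```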
Your example is difficult to follow, but I think you are missing the point. If there is only one decision, then its actions can’t be inconsistent. By choosing a policy only once, one that maximizes its desired probability distribution over utility outcomes, it’s not money pumpable, and it’s not inconsistent.
Now by itself it still sucks because we probably don’t want to maximize for the best median future. But it opens up the door to more general policies for making decisions. You no longer have to use expected utility if you want to be consistent. You can choose a tradeoff between expected utility and median utility (see my top level comment), or a different algorithm entirely.
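As a rough illustration of the kind of tradeoff meant here (the candidate policies, their sampled utilities, and the weight alpha below are made-up for the example, not a proposal from this thread):

```python
import statistics

def blended_score(outcomes, alpha=0.5):
    """alpha = 1 recovers expected utility; alpha = 0 recovers median utility."""
    return alpha * statistics.mean(outcomes) + (1 - alpha) * statistics.median(outcomes)

# Hypothetical utility samples from two candidate policies.
policy_outcomes = {
    "risky":  [0, 0, 0, 0, 100],    # high mean, terrible median
    "steady": [9, 10, 10, 11, 12],  # modest mean, solid median
}

best = max(policy_outcomes, key=lambda name: blended_score(policy_outcomes[name]))
print(best)  # "steady" at alpha = 0.5 with these numbers; "risky" as alpha approaches 1
```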
If there is only one decision point in each possible world, then it is impossible to demonstrate inconsistency within a world, but you can still be inconsistent between different possible worlds.
Edit: as V_V pointed out, the VNM framework was designed to handle isolated decisions. So if you think that considering an isolated decision rather than multiple decisions removes the motivation for the independence axiom, then you have misunderstood the motivation for the independence axiom.
So if you think that considering an isolated decision rather than multiple decisions removes the motivation for the independence axiom, then you have misunderstood the motivation for the independence axiom.
I understand the two motivations for the independence axiom, and the practical one (“you can’t be money pumped”) is much more important than the theoretical one (“your system obeys this here philosophically neat understanding of irrelevant information”).
But this is kind of a moot point, because humans don’t have utility functions. And therefore we will have to construct them. And the process of constructing them is almost certainly going to depend on facts about the world, making the construction process almost certainly inconsistent between different possible worlds.
And the process of constructing them is almost certainly going to depend on facts about the world
It shouldn’t. If your preferences among outcomes depend on what options are actually available to you, then I don’t see how you can justify claiming to have preferences among outcomes, as opposed to tendencies to make certain choices.
Then define me a process that takes people’s current mess of preferences, makes these into utility functions, and, respecting bounded rationality, is independent of options available in the real world. Even then, we have the problem that this mess of preferences is highly dependent on real world experiences in the first place.
I don’t see how you can justify claiming to have preferences among outcomes, as opposed to tendencies to make certain choices.
If I always go left at a road, I have a tendency to make certain choices. If I have a full model of the entire universe with labelled outcomes ranked on a utility function, and use it with unbounded rationality to make decisions, I have preferences among outcomes. The extremes are clear.
I feel that a bounded human being with a crude mental model, trying imperfectly to achieve some goal (because of ingrained bad habits, for instance), is better described as having preferences among outcomes. You could argue that they have mere tendencies, but that seems to stretch the term. But in any case, this is a simple linguistic dispute. Real human beings cannot achieve independence.
Then define me a process that takes people’s current mess of preferences, makes these into utility functions, and, respecting bounded rationality, is independent of options available in the real world.
Define me a process with all those properties except the last one. If you can’t do that either, it’s not the last constraint that is to blame for the difficulty.
Even then, we have the problem that this mess of preferences is highly dependent on real world experiences in the first place.
Yes, different agents have different preferences. The same agent shouldn’t have its preferences change when the available outcomes do.
If I have a full model of the entire universe with labelled outcomes ranked on a utility function, and use it with unbounded rationality to make decisions, I have preferences among outcomes.
If you are neutral between .4A+.6C and .4B+.6C, then you don’t have a very good claim to preferring A over B.
Define me a process with all those properties except the last one.
Well, there’s my old idea here: http://lesswrong.com/lw/8qb/cevinspired_models/ . I don’t think it’s particularly good, but it does construct a utility function, and might be doable with good enough models or a WBE. More broadly, there’s the general “figure out human preferences from their decisions and from hypothetical questions and fit a utility function to it”, which we can already do today (see “inverse reinforcement learning”); we just can’t do it well enough, yet, to get something generally safe at the other end.
None of these ideas have independent variants (not technically true; I can think of some independent versions of them, but they’re so ludicrously unsafe in our world that we’d rule them out immediately; thus, this would be a non-independent process).
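For anyone unfamiliar with that approach, here is a minimal sketch of the “fit a utility function to observed choices” step mentioned above, in the spirit of inverse reinforcement learning. The linear-in-features utility, the logistic choice model, and the data are all illustrative assumptions, not a description of how any particular IRL system actually works:

```python
import math
import random

random.seed(0)

def utility(weights, features):
    """Assumed-linear utility over hand-picked outcome features."""
    return sum(w * f for w, f in zip(weights, features))

def fit_utility(choices, n_features, steps=2000, lr=0.05):
    """choices: list of (chosen_features, rejected_features) pairs observed from a person.
    Fits weights by stochastic gradient ascent on a logistic (Luce-style) choice model."""
    w = [0.0] * n_features
    for _ in range(steps):
        chosen, rejected = random.choice(choices)
        diff = utility(w, chosen) - utility(w, rejected)
        p_chosen = 1 / (1 + math.exp(-diff))  # modelled probability of the observed choice
        grad = 1 - p_chosen                   # d/d(diff) of the log-likelihood
        for i in range(n_features):
            w[i] += lr * grad * (chosen[i] - rejected[i])
    return w

# Hypothetical data: this "person" reliably prefers more of feature 0 and ignores feature 1.
data = [((1.0, 0.3), (0.2, 0.9)), ((0.8, 0.1), (0.1, 0.1)), ((0.9, 0.7), (0.4, 0.2))]
print(fit_utility(data, n_features=2))
```

Note that the fitted weights depend entirely on which choice situations happened to be observed, which is exactly the non-independence of the construction process being discussed.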
If you are neutral between .4A+.6C and .4B+.6C, then you don’t have a very good claim to preferring A over B.
?
If I actually do prefer A over B (and my behaviour reflects that in (1-ɛ)A + ɛC versus (1-ɛ)B + ɛC cases), then I have an extremely good claim to preferring A over B, and an extremely poor claim to independence.
I assumed accuracy was implied by “making a mess of preferences into a utility function”.
More broadly, there’s the general “figure out human preferences from their decisions and from hypothetical questions and fit a utility function to it”, which we can already do today (see “inverse reinforcement learning”); we just can’t do it well enough, yet, to get something generally safe at the other end.
I’m somewhat skeptical of that strategy for learning utility functions, because the space of possible outcomes is extremely high-dimensional, and it may be difficult to test extreme outcomes because the humans you’re trying to construct a utility function for might not be able to understand them. But perhaps this objection doesn’t get to the heart of the matter, and I should put it aside for now.
None of these ideas have independent variants
I am admittedly not well-versed in inverse reinforcement learning, but this is a perplexing claim. Except for a few people like you suggesting alternatives, I’ve only ever heard “utility function” used to refer to a function you maximize the expected value of (if you’re trying to handle uncertainty), or a function you just maximize the value of (if you’re not trying to handle uncertainty). In the first case, we have independence. In the second case, the question of whether or not we obey independence doesn’t really make sense. So if inverse reinforcement learning violates independence, then what exactly does it try to fit to human preferences?
If I actually do prefer A over B
Then if the only difference between two gambles is that one might give you A when the other might give you B, you’ll take the one that might give you something you like instead of something you don’t like.
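A quick numerical check of this point, assuming expected-utility maximization and some arbitrary illustrative utility values:

```python
u = {"A": 1.0, "B": 0.3, "C": 0.6}  # hypothetical utilities with u(A) > u(B)

def expected_utility(p, picked, filler):
    """EU of the gamble: `picked` with probability p, `filler` otherwise."""
    return p * u[picked] + (1 - p) * u[filler]

for p in (0.9, 0.5, 0.1):
    diff = expected_utility(p, "A", "C") - expected_utility(p, "B", "C")
    # diff equals p * (u["A"] - u["B"]); mixing in the same C cannot flip its sign
    print(p, diff > 0)
```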
I’ve only ever heard “utility function” used to refer to
To be clear, I am saying the process of constructing the utility function violates independence, not that subsequently maximising it does. Similarly, choosing a median-maximising policy P violates independence, but there is (almost certainly) a utility u such that maximising u is the same as following P.
Once the first choice is made, we have independence in both cases; before it is made, we have it in neither. The philosophical underpinning of independence in single decisions therefore seems very weak.
To be clear, I am saying the process of constructing the utility function violates independence
Feel free to tell me to shut up and learn how inverse reinforcement learning works before bothering you with such questions, if that is appropriate, but I’m not sure what you mean. Can you be more precise about what property you’re saying inverse reinforcement learning doesn’t have?
Inverse reinforcement learning relies on observing humans performing specific actions, and drawing the “right” conclusion as to what their preferences are. Indirectly, it relies on humans tinkering with its code to remove “errors”, i.e. things that don’t fit the mental image its human programmers have of what preferences should be.
Given that human desires are not independent (citation not needed), this process, if it produces a utility function, involves constructing something independent from non-independent input. However, to establish this utility function, the algorithm has access only to the particular problems given to it, and the particular mental images of its programmers. It is almost certain that the end result would be somewhat different if it were trained on different problems, or if its programmers had different intuitions. Therefore the process itself cannot be independent.
Ah, I see what you mean, and you’re right; the utility function constructed will depend on how the data points are sampled. This isn’t quite the same as saying that the result will depend on what results are actually available, though, unless knowledge about what results will be available is used to determine how to sample the data. This still seems like somewhat of a defect of inverse reinforcement learning, unless there ends up being a good case that some particular way of sampling the data is optimal for revealing underlying preferences and ignoring biases, or something like that.
Given that human desires are not independent (citation not needed)
That’s probably true, but on the other hand, you seem to want to pin the deviations of human behavior from VNM rationality on violations of the independence axiom, and it isn’t clear to me that this is the case (I don’t think the point you were making relies on this, so if you weren’t trying to make that claim then you can ignore this; it just seemed like you might be). There are situations where there are large framing effects (that is, whether A or B is preferred depends on how the options are presented, even if no other outcome C is being mixed in with them), and likely also violations of transitivity (where someone would say A>B, B>C, and C>A whenever you ask them about 2 of them without bringing up the third). It seems likely to me that most paradoxes of human decision-making have more to do with these than they do with violations of independence.
It can’t be inconsistent within a world no matter how many decision points there are. If we agree it’s not inconsistent, then what are you arguing against?
I don’t care about the VNM framework. As you said, it is designed to be optimal for decisions made in isolation. Because we don’t need to make decisions in isolation, we don’t need to be constrained by it.
No. Inconsistency between different possible worlds is still inconsistency.
Because we don’t need to make decisions in isolation, we don’t need to be constrained by it.
The difference doesn’t matter that much in practice. If there are multiple decision points, you can combine them into one by selecting a policy, or by considering them sequentially and using your beliefs about what your choices will be in the future to compute the expected utilities of the possible decisions available to you now. The reason that the VNM framework was designed for one-shot decisions is that it makes things simpler without actually constraining what it can be applied to.
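A small sketch of that reduction, with a made-up two-stage decision problem (all the probabilities and utilities below are illustrative assumptions):

```python
# Later decision point: assume you believe you will pick whichever option looks best then.
def predicted_future_value(future_options):
    return max(future_options.values())

later = {"settle": 6.0, "gamble": 4.0}  # utilities you expect to be choosing between later

# Options available now: "safe" ends things immediately; "explore" reaches the later
# decision point with probability 0.7 and a dud worth 2.0 otherwise.
options_now = {
    "safe": 5.0,
    "explore": 0.7 * predicted_future_value(later) + 0.3 * 2.0,
}

best_now = max(options_now, key=options_now.get)
print(best_now, options_now[best_now])  # the multi-step problem has become a one-shot choice
```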
No. Inconsistency between different possible worlds is still inconsistency.
It’s perfectly consistent in the sense that it’s not money pumpable, and always makes the same decisions given the same information. It will make different decisions in different situations, given different information. But that is not inconsistent by any reasonable definition of “inconsistent”.
The difference doesn’t matter that much in practice.
It makes a huge difference. If you want to get the best median future, then you can’t make decisions in isolation. You need to consider every possible decision you will have to make, and the probability of each, and then choose a decision policy that selects the best median outcome.
It’s perfectly consistent in the sense that it’s not money pumpable, and always makes the same decisions given the same information.
As in my previous example (sorry about it being difficult to follow, though I’m not sure yet what I could say to clarify things), it is inconsistent in the sense that it can lead you to pay for probability distributions over outcomes that you could have achieved for free.
You need to consider every possible decision you will have to make, and the probability of each.
Right. As I just said, “you can… consider them sequentially and use your beliefs about what your choices will be in the future to compute the expected utilities of the possible decisions available to you now.” (edited to fix grammar). This reduces iterated decisions to isolated decisions: you have certain beliefs about what you’ll do in the future, and now you just have to make a decision on the issue facing you now.