I agree that this can create perverse incentives in practice, but that seems like the sort of thing that you should be handling as part of your decision theory, not your utility function.
I’m mainly worried about the perverse incentives part.
I recognize that there’s some weird level-crossing going on here, where I’m doing something like mixing up the decision theory and the utility function. But it seems to me like that’s just a reflection of the weird muddy place our values come from?
You can think of humans a little like self-modifying AIs, but where the modification took place over evolutionary history. The utility function we eventually arrived at was (sort of) the result of a bargaining process between everyone, one which took some account of things like exploitability concerns.
In terms of decision theory, I often think in terms of a generalized NicerBot: extend everyone else the same cofrence-coefficient they extend to you, plus an epsilon (to ensure that two generalized NicerBots end up fully cooperating with each other). This is a pretty decent strategy for any game, generalizing from one of the best strategies for Prisoner’s Dilemma. (Of course there is no “best strategy” in an objective sense.)
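As a toy illustration of that fixed point (my own sketch, not anything from the original discussion, assuming cofrence-coefficients are capped at 1): if two generalized NicerBots each repeatedly set their coefficient toward the other to the other's coefficient toward them plus epsilon, the only stable point is mutual full caring.

```python
# Toy model (illustrative sketch only): two generalized NicerBots, each
# updating its cofrence-coefficient toward the other to be the other's
# coefficient toward it plus epsilon, capped at 1. Starting from mutual
# indifference, the coefficients climb until both hit 1: full mutual caring.

def nicerbot_update(their_coefficient, epsilon=0.05):
    """Extend the same cofrence-coefficient they extend to you, plus epsilon."""
    return min(1.0, their_coefficient + epsilon)

a_toward_b = 0.0  # A's cofrence-coefficient toward B
b_toward_a = 0.0  # B's cofrence-coefficient toward A

for round_number in range(1, 101):
    a_toward_b, b_toward_a = (
        nicerbot_update(b_toward_a),
        nicerbot_update(a_toward_b),
    )
    if a_toward_b >= 1.0 and b_toward_a >= 1.0:
        print(f"full mutual caring after {round_number} rounds")
        break
```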
But a decision theory like that does mix levels between the decision theory and the utility function!
I feel like the solution of having cofrences not count the other person’s cofrences just doesn’t respect people’s preferences—when I care about the preferences of somebody else, that includes caring about the preferences of the people they care about.
I totally agree with this point; I just don’t know how to balance it against the other point.
A crux for me is the coalition metaphor for utilitarianism. I think of utilitarianism as sort of a natural endpoint of forming beneficial coalitions, where you’ve built a coalition of all life.
If we imagine forming a coalition incrementally, and imagine that the coalition simply averages utility functions with its new members, then there’s an incentive to join the coalition as late as you can, so that your preferences get the largest possible representation. (I know this isn’t the same problem we’re talking about, but I see it as analogous, and so a point in favor of worrying about this sort of thing.)
We can correct that by doing 1/n averaging: every time the coalition gains members, we make a fresh average of all member utility functions (using some utility-function normalization, of course), and everybody voluntarily self-modifies to have the new mixed utility function.
But the problem with this is, we end up punishing agents for self-modifying to care about us before joining. (This is more closely analogous to the problem we’re discussing.) If they’ve already self-modified to care about us more before joining, then their original values just get washed out even more when we re-average everyone.
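To put rough numbers on these incentives (a toy sketch of my own, assuming the coalition splits weight evenly between its current mixture and each new member, and that member utilities are already normalized): under naive incremental averaging the last joiner's original utility keeps half the total weight, 1/n re-averaging removes that advantage, and an agent who had already mixed the coalition's values into its own utility before joining ends up with an even smaller share for its original values.

```python
# Toy numbers (illustrative sketch only): the weight each member's ORIGINAL
# utility function ends up with in the coalition's overall mixture, under
# two aggregation rules.

def incremental_average_weights(n_members):
    """Naive incremental averaging: each new member gets half the weight,
    diluting everyone who joined earlier, so joining late is favored."""
    weights = [1.0]
    for _ in range(1, n_members):
        weights = [w / 2 for w in weights] + [0.5]
    return weights

print(incremental_average_weights(4))  # [0.125, 0.125, 0.25, 0.5]

# 1/n re-averaging removes the late-joiner advantage: everyone gets 1/n.
n = 4
print([1.0 / n] * n)                   # [0.25, 0.25, 0.25, 0.25]

# But suppose the fourth member had already self-modified before joining,
# so its utility was only 50% its original values and 50% the coalition's.
# After the fresh 1/n re-average, its original values get just half of a
# 1/n share -- less than the 1/n a purely selfish joiner would have kept.
print(0.5 * (1.0 / n))                 # 0.125
```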
So really, the implicit assumption I’m making is that there’s an agent “before” altruism, who “chose” to add in everyone’s utility functions. I’m trying to set up the rules to be fair to that agent, in an effort to reward agents for making “the altruistic leap”.
But a decision theory like that does mix levels between the decision theory and the utility function!
I agree, though it’s unclear whether that’s an actual level crossing or just a failure of our ability to properly analyze that strategy. I would lean towards the latter, though I am uncertain.
A crux for me is the coalition metaphor for utilitarianism. I think of utilitarianism as sort of a natural endpoint of forming beneficial coalitions, where you’ve built a coalition of all life.
This is how I think about preference utilitarianism but not how I think about hedonic utilitarianism—for example, a lot of what I value personally is hedonic-utilitarianism-like, but from a social perspective, I think preference utilitarianism is a good Schelling point for something we can jointly agree on. However, I don’t call myself a preference utilitarian—rather, I call myself a hedonic utilitarian—because I think of social Schelling points and my own personal values as pretty distinct objects. And I could certainly imagine someone who terminally valued preference utilitarianism from a personal perspective—which is what I would call actually being a preference utilitarian.
Furthermore, I think that if you’re actually a preference utilitarian vs. if you just think preference utilitarianism is a good Schelling point, then there are lots of cases where you’ll do different things. For example, if you’re just thinking about preference utilitarianism as a useful Schelling point, then you want to carefully consider the incentives that it creates—such as the one that you’re pointing to—but if you terminally value preference utilitarianism, then that seems like a weird thing to be thinking about, since the question you should be asking in that context is more like what it is about preferences that you actually value, and why.
If we imagine forming a coalition incrementally, and imagine that the coalition simply averages utility functions with its new members, then there’s an incentive to join the coalition as late as you can, so that your preferences get the largest possible representation. (I know this isn’t the same problem we’re talking about, but I see it as analogous, and so a point in favor of worrying about this sort of thing.)
We can correct that by doing 1/n averaging: every time the coalition gains members, we make a fresh average of all member utility functions (using some utility-function normalization, of course), and everybody voluntarily self-modifies to have the new mixed utility function.
One thing I will say here is that usually when I think about socially agreeing on a preference utilitarian coalition, I think about doing so from more of a CEV standpoint, where the idea isn’t just to integrate the preferences of agents as they currently are, but as they will/should be from a CEV perspective. In that context, it doesn’t really make sense to think about incremental coalition forming, because your CEV (mostly, with some exceptions) should be the same regardless of what point in time you join the coalition.
But the problem with this is, we end up punishing agents for self-modifying to care about us before joining. (This is more closely analogous to the problem we’re discussing.) If they’ve already self-modified to care about us more before joining, then their original values just get washed out even more when we re-average everyone.
I guess this just seems like the correct outcome to me. If you care about the values of the coalition, then the coalition should care less about your preferences, because they can partially satisfy them just by doing what the other people in the coalition want.
So really, the implicit assumption I’m making is that there’s an agent “before” altruism, who “chose” to add in everyone’s utility functions. I’m trying to set up the rules to be fair to that agent, in an effort to reward agents for making “the altruistic leap”.
It certainly makes sense to reward agents for choosing to instrumentally value the coalition—and I would include instrumentally choosing to modify yourself to care more about the coalition in that—but I’m not sure why it makes sense to reward agents for terminally valuing the coalition—that is, terminally valuing the coalition independently of any decision-theoretic considerations that might cause you to instrumentally modify yourself to do so.
Again, I think this makes more sense from a CEV perspective—if you instrumentally modify yourself to care about the coalition for decision-theoretic reasons, that might change your values, but I don’t think that it should change your CEV. In my view, your CEV should be about your general strategy for how to modify yourself in different situations, rather than the particular incarnation of you that you’ve currently modified to.