I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.
Resource-constrained situations can be positive sum (consider most of the economy). The real problem is between antagonistic preferences, e.g. maximising flourishing lives vs negative utilitarianism, where a win for one is a loss for the other.
Note that this post considers the setting where we have uncertainty over the true reward function, but we can’t learn about the true reward function. If you can gather information about the true reward function, which <@seems necessary to me@>(@Human-AI Interaction@), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.
Yes, if you’re in a learning process and treat it as if you weren’t in a learning process, things will go wrong ^_^
I like that summary!
My model goes something like this: If increasing values requires using some resource, gaining access to more of the resource can be positive sum, while spending it is negative sum due to opportunity costs. In this model, the economy can be positive sum because it helps with alleviating resource constraints.
But maybe it does not really matter, if most interactions are positive-sum until some kind of resource limit is reached and only become negative-sum after that?
Generally, spending resources is zero-sum, not negative sum.
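To make this concrete, here is a minimal sketch (hypothetical numbers: a fixed budget and linear returns), showing that shifting spending between the two rewards changes the split but not the total:

```python
# Zero-sum resource spending: with a fixed budget and linear returns,
# every unit spent on R1 is a unit not spent on R2, and the total is constant.
BUDGET = 10.0

def rewards(spend_on_r1: float) -> tuple[float, float]:
    """Hypothetical linear returns: one unit of resource buys one unit of reward."""
    return spend_on_r1, BUDGET - spend_on_r1

for b in [0.0, 2.5, 5.0, 7.5, 10.0]:
    r1, r2 = rewards(b)
    print(f"spend {b:4.1f} on R1 -> R1={r1:4.1f}, R2={r2:4.1f}, sum={r1 + r2:4.1f}")
# The sum is always 10.0: spending trades reward one-for-one, i.e. zero-sum.
```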
Right. I think my intuition about negative-sum interactions under resource constraints conflated the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.
Thank you for alleviating my confusion.
To clear up some more confusion: the sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems rather to be that the best states for one of the (Edit: the expected) rewards are bad for the other?
That again seems like it would often follow from resource constraints.
Negative vs positive vs zero sum is all relative to what we take to be the default outcome.
I take the default as “no effort is made to increase or decrease any of the reward functions”.
But no matter how I take the default outcome, your second example is always “more positive sum” than the first, because 0.5 + 0.7 + 2x = 1.2 + 2x < 1.4 + 2x = 1.5 − 0.1 + 2x.
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to “more negative sum”, but this still seems to point to the sum-condition not being the central concept here. To me, it seems like “negative min” compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
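To make the sum-vs-min distinction concrete, here is a small sketch using the two example outcomes above (rewards measured relative to the default outcome, so the 2x terms cancel):

```python
# Compare the "sum" condition with the proposed "min" condition on the two
# examples: changes in (R1, R2) relative to the default outcome.
example_1 = (0.5, 0.7)   # both rewards improve moderately
example_2 = (1.5, -0.1)  # R1 improves a lot, R2 ends up below the default

for name, (r1, r2) in [("example 1", example_1), ("example 2", example_2)]:
    print(f"{name}: sum={r1 + r2:+.1f}, min={min(r1, r2):+.1f}")

# example 1: sum=+1.2, min=+0.5
# example 2: sum=+1.4, min=-0.1
# The sum prefers example 2, but the negative min flags it: one reward is left
# worse off than the default, which is the Goodhart-style failure.
```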
Or am I completely misunderstanding your examples or your point?
Ok, have corrected it now; the negative-sum formulation was wrong, sorry.
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
From this lens: 1) maximum likelihood pushes us to one of the ends of the Pareto boundary; 2) an unlikely true reward pushes us close to the “bad” end; 3) difficult optimization messes with normalization (I am still somewhat confused about the exact role of normalization); and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
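A small numerical sketch of the convex vs concave cases (both boundaries hypothetical; “convex” here means the boundary of a convex feasible set, as above):

```python
import numpy as np

# Two hypothetical Pareto boundaries between R1 and R2, both running from
# (0, 1) to (1, 0): "convex" bulges outward, "concave" bulges inward.
x = np.linspace(0.0, 1.0, 1001)
boundaries = {
    "convex":  np.sqrt(1.0 - x**2),                # quarter circle, outward
    "concave": 1.0 - np.sqrt(1.0 - (1.0 - x)**2),  # mirrored, inward
}

for name, y in boundaries.items():
    print(f"{name} boundary:")
    for p in [0.3, 0.5, 0.7]:  # belief that R1 is the true reward
        value = p * x + (1.0 - p) * y  # expected reward of each boundary point
        i = np.argmax(value)
        print(f"  P(R1)={p:.1f} -> optimal policy at R1={x[i]:.2f}, R2={y[i]:.2f}")
```

On the convex boundary the optimum moves smoothly with the belief; on the concave one it jumps straight between the two extreme policies.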
I think normalisation doesn’t fit in the convex-concave picture. Normalisation is to avoid things like a 1% chance of $100R_1$ being treated the same as a 100% chance of $R_1$.
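A minimal illustration of that failure mode (hypothetical policies A and B, with $R_1$ reported on a 100x scale):

```python
# Without normalization, a 1% belief in a reward scaled by 100 carries as much
# weight as near-certainty in an unscaled reward.
# Policy A maximizes R1, policy B maximizes R2.
P_R1, P_R2 = 0.01, 0.99
policies = {"A": (1.0, 0.0), "B": (0.0, 1.0)}  # (R1 achieved, R2 achieved)

for name, (r1, r2) in policies.items():
    raw = P_R1 * (100.0 * r1) + P_R2 * r2  # R1 on a 100x scale
    normalized = P_R1 * r1 + P_R2 * r2     # both rewards rescaled to [0, 1]
    print(f"policy {name}: raw={raw:.2f}, normalized={normalized:.2f}")

# Raw: A wins (1.00 vs 0.99) despite the 1% belief; normalized: B wins (0.99 vs 0.01).
```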
I was thinking about normalisation as linearly rescaling every reward to $[0,1]$ when I wrote the comment. Then, one can always look at $[0,1]^2$, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated into a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P(R_1)S_1R_1 + P(R_2)S_2R_2$ is the same as maximizing $\frac{P(R_1)S_1}{P(R_1)S_1 + P(R_2)S_2}R_1 + \frac{P(R_2)S_2}{P(R_1)S_1 + P(R_2)S_2}R_2$.
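A quick check of that equivalence (hypothetical rewards and scales): the second objective is just the first divided by the positive constant $P(R_1)S_1 + P(R_2)S_2$, so the optimal policy is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: R1, R2 values of 5 candidate policies (rescaled to [0, 1]),
# beliefs P(R1), P(R2) and scale factors S1, S2.
R1, R2 = rng.random(5), rng.random(5)
p1, p2 = 0.6, 0.4
s1, s2 = 3.0, 0.5

original = p1 * s1 * R1 + p2 * s2 * R2
z = p1 * s1 + p2 * s2  # positive normalizing constant
reweighted = (p1 * s1 / z) * R1 + (p2 * s2 / z) * R2

# Dividing by a positive constant never changes the argmax,
# so both objectives pick the same policy.
assert np.argmax(original) == np.argmax(reweighted)
```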
I like that way of seeing it.
You are correct; I was unclear (and wrong in that terminology). I will rework the post slightly.
Negative sum vs zero sum (vs positive sum, in fact) depends on defining some “default state” against which the outcome is compared. A negative sum game can become a positive sum game if you just give all the “players” a fixed bonus (i.e. translate the default state). Default states are somewhat tricky and often subjective to define.
Now, you said “the best states for one of the rewards are bad for the other”. “Bad” compared with what? I’m taking as a default something like “you make no effort to increase (or decrease) either reward”.
So, my informal definition of “zero sum” is “you may choose to increase either $R_1$ or $R_2$ (roughly) independently of each other, from a fixed budget”. Weakly positive sum would be “the more you increase $R_1$, the easier it gets to increase $R_2$ (and vice versa) from a fixed budget”; strongly positive sum would be “the more you increase $R_1$, the more $R_2$ increases (and vice versa)”.
Negative sum would be the opposite of this (“easier” → “harder” and “increases” → “decreases”).
The reason I distinguish weak and strong is that adding diminishing returns reduces the impact of weak negative sum, but can’t solve strong negative sum.
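A toy model of that distinction (all functional forms hypothetical), with diminishing returns implemented as a concave utility over each reward:

```python
import numpy as np

def u(r):
    """Diminishing returns: a concave, increasing utility over a reward."""
    return 1.0 - np.exp(-2.0 * r)

# Weak negative sum: split a fixed budget; the more R1 has been increased, the
# less each remaining unit buys for R2 -- but both rewards can still gain.
b = np.linspace(0.0, 1.0, 101)  # fraction of the budget spent on R1
weak_r1, weak_r2 = b, (1.0 - b) / (1.0 + 3.0 * b)
i = np.argmax(u(weak_r1) + u(weak_r2))
print(f"weak:   R1={weak_r1[i]:+.2f}, R2={weak_r2[i]:+.2f}")  # both above default

# Strong negative sum: directly antagonistic rewards -- any gain for R1 is a
# loss for R2 and vice versa, so no action lifts both above the default (0, 0).
a = np.linspace(-1.0, 1.0, 201)
strong_r1, strong_r2 = a, -a
j = np.argmax(u(strong_r1) + u(strong_r2))
print(f"strong: R1={strong_r1[j]:+.2f}, R2={strong_r2[j]:+.2f}")  # stuck at default
```

In the weak case the concave utility pulls the optimum to a balanced split where both rewards beat the default; in the strong case every gain for one reward is exactly offset by a loss for the other, so no amount of diminishing returns can lift both above the default.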
Does this help, or add more confusion?