Negative vs positive vs zero sum is all relative to what we take to be the default outcome.
I take the default as “no effort is made to increase or decrease any of the reward functions”.
But no matter how I take the default outcome, your second example is always “more positive sum” than the first, because 0.5 + 0.7 + 2x < 1.5 − 0.1 + 2x.
Granted, you could construct examples where the inequality is reversed and the bad Goodhart case corresponds to “more negative sum”, but this still seems to suggest that the sum-condition is not the central concept here. To me, “negative min” compared to the default outcome seems closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
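The distinction can be made concrete with a toy check in the spirit of the numbers above (the outcome values are hypothetical, measured as changes relative to the default):

```python
# Toy illustration of "negative sum" vs "negative min", where each tuple
# holds the changes in (reward 1, reward 2) relative to the default outcome.
# The numbers are hypothetical, echoing the 1.5 / -0.1 example above.
goodhart_outcome = (1.5, -0.1)   # large gain on one reward, small loss on the other
benign_outcome = (0.5, 0.7)      # both rewards improve

def is_negative_sum(deltas):
    # "negative sum": the total change across rewards falls below the default
    return sum(deltas) < 0

def is_negative_min(deltas):
    # "negative min": at least one reward ends up below the default
    return min(deltas) < 0

print(is_negative_sum(goodhart_outcome))  # False: 1.5 - 0.1 = 1.4 > 0
print(is_negative_min(goodhart_outcome))  # True: the second reward lost 0.1
print(is_negative_min(benign_outcome))    # False: no reward is worse off
```

The point is that an outcome can be positive-sum overall while still leaving one reward worse than the default, which the min-condition catches and the sum-condition does not.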
Or am I completely misunderstanding your examples or your point?
Ok, have corrected it now; the negative-sum formulation was wrong, sorry.
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
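This convex-vs-concave behaviour can be sketched numerically. The frontier shapes below are my own assumed examples (a quarter circle bowed outward for the convex case, the same endpoints bowed inward for the concave case), not anything from the original post:

```python
import numpy as np

# Assumed example frontiers over two rewards (R1, R2), both running from
# (1, 0) to (0, 1): the convex feasible set has a frontier bowed outward,
# the concave frontier is bowed inward toward the origin.
theta = np.linspace(0.0, np.pi / 2, 2001)
convex_frontier = np.stack([np.cos(theta), np.sin(theta)], axis=1)
concave_frontier = np.stack([1 - np.sin(theta), 1 - np.cos(theta)], axis=1)

def optimal_point(frontier, p):
    """Frontier point maximizing expected reward p*R1 + (1-p)*R2 for belief p."""
    values = p * frontier[:, 0] + (1 - p) * frontier[:, 1]
    return frontier[np.argmax(values)]

# Moderate beliefs either side of 50/50:
for p in (0.45, 0.55):
    print(p, optimal_point(convex_frontier, p), optimal_point(concave_frontier, p))
```

On the convex frontier the optimum moves smoothly through interior (compromise) points as the belief changes, while on the concave frontier even a 45/55 belief jumps to an extreme endpoint, with the tipping point at p = 0.5 in this symmetric example.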
From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalisation), and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
I think normalisation doesn’t fit in the convex-concave picture. Normalisation is to avoid things like $1\%\times(100R_1)$ being seen as the same as $100\%\times R_1$.
I was thinking about normalisation as linearly rescaling every reward to $[0,1]$ when I wrote the comment. Then, one can always look at $[0,1]^2$, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P(R_1)S_1R_1 + P(R_2)S_2R_2$ is the same as maximizing $\frac{P(R_1)S_1}{P(R_1)S_1 + P(R_2)S_2}R_1 + \frac{P(R_2)S_2}{P(R_1)S_1 + P(R_2)S_2}R_2$.
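The scale-as-belief-reweighting identity can be checked numerically. The beliefs, scales, and candidate policies below are arbitrary choices for illustration:

```python
import numpy as np

# Check that maximizing P1*S1*R1 + P2*S2*R2 picks the same policy as
# maximizing with the normalized weights w_i = P_i*S_i / (P1*S1 + P2*S2).
# Beliefs, scales, and policies are hypothetical.
P1, P2 = 0.6, 0.4   # beliefs over the two reward functions
S1, S2 = 5.0, 1.0   # scales of the rewards before normalization

rng = np.random.default_rng(0)
policies = rng.uniform(0, 1, size=(100, 2))  # each row: (R1, R2) a policy achieves

raw = P1 * S1 * policies[:, 0] + P2 * S2 * policies[:, 1]
Z = P1 * S1 + P2 * S2
reweighted = (P1 * S1 / Z) * policies[:, 0] + (P2 * S2 / Z) * policies[:, 1]

# The two objectives differ only by the positive constant Z, so the argmax agrees.
print(np.argmax(raw) == np.argmax(reweighted))
```

Since dividing the objective by a positive constant never changes which policy is optimal, absorbing the scales into the beliefs this way is exact.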
I like that way of seeing it.
You are correct; I was unclear (and wrong in that terminology). I will rework the post slightly.