(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) Difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization) and 4) Not accounting for diminishing returns bends the pareto boundary to become more concave.
I was thinking about normalisation as linearly rescaling every reward to [0,1] when I wrote the comment. Then, one can always look at [0,1]2, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing P(R1)S1R1+P(R2)S2R2 is the same as maximizing P(R1)S1P(R1)S1+P(R2)S2R1+P(R2)S2P(R1)S1+P(R2)S2R2
Ok, have corrected it now; the negative-sum formulation was wrong, sorry.
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) Difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization) and 4) Not accounting for diminishing returns bends the pareto boundary to become more concave.
I think normalisation doesn’t fit in the convex-concave picture. Normalisation is to avoid things like 1%(100R1) being seen as the same as 100%(R1).
I was thinking about normalisation as linearly rescaling every reward to [0,1] when I wrote the comment. Then, one can always look at [0,1]2, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing P(R1)S1R1+P(R2)S2R2 is the same as maximizing P(R1)S1P(R1)S1+P(R2)S2R1+P(R2)S2P(R1)S1+P(R2)S2R2
I like that way of seeing it.