axioman comments on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

axioman 9 Jan 2020 21:03 UTC
1 point
But no matter how I take the default outcome, your second example is always “more positive sum” than the first, because 0.5 + 0.7 + 2x < 1.5 − 0.1 + 2x.
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to “more negative sum”, but this still seems to point to the sum-condition not being the central concept here. To me, it seems like “negative min” compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
Or am I completely misunderstanding your examples or your point?
- Stuart_Armstrong 11 Jan 2020 17:37 UTC
  2 points
  Ok, have corrected it now; the negative-sum formulation was wrong, sorry.
  - axioman 12 Jan 2020 11:26 UTC
    10 points
    After looking at the update, my model is:
    (Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
    Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the “tipping point” in beliefs, where the opposite extreme policy is suddenly favoured).
    In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.
    Through this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the “bad” end, 3) difficult optimization messes with normalization (I am still somewhat confused about the exact role of normalization), and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
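A minimal numerical sketch of the convex/concave model above, under assumptions that are not in the thread: the Pareto boundary is taken to be the hypothetical family $R_{1}^{p} + R_{2}^{p} = 1$ on $[0, 1]^{2}$, where $p > 1$ stands in for the "strictly convex" case, $p = 1$ for the linear case, and $p < 1$ for the "concave" case, and `q` is the belief in $R_{1}$. The function name `optimal_r1` is made up for illustration.

```python
import numpy as np

def optimal_r1(q, p, n=2001):
    """R1 achieved by the policy maximizing q*R1 + (1-q)*R2.

    Policies are points on the assumed boundary R1^p + R2^p = 1,
    discretized on a grid of n values of R1 in [0, 1].
    """
    r1 = np.linspace(0.0, 1.0, n)
    r2 = (1.0 - r1 ** p) ** (1.0 / p)      # boundary solved for R2
    expected = q * r1 + (1.0 - q) * r2      # expected reward under belief q
    return float(r1[np.argmax(expected)])

# p = 2 (bulging boundary): a moderate belief gives an interior, mixed policy.
mixed = optimal_r1(0.6, 2.0)

# p = 0.5 or p = 1: the same moderate belief already selects an extreme policy,
# and the choice flips to the other extreme once q crosses the tipping point.
extreme_high = optimal_r1(0.6, 0.5)
extreme_low = optimal_r1(0.4, 0.5)
linear = optimal_r1(0.6, 1.0)
```

With this particular family, the policy for $p = 2$ only approaches an end of the boundary as `q` approaches 0 or 1, which is one way to read "extreme policies require strong beliefs".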
    - Stuart_Armstrong 13 Jan 2020 12:20 UTC
      2 points
      I think normalisation doesn’t fit in the convex-concave picture. Normalisation is to avoid things like $1\% \, (100 R_{1})$ being seen as the same as $100\% \, (R_{1})$.
      - axioman 13 Jan 2020 13:39 UTC
        1 point
        I was thinking about normalisation as linearly rescaling every reward to $[0, 1]$ when I wrote the comment. Then, one can always look at $[0, 1]^{2}$, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P (R_{1}) S_{1} R_{1} + P (R_{2}) S_{2} R_{2}$ is the same as maximizing $\frac{P (R_{1}) S_{1}}{P (R_{1}) S_{1} + P (R_{2}) S_{2}} R_{1} + \frac{P (R_{2}) S_{2}}{P (R_{1}) S_{1} + P (R_{2}) S_{2}} R_{2}$.
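The equivalence above can be checked numerically. The sketch below is only an illustration under assumed values: the quarter-circle boundary, the beliefs `p1`, `p2` and the scales `s1`, `s2` are all made up, and `best_policy` is a hypothetical helper. It compares the policy chosen with the raw weights $P(R_{i}) S_{i}$ against the one chosen with the reweighted beliefs.

```python
import numpy as np

def best_policy(w1, w2, r1, r2):
    """Index of the candidate policy maximizing w1*R1 + w2*R2."""
    return int(np.argmax(w1 * r1 + w2 * r2))

# Assumed candidate policies: points on a quarter circle inside [0, 1]^2.
t = np.linspace(0.0, np.pi / 2, 1001)
r1, r2 = np.cos(t), np.sin(t)

p1, p2 = 0.3, 0.7      # assumed beliefs P(R1), P(R2)
s1, s2 = 100.0, 1.0    # assumed scales S1, S2

# Policy chosen when the scaled rewards are weighted by the raw beliefs.
raw = best_policy(p1 * s1, p2 * s2, r1, r2)

# Same choice expressed as reweighted beliefs over the normalized rewards.
z = p1 * s1 + p2 * s2
b1, b2 = p1 * s1 / z, p2 * s2 / z
reweighted = best_policy(b1, b2, r1, r2)
```

Dividing the objective by the positive constant `z` does not change which policy is best, so `raw` and `reweighted` pick the same point; the scale `s1 = 100` simply shows up as a much larger effective belief `b1` in $R_{1}$.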
    - Stuart_Armstrong 13 Jan 2020 12:18 UTC
      2 points
      I like that way of seeing it.
- Stuart_Armstrong 10 Jan 2020 16:11 UTC
  2 points
  You are correct; I was unclear (and wrong in that terminology). I will rework the post slightly.