We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.
As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.
See “How Low Should Fruit Hang Before We Pick It?”.
I generally don’t read links when there’s no context provided, and think it’s almost always worth it (from a cooperative perspective) to provide a bit of context.
Can you give me a TL;DR of why this is relevant or what your point is in posting this link?
The post addresses to what extent that trade-off can be tuned safely, and the surrounding sequence motivates that penalization scheme in greater generality. From the Conclusion to ‘Reframing Impact’:
OK, thanks.
The TL;DR seems to be: “We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans.”
This seems good… can you confirm my understanding below is correct?
1) RE: “A lower bound”: This seems good because we don’t need to know how extreme catastrophes could be; we can just say: “If (e.g.) the Earth or the human species ceased to exist as we know it within the year, that would be catastrophic”.
2) RE: “How much utility is available”: I guess we can just set a target level of utility gain, and it won’t matter if there are plans we’d consider reasonable that would exceed that level? (e.g. “I’d be happy if we can make 50% more paperclips at the same cost in the next year.”)
That’s correct.
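To spell out why those two ingredients suffice, here is a back-of-the-envelope sketch in notation introduced purely for illustration (the normalized reward, $U_r$, $I_r$, and $k$ below are assumptions of the sketch, not quantities from the post). Suppose the agent picks the plan maximizing reward minus $\lambda$ times its impact penalty, with reward normalized to $[0, 1]$; a reasonable plan yields utility about $U_r$ with impact $I_r$; and every catastrophic plan has impact at least $k \cdot I_r$, where $k$ is the lower bound on the catastrophe/reasonable impact ratio. Then

$$\text{score}(\text{reasonable}) = U_r - \lambda I_r, \qquad \text{score}(\text{catastrophe}) \le 1 - \lambda k I_r,$$

so the agent prefers the reasonable plan whenever $\lambda > \frac{1 - U_r}{(k - 1)\, I_r}$, and the reasonable plan still beats inaction (score $0$) whenever $\lambda < \frac{U_r}{I_r}$. That window is non-empty exactly when $k > 1/U_r$, so a lower bound on $k$ together with a rough estimate of $U_r$ is enough to tell whether a safe-but-useful $\lambda$ exists and roughly where to search for it.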