Actually, my confusion is mostly on the source of the adverse selection pressure.
The only source of adverse selection pressure I understand is from multi-agent interactions in multipolar environments, but on my model, agent foundations researchers do not expect multipolar outcomes by default / as the modal case, so I’m not really understanding the mechanism for adverse selection.
I listed some hypotheses for where the adverse selection could come from, but I don’t have a compelling intuitive story for why they’d produce such adverse selection.
The John Wentworth argument that you are responding to is:
Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
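For concreteness, here is a minimal toy sketch (my own illustration, not something from the thread) of the dynamic the quote describes: the proxy is the true value plus a heavier-tailed error term, so under light selection the proxy looks approximately fine, but as the candidate pool grows, the proxy-optimal candidate is increasingly selected for its error rather than for what we actually care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Goodhart setup (assumed for illustration): proxy = true value + heavy-tailed error.
for pool_size in (10, 1_000, 1_000_000):
    true_value = rng.normal(size=pool_size)          # what we actually care about
    error = rng.laplace(scale=0.5, size=pool_size)   # error term with heavier tails than the signal
    proxy = true_value + error                       # what the optimizer actually sees

    picked = np.argmax(proxy)                        # more optimization pressure = larger pool to pick the proxy-best from
    print(f"pool={pool_size:>9,}: true value of proxy-chosen candidate = "
          f"{true_value[picked]:+.2f}, best available = {true_value.max():+.2f}")
```

In this toy setup, under light selection the proxy-chosen candidate's true value sits close to the best available, but as the pool grows it plateaus while the best available keeps climbing: the proxy that seemed approximately fine stops tracking the true objective once a lot of optimization pressure is applied.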
Could you give an example of a desirable safety property where you are unsure what angle the adversarial pressure would come from?
Corrigibility is the first one that comes to mind.
What’s a proxy of corrigibility that you think might at first glance seem approximately-fine?
Obedience/deference seem like the obvious proxies for corrigibility.
A proxy is supposed to be observable so that it can actually serve its intended purpose.
What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that use?