The John Wentworth argument that you are responding to is:
Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
What’s a proxy of corrigibility that you think might at first glance seem approximately-fine?
Obedience and deference seem like the obvious proxies for corrigibility.
A proxy is supposed to be observable so that it can actually serve its intended purpose.
What use do you have for a measure of corrigibility, and how do you intend to observe obedience/deference for that purpose?