Wei Dai comments on A broad basin of attraction around human values?

Wei Dai 12 Feb 2022 16:30 UTC
LW: 5 AF: 4
0
AF

I think Paul’s argument amounts to saying that a corrigibility approach focuses directly on mitigating the “lock-in” of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking-in its best guess.

What’s the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of lock-in of wrong preferences, due to bad metapreferences of the overseer, and ambitious value learning, or some ways of doing that, could turn out to be less risky with respect to lock-in, because for example you could potentially examine the metapreferences that a value-learning AI has learned, which might make it more obvious that they’re not safe enough as is, triggering attempts to do something about that.