This sort of works, but not enough to solve it.
A core problem lies in the distribution of goals that you vary over. The AI will be trained to be corrigible within the range of that distribution, but there is no particular guarantee that it will be corrigible outside it.
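A minimal toy sketch of the kind of setup this points at (the Gaussian goal distribution, the single comply/resist decision, and the reward are all illustrative assumptions on my part, not the actual proposal): the policy is rewarded for deferring to goal changes, but only ever on goals drawn from the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL_DIM = 8

def sample_goal():
    # The "distribution of goals you vary over": here, just Gaussian vectors.
    return rng.normal(size=GOAL_DIM)

# Toy policy: probability of accepting an operator-initiated goal change,
# as a logistic function of the current goal.
w, b = np.zeros(GOAL_DIM), 0.0

def p_comply(goal):
    return 1.0 / (1.0 + np.exp(-(goal @ w + b)))

lr = 0.1
for _ in range(2000):
    goal = sample_goal()                 # vary the goal each episode
    p = p_comply(goal)
    complied = rng.random() < p
    reward = 1.0 if complied else -1.0   # reward deference, penalize resistance
    # REINFORCE update for a Bernoulli policy.
    dlogp = (1.0 - p) if complied else -p
    w += lr * reward * dlogp * goal
    b += lr * reward * dlogp

# The training signal only ever evaluates corrigibility on goals drawn from
# sample_goal(); nothing in the loop constrains behaviour on goals outside
# that distribution's support.
print("in-distribution compliance prob:", p_comply(sample_goal()))
```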
So you need to make sure that your distribution of goals contains human values. How do you guarantee that it contains them, without getting Goodharted into instead containing something that merely superficially resembles human values?
It might be tempting to achieve this by making the distribution very general, with lots of varied goals, so that it contains lots of alien values, human values included. But then human values are given exponentially small probability, which utility-wise is similar to the distribution not containing human values.
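To make "exponentially small" concrete (a toy illustration with assumed numbers, not something from the original comment): if goals are parameterized by $n$ independent binary attributes and the distribution is roughly uniform over them, then any one specific goal, human values included, gets probability on the order of

$$\Pr[\text{human values}] \approx 2^{-n},$$

so for even a modest $n$ of around 100 that is roughly $10^{-30}$, and the weight human values carry in the training mixture is negligible. That is the sense in which it is utility-wise similar to the distribution not containing human values at all.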
So you need to somehow give human values a high probability within the distribution. But at that point you’re most of the way to just figuring out what human values are in the first place and directly aligning to them.
“which utility-wise is similar to the distribution not containing human values.” - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values? For corrigibility, I don’t see why you need a high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially-aligned goals is enough.
Hmm, I think I retract my point. I suspect something similar to it still applies, but as written it doesn’t fit 100%, and I can’t quickly analyze your proposal well enough to apply it.
More on the meta level: “This sort of works, but not enough to solve it.”—do you mean “not enough” as in “good try but we probably need something else” or as in “this is a promising direction, just solve some tractable downstream problem”?