“which utility-wise is similar to the distribution not containing human values.” - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values?
For corrigibility, I don’t see why you need a high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially aligned goals is enough.
Hmm, I think I retract my point. I suspect something similar applies, but as written it doesn’t quite fit, and I can’t quickly analyze your proposal and adapt my point to it.