“which utility-wise is similar to the distribution not containing human values.” - from the point of view of corrigibility to human values, or of learning capabilities to achieve human values?
For corrigibility, I don’t see why you need a high probability for any specific new goal, as long as the distribution is diverse enough that there is no simpler generalization than “don’t care about controlling goals”. For capabilities, my intuition is that starting with superficially aligned goals is enough.
Hmm, I think I retract my point. I suspect something similar applies, but as written it doesn’t quite fit, and I can’t quickly analyze your proposal and adapt my point to it.