We’re aiming to solve the problem in a way that is acceptable to one given human, and then generalise from that.
This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and, by analogy, the problem of coherence in CEV is in some ways the tricky part. That is, the problem of making sure that we’re not Dutch-bookable is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results.
Edit: To clarify, I’m excited about the approach overall, and think it’s likely to be valuable, but this part seems like a big problem.
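To make the worry about aggregating preferences concrete, here is a minimal sketch of the classic Condorcet cycle, one of the ingredients behind the impossibility results mentioned above. The three voters and their rankings are hypothetical, and this is an illustration rather than a proof of any particular theorem: each voter is individually transitive, yet pairwise majority voting yields a cyclic group preference, which is the kind of structure a Dutch book can exploit.

```python
from itertools import permutations

# Three hypothetical voters, each with a transitive ranking (best option first).
rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["c", "a", "b"],
]


def majority_prefers(x, y):
    """Return True if a strict majority of voters rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2


# Collect every ordered pair (x, y) where the majority prefers x to y.
majority = {(x, y) for x, y in permutations("abc", 2) if majority_prefers(x, y)}
print(sorted(majority))
```

The majority prefers a to b, b to c, and c to a, so no option is a stable group favourite even though no individual voter is incoherent.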
I’ve posted on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum).
But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn’t need human value aggregation.
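The practical recipe mentioned above (scale the utilities, add them, maximise the sum) can be sketched as follows. This is only one possible reading: min-max scaling to [0, 1] is an assumed choice of "not-too-unreasonable scale", and the agents, options, and numbers are hypothetical.

```python
def normalise(utility):
    """Rescale one agent's utilities to [0, 1] (min-max scaling, an assumed choice)."""
    lo, hi = min(utility.values()), max(utility.values())
    if hi == lo:
        # An agent indifferent between all options contributes nothing.
        return {option: 0.0 for option in utility}
    return {option: (u - lo) / (hi - lo) for option, u in utility.items()}


def aggregate(utilities):
    """Sum the normalised utilities and return (best option, per-option totals)."""
    options = next(iter(utilities.values())).keys()
    scaled = [normalise(u) for u in utilities.values()]
    totals = {option: sum(s[option] for s in scaled) for option in options}
    return max(totals, key=totals.get), totals


# Hypothetical agents whose raw utilities live on very different scales.
agents = {
    "alice": {"a": 10.0, "b": 4.0, "c": 0.0},
    "bob": {"a": 0.0, "b": 0.9, "c": 1.0},
    "carol": {"a": 2.0, "b": 3.0, "c": 1.0},
}
best, totals = aggregate(agents)
print(best, totals)
```

The result depends on the normalisation chosen, which is where the theoretical difficulties re-enter: a different scaling rule can change which option wins.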
I suspect that many of the problems with aggregation both apply to actual individual human values once extrapolated and generalize to AIs with closely related values, but I’d need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it’s nowhere near complete in addressing this issue.)
It’s worth writing up your point and posting it; that tends to clarify the issue, for yourself as well as for others.
Hi David,
As Stuart referenced in his comment to your post here, value extrapolation can be the key to AI alignment *without* using it to deduce the set of human values. See the ‘List of partial failures’ in the original post: With value extrapolation, these approaches become viable.