I think the essence and conclusion of this post are almost certainly correct, not only for the reasons that Matthew Barnett gave (namely that individual users will want to use AGI to further their own goals and desires rather than to fulfill abstract altruistic targets, meaning companies will be incentivized to build such “user intent-aligned” AIs), but also because I consider the concept of a value-aligned AGI to be confused and ultimately incoherent. To put it differently, I am very skeptical that a “value-aligned” AGI is possible, even in theory.
Wei Dai explained the basic problem about 6 years ago, in a comment on one of the early posts in Rohin Shah’s Value Learning sequence:
On second thought, even if you assume the latter [putting humans in arbitrary virtual environments (along with fake memories of how they got there) in order to observe their reactions], the humans you’re learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they’re going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?
In response, Rohin correctly pointed out that this perspective implies a great deal of pessimism about even the theoretical possibility of value learning (1, 2):
But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.
[...] Though if you accept that human values are inconsistent and you won’t be able to optimize them directly, I still think “that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.”
By “true human utility function” I really do mean a single function that when perfectly maximized leads to the optimal outcome.
I think “human values are inconsistent” and “people with different experiences will have different values” and “there are distributional shifts which cause humans to be different than they would otherwise have been” are all different ways of pointing at the same problem.
As I have written before (1, 2), I do not believe that “values” and “beliefs” ultimately make sense as distinct, coherent concepts that carve reality at the joints:
Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters [...]. Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
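To sketch the indistinguishability point slightly more formally (this formalization is mine, not part of the quoted text): suppose an agent’s evaluation of an outcome can only be computed from what that agent can observe or infer, i.e. $u = f \circ \mathrm{obs}$ for some function $f$. Then

$$\mathrm{obs}(\omega_1) = \mathrm{obs}(\omega_2) \;\Longrightarrow\; u(\omega_1) = f(\mathrm{obs}(\omega_1)) = f(\mathrm{obs}(\omega_2)) = u(\omega_2),$$

so no preference ordering the agent can actually implement separates outcomes it cannot, even in principle, tell apart. Whatever distinctions the sample space draws below that resolution do no work.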
There has already been a great deal of discussion about these topics on LW (1, 2, etc), and Charlie Steiner’s distillation of it in his excellently-written Reducing Goodhart sequence still seems entirely correct:
Humans don’t have our values written in Fortran on the inside of our skulls, we’re collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It’s not that there’s some pre-theoretic set of True Values hidden inside people and we’re merely having trouble getting to them—no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like “which atoms exactly count as part of the person” and “what do you do if the person says different things at different times?”
The natural framing of Goodhart’s law—in both mathematics and casual language—makes the assumption that there’s some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.
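Under that standard framing, the U-versus-V gap is easy to make concrete. Here is a minimal toy sketch (the functions `V` and `U` below are entirely made up for illustration): a proxy `U` that was only fit against a stipulated “true” `V` on part of the space, and an optimizer that pushes far outside that part.

```python
import numpy as np

# Toy illustration of the framing being criticized above: assume, purely for
# the sake of the sketch, that a true value function V exists, and that a
# proxy U was inferred from behaviour on the region x <= 3 only.
def V(x):
    # hypothetical "true" values: best outcome at x = 1
    return 1.0 - np.abs(x - 1.0)

def U(x):
    # hypothetical learned proxy: matches V for x <= 3, diverges beyond it
    return V(x) + 0.5 * np.maximum(x - 3.0, 0.0) ** 2

xs = np.linspace(0.0, 10.0, 1001)
x_star = xs[np.argmax(U(xs))]  # optimize the proxy as hard as possible

print(f"argmax of U is x = {x_star:.2f}: U = {U(x_star):.2f}, V = {V(x_star):.2f}")
print(f"best achievable V is {V(xs).max():.2f} (at x = {xs[np.argmax(V(xs))]:.2f})")
```

The point of the quoted passage, though, is that for humans the very first step of this setup, writing down `V` at all, is where the assumption breaks: extracting any `V` is already a theory-laden modeling choice rather than a measurement of something that exists independently of it.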