A bit tangential: Regarding the terminology, what you here call “values” would be called “desires” by philosophers. Perhaps also by psychologists. Desires measure how strongly an agent wants an outcome to obtain. Philosophers would mostly regard “value” as a measure of how good something is, either intrinsically or for something else. There appears to be no overly strong connection between values in this sense and desires, since you may believe that something is good without being motivated to make it happen, or the other way round.
If you say (or even believe) that X is good, but never act to promote X, I’d say you don’t actually value X in a way that’s relevant for alignment. An AI which says / believes human flourishing is good, without actually being motivated to make said flourishing happen, would be an alignment failure.
I also think that an agent's answers to the more abstract value questions like "how good something is" are strongly influenced by how the simpler / more concrete values form earlier in the learning process. Our intent here was to address the simple case, with future posts discussing how abstract values might derive from simpler ones.
I agree that value in the sense of goodness is not relevant for alignment. What's relevant is what the AI is motivated to do, not what it believes to be good. I'm just saying that your usage of "value" would be called "desire" by philosophers.
The term "values" often suffers from this ambiguity. If someone says an agent A "values" an outcome O, do they mean A believes that O is good, i.e. that A believes O has a high value, a high degree of goodness? Or do they mean that A wants O to obtain, i.e. that A desires O?
A solution would be to taboo the term “values” and instead talk directly about desires, or what is good or believed to be good.
But in your post you actually clarified at the beginning that you mean value in the sense of desire, so this usage seems fine in your case.
The term "desire" has one problem of its own: it arguably implies consciousness. We want to be able to talk about anything an AI might "want", in the sense of being motivated to realize it, without assuming the AI is conscious.