If you say (or even believe) that X is good, but never act to promote X, I’d say you don’t actually value X in a way that’s relevant for alignment. An AI which says / believes human flourishing is good, without actually being motivated to make said flourishing happen, would be an alignment failure.
I also think that an agent's answers to the more abstract value questions, like "how good is something?", are strongly influenced by how its simpler, more concrete values form earlier in the learning process. Our intent here was to address the simple case, with future posts discussing how abstract values might derive from simpler ones.
I agree that value in the sense of goodness is not relevant for alignment. What's relevant is what the AI is motivated to do, not what it believes to be good. I'm just saying that your usage of "value" would be called "desire" by philosophers.
The term "values" often suffers from this ambiguity. If someone says an agent A "values" an outcome O, do they mean that A believes O is good, i.e. that A believes O has a high value, a high degree of goodness? Or do they mean that A wants O to obtain, i.e. that A desires O? The usage is often ambiguous between the two.
A solution would be to taboo the term “values” and instead talk directly about desires, or what is good or believed to be good.
But you actually clarified at the beginning of your post that you mean value in the sense of desire, so this usage seems fine in your case.
The term "desire" has one problem of its own: it arguably implies consciousness. We want to be able to talk about anything an AI might "want", in the sense of being motivated to bring about, without implying that the AI is conscious.