Hm, there seem to be two ways the statement “human values are a natural abstraction” could be read:
1. “Human values” are a simple/convergent feature of the concept-space, such that we can expect many alien civilizations to have a representation of them, and for AIs’ preferences to easily fall into that basin.
2. “Human values” in the sense of “what humans value” — i.e., if you’re interacting with the human civilization, the process of understanding that civilization and breaking your model of it into abstractions will likely involve computing a representation for “whatever humans mean when they say ‘human values’”.
To draw an analogy, suppose we have an object with some Shape X. If “X = a sphere”, we can indeed expect most civilizations to have a concept of it. But if “X = the shape of a human”, most aliens would never happen to think about that specific shape on their own. However, any alien/AI that’s interacting with the human civilization surely would end up storing a mental shorthand for that shape.
I think (1) is false and (2) is… probably mostly true in the ways that matter. Humans don’t have hard-coded utility functions, and human minds are very messy, so there may be several valid ways to answer the question of “what does this human value?”. Worse yet, every individual human’s preferences, if considered in detail, are unique, so even once you decide what you mean by “a given human’s values”, there are likely different valid ways of aggregating them across people. But hopefully the human usage of those terms isn’t too inconsistent, and there’s a distinct “correct according to humans” way of thinking about human values. Or at least a short list of such ways.
(1) being false would be bad for proposals of the form “figure out value formation and set up the training loop just so in order to, e.g., generate an altruism shard inside the AI”. But I think (2) being even broadly true would suffice for retarget-the-search–style proposals.