I’m pretty sure John has disavowed the argument presented in Alignment by Default.
EDIT: And Alignment by Default was based on three notions: “natural abstractions are pretty discrete”, “there is a natural abstraction for the space of human values”, and “the training process will bias the AI towards natural abstractions”. I think John disavows the middle claim. That claim, to be clear, is that there is a natural abstraction which serves as a pointer to human values, not that the AI has a bunch of computations running niceness or cooperation or honour or so on.
Specifically, the claim I endorse is that the inputs to human values are natural abstractions. This is mostly a claim in relation to the Pointers Problem: humans value high-level abstract stuff, but we value that stuff “in the world” (as opposed to valuing our perceptions of the stuff), while abstraction is ultimately in the mind rather than the territory. It’s hard to make those pieces play together at the same time. I claim that the way they fit together is that the things-humans-value are expressed in terms of natural abstractions.
That does not imply that “human values” themselves are a natural abstraction; I’m agnostic on that question. Alignment by Default unfortunately did a poor job communicating the distinction.
My intuition is that niceness/cooperation/etc are more likely than “human values” to be natural abstractions, but I still wouldn’t be highly confident. And even if human versions of niceness/cooperation/etc are natural abstractions, I agree with Nate’s point that (in my terminology) they may be specific to an environment composed mainly of human minds.
The whole matter is something I have a lot of uncertainty around. In general, which value-loaded concepts are natural abstractions (and in which environments) is an empirical question, and my main hope is that those questions can be answered empirically before full-blown superhuman AGI.
Hm, there seem to be two ways the statement “human values are a natural abstraction” could be read:
1. “Human values” are a simple/convergent feature of the concept-space, such that we can expect many alien civilizations to have representations of them, and AIs’ preferences to easily fall into that basin.
2. “Human values” in the sense of “what humans value” — i.e., if you’re interacting with the human civilization, the process of understanding that civilization and breaking its model into abstractions will likely involve computing a representation for “whatever humans mean when they say ‘human values’”.
To draw an analogy, suppose we have an object with some Shape X. If “X = a sphere”, we can indeed expect most civilizations to have a concept of it. But if “X = the shape of a human”, most aliens would never happen to think about that specific shape on their own. However, any alien/AI that’s interacting with the human civilization surely would end up storing a mental shorthand for that shape.
I think (1) is false and (2) is… probably mostly true in the ways that matter. Humans don’t have hard-coded utility functions, and human minds are very messy, so there may be several valid ways to answer the question “what does this human value?”. Worse yet, every individual human’s preferences, if considered in detail, are unique, so even once you decide what you mean by “a given human’s values”, there are likely different valid ways of agglomerating them. But hopefully the human usage of those terms isn’t too inconsistent, and there’s a distinct “correct according to humans” way of thinking about human values. Or at least a short list of such ways.
(1) being false would be bad for proposals of the form “figure out value formation and set up the training loop just so in order to, e.g., generate an altruism shard inside the AI”. But I think (2) being even broadly true would suffice for retarget-the-search–style proposals.