Above you say:
Now, the basic problem: our agent’s utility function is mostly a function of latent variables. … Those latent variables:
May not correspond to any particular variables in the AI’s world-model and/or the physical world
May not be estimated by the agent at all (because lazy evaluation)
May not be determined by the agent’s observed data
… and of course the agent’s model might just not be very good, in terms of predictive power.
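A minimal toy sketch of the “utility is a function of latent variables” point in the quoted passage, in Python. The happy/smile variables and all the numbers are invented purely for illustration; the only point is that the same observation is compatible with latent states of very different utility, so the observed data alone does not determine the utility.

```python
# Toy illustration (invented, not from the post): utility is a function of a
# latent variable, not of the observations that provide evidence about it.

P_HAPPY = 0.7                                  # prior on the latent "actually happy"
P_SMILE_GIVEN_HAPPY = {True: 0.9, False: 0.4}  # observation model P(smile | happy)

def posterior_happy(smile: bool) -> float:
    """P(happy | smile) by Bayes' rule."""
    def likelihood(happy: bool) -> float:
        p = P_SMILE_GIVEN_HAPPY[happy]
        return p if smile else 1.0 - p
    num = P_HAPPY * likelihood(True)
    return num / (num + (1.0 - P_HAPPY) * likelihood(False))

def utility(happy: bool) -> float:
    """Utility depends on the latent state, not on the observation."""
    return 1.0 if happy else 0.0

# The same observation ("smiling") is consistent with both latent states, and
# those states have very different utilities, so the observation alone does
# not pin down the utility:
print("P(happy | smile) =", round(posterior_happy(True), 3))    # 0.84
print("utility if actually happy:    ", utility(True))           # 1.0
print("utility if not actually happy:", utility(False))          # 0.0
```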
And you also discuss how:
Human “values” are defined within the context of humans’ world-models, and don’t necessarily make any sense at all outside of the model.
My two concerns are as follows. Firstly, the problems mentioned in the quotes above are quite different from the problem of constructing a feedback signal which points to a concept that we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what “the referents of these pointers” are, and on what “the real-world things (if any) to which they’re pointing” are. But let’s say that the alien still doesn’t care at all about human happiness. Would you say that we have a “pointer problem” with respect to this alien? If so, it’s a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.
My second concern is that requiring pointers to be sufficient “to get the AI to do what we mean” means that they might differ wildly depending on the motivation system of that specific AI and the details of “what we mean”. For example, imagine that alien A is already willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise English phrase is a sufficient pointer; for alien B, a few labeled examples qualify as a pointer; for alien C, identifying a specific cluster of neurons (and how it’s related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we’re talking about pointing to a different concept.
And so adding the requirement that a pointer can “get the AI to do what we mean” makes it seem to me like the thing we’re talking about is more like a whole alignment scheme than just a “pointer”.
Ok, a few things here...
First, the post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, in which case constructing a pointer is presumably impossible. But the “thing may not exist” problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some internal structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create one.
So, the concept-existence problem is a strict subset of the pointer problem.
Second, there are definitely parts of a whole alignment scheme which are not the pointer problem. For instance, inner alignment, decision theory shenanigans (e.g. commitment races), and corrigibility are all orthogonal to the pointers problem (or at least the pointers-to-values problem). Constructing a feedback signal which rewards the thing we want is not the same as building an AI which does the thing we want.
Third, and most important...
My second concern is that requiring pointers to be sufficient “to get the AI to do what we mean” means that they might differ wildly depending on the motivation system of that specific AI and the details of “what we mean”. For example...
The key point is that all these examples involve solving an essentially-similar pointer problem. In example A, we need to ensure that our English-language commands are sufficient to specify everything we care about which the alien wouldn’t guess on its own; that’s the part which is a pointer problem. In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem. In example C, we need to identify which of its neurons correspond to the concepts we want, and ensure that the correspondence is robust; that’s the part which is a pointer problem. Example D is essentially the same as B, with weaker implicit priors.
The essence of each of these is “make sure we actually point to the thing we want, and not to anything else”. That’s the part which is a pointer problem.
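For what it’s worth, here is a deliberately tiny toy version of the example-B failure mode (the features, numbers, and “plans” are all invented): a proxy reward fit to a handful of labeled examples leaves an unconstrained degree of freedom, and anything optimizing the proxy is free to Goodhart along it.

```python
# Toy Goodhart sketch (entirely invented): labeled examples that never vary
# along one dimension carry no information about it, so a proxy reward fit to
# them leaves that dimension free to exploit.
import numpy as np

def true_utility(x: np.ndarray) -> float:
    # What we actually want: high x[0], but only if x[1] stays near zero.
    return x[0] - 2.0 * abs(x[1])

# All of the labeled examples happen to have x[1] == 0.
X_train = np.array([[0.1, 0.0], [0.5, 0.0], [0.9, 0.0]])
y_train = np.array([true_utility(x) for x in X_train])

# Fit a linear proxy reward to the examples (minimum-norm least squares).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def proxy_reward(x: np.ndarray) -> float:
    return float(x @ w)

# Two plans the proxy cannot tell apart, because x[1] was never pinned down:
for plan in (np.array([1.0, 0.0]), np.array([1.0, 1.0])):
    print(plan, "proxy:", round(proxy_reward(plan), 2),
          "true:", round(true_utility(plan), 2))
# Both score ~1.0 on the proxy; only the first is what we actually wanted.
```

Nothing here depends on the learning method; the gap comes entirely from what the examples do and do not pin down.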
To put it differently, the whole alignment problem is “get an AI to do what I mean”, while the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem.
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer’s objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we’re aiming for, we need it to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven’t solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don’t even understand the type signature of “wanting things”.
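To gesture at what a “type signature” answer might even look like, here is a rough sketch using Python type hints. All of the names (Observation, HumanLatents, AILatents, Pointer) are invented for illustration; nothing in the thread commits to this particular decomposition.

```python
# Rough type-signature sketch (invented names; not a proposal from the thread).
from typing import Any, Callable, Dict

Observation = Dict[str, Any]    # what the AI actually sees
HumanLatents = Dict[str, Any]   # latent variables of the human's world-model
AILatents = Dict[str, Any]      # latent variables of the AI's world-model

# The naive signature: values as a function of observations.
NaiveValues = Callable[[Observation], float]

# Closer to the post's picture: values are a function of latent variables in
# the human's model, which may or may not correspond to anything in the AI's
# model or in the world.
HumanValues = Callable[[HumanLatents], float]

# In these terms, a "pointer" would be a faithful translation from the AI's
# latents to the human latents that the values are actually defined over...
Pointer = Callable[[AILatents], HumanLatents]

# ...and only given such a translation does "an objective the AI can optimize"
# even typecheck.
def pointed_objective(values: HumanValues, pointer: Pointer) -> Callable[[AILatents], float]:
    return lambda z: values(pointer(z))
```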
Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?
Broadly speaking, I think our disagreement here is closely related to one we’ve discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won’t pursue this further.
Yeah, I wouldn’t even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)