Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer’s objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we’re aiming for, they need to actually match what we want along all the relevant dimensions, leaving no degrees of freedom for Goodharting; that would be the corresponding problem for the sort of approach you think about.
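To spell that out a bit (a rough formalization in my own notation, not anything Richard or I wrote down): if $u^*$ is what we actually want and $\hat{u}$ is the objective the final system ends up pursuing, then strong optimization of $\hat{u}$ only goes well if the points where $\hat{u}$ is maximal are also points where $u^*$ is high,
$$\arg\max_x \hat{u}(x) \subseteq \{x : u^*(x) \text{ high}\},$$
and any dimension along which $\hat{u}$ is indifferent but $u^*$ is not is exactly a degree of freedom that optimization pressure will Goodhart.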
On the view of this post, is the idea that we would get a really good “evaluation module” for the AI to use, and that “get the AI to follow the specification” corresponds to “make the AI want to generate plans which that procedure evaluates highly”? Or something else?
No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven’t solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don’t even understand the type signature of “wanting things”.
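To gesture at what I mean by “type signature” (an illustrative sketch only, with placeholder names, not a claim about the right formalism): a naive picture types values as a function on world-states which some evaluation procedure could score directly,
$$u_{\text{naive}} : \text{WorldState} \to \mathbb{R},$$
whereas the pointers problem says human values seem to take as input latent variables in a human’s world model, which need not correspond to any well-defined feature of the environment:
$$u_{\text{human}} : \text{LatentVariables}(\text{human world model}) \to \mathbb{R}.$$
Until we can say what those latent variables are and how they “point” to things in the world, we don’t know which of these shapes, if either, is even the right one.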