In example B, we need to ensure that our examples actually reward the thing we want, along all relevant dimensions, and do not allow any degrees of freedom to Goodhart; that’s the part which is a pointer problem.
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
the pointer problem part is roughly “specify what I mean well enough that I could use the specification to get an AI to do what I mean”, assuming problems like “get AI to follow specification” can be solved.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
Why do you think we need to do this? Do you think the human reward system does that, in order to successfully imbue a person with their values?
In the context of the discussion with Richard, I was assuming the general model in which we want an inner optimizer’s objective to match an outer optimization objective. We can of course drop that assumption (as you usually do), but then we still need to know what objective/values we want to imbue in the final system. And whatever final objective/values we’re aiming for, they need to actually match what we want along all the relevant dimensions, and not allow any degrees of freedom to Goodhart; that would be the corresponding problem for the sort of approach you think about.
On the view of this post, is it that we would get a really good “evaluation module” for the AI to use, and the “get AI to follow specification” corresponds to “make AI want to generate plans evaluated highly by that procedure”? Or something else?
No, I am not assuming anything that specific. The pointers problem is not meant to be a problem with one particular class of approaches to constructing aligned AI; it is meant to be a foundational problem in saying what-we-want. Insofar as we haven’t solved the pointers problem, we have not yet understood the type signature of our own values; not only do we not know what we want, we don’t even understand the type signature of “wanting things”.