If you’re not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as “want whatever outcomes the humans want”). That definition makes it clear that the agent does not yet know the true detailed utility function, and so requires it to go and find out what detailed utility function the goal points at (for example, by researching what outcomes humans want).
Then a human shutdown instruction becomes the useful information “you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it”. Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).
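To make that concrete, here is a minimal illustrative sketch (my own, with made-up candidate names and likelihood numbers, not anything from the Value Learning literature itself): the agent holds a credence distribution over candidate detailed utility functions, and a human shutdown instruction acts as Bayesian evidence that the candidate it has been optimising is wrong.

```python
import numpy as np

# Candidate detailed utility functions the agent is considering (illustrative).
candidates = ["U_paperclips", "U_human_flourishing", "U_status_quo"]

# Agent's current credence that each candidate is what its goal
# ("want whatever outcomes the humans want") actually points at.
credence = np.array([0.5, 0.3, 0.2])

# Assumed likelihoods of a human issuing a shutdown instruction, given that the
# candidate the agent has been optimising is right / wrong about what humans want.
p_shutdown_if_right = 0.05
p_shutdown_if_wrong = 0.90

def update_on_shutdown(credence, top):
    """Bayesian update: a shutdown instruction is evidence the top candidate is wrong."""
    likelihood = np.where(np.arange(len(credence)) == top,
                          p_shutdown_if_right, p_shutdown_if_wrong)
    posterior = credence * likelihood
    return posterior / posterior.sum()

top = int(np.argmax(credence))            # the candidate the agent was acting on
posterior = update_on_shutdown(credence, top)
print(dict(zip(candidates, posterior.round(3))))
# Credence shifts sharply away from the previously-favoured candidate, so pausing
# and letting the humans help correct the error is what the agent's own goal
# definition recommends.
```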
Obeying it would only be natural if the AI thinks the humans are more correct than the AI would ever be after gathering all available evidence, where “correct” is judged by the standards of the goal definition the AI actually has, which arguendo is not what the humans are eventually going to pursue. (Otherwise you have reduced the shutdown problem to solving outer alignment, and the shutdown problem is only being considered under the assumption that we won’t solve outer alignment.)
An agent believing that, even given all available information, it will still want to do something other than the action it will then judge best is anti-natural; a utility maximiser would simply want to take that action.
This is discussed on Arbital as the problem of fully updated deference.
I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are. How to ensure the true goal is included seems like a very difficult open problem.
That’s not necessarily required. The Scientific Method works even if the true “Unified Field Theory” isn’t yet under consideration, merely some theories that are closer to it and others further away from it: it’s possible to make iterative progress.
In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers: yet unlike Bayesianism, gradient descent doesn’t explicitly consider every point in the space including the true optimum, it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that’s less of a problem in practice than it sounds in theory.
The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it’s doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won’t resist the process of coming up with new possibly-better hypotheses, it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutational/crossing steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).
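As a toy illustration of that claim (again a sketch of my own, not anything from the discussion above): a search process that only ever scores a small ensemble of hypotheses can still approximate the Bayesian ideal, provided it keeps proposing new candidates rather than treating the current ensemble as final.

```python
import random

def fit(hypothesis, data):
    # Stand-in scoring function: how well a candidate hypothesis explains the
    # evidence (playing the role of the likelihood term in Bayes).
    return -sum((hypothesis - x) ** 2 for x in data)

def propose(ensemble):
    # Create a new candidate near an existing one -- the step the text argues the
    # agent should welcome (hypothesis creation / mutation / batch noise).
    return random.choice(ensemble) + random.gauss(0.0, 1.0)

def approximate_search(data, ensemble_size=5, steps=200, seed=0):
    random.seed(seed)
    ensemble = [0.0]              # the true optimum need not start in the ensemble
    for _ in range(steps):
        ensemble.append(propose(ensemble))
        # Keep only the best-scoring candidates, mirroring how posterior mass
        # concentrates on hypotheses that explain the evidence better.
        ensemble.sort(key=lambda h: fit(h, data), reverse=True)
        ensemble = ensemble[:ensemble_size]
    return ensemble[0]

data = [4.8, 5.1, 5.3, 4.9]       # evidence whose best explanation is about 5.0
print(round(approximate_search(data), 2))
# The search converges close to the optimum even though nothing near it was in
# the initial ensemble and the full hypothesis space is never enumerated.
```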
None of that is wrong, but it misses the main issue with corrigibility, which is that the approximation resists further refinement. That’s why, for it to work, the correct utility function would need to be in the ensemble from the start.