Perhaps “do what we say” is more like “know when the outside view says you’ve converged to the wrong value function, so we’re probably right and you should listen to us”.
It’s somewhat more subtle than that. The ideal (and maybe impossible) corrigible AI should protect us even if we accidentally give the AI the wrong process for figuring out what to value. It should protect us even if the AI becomes omniscient.
If the AI knows vastly more than we do, there’s no sense in which we are providing extra evidence or an information-carrying “outside view”. We are instead just registering a sort of complaint and hoping we’ve programmed the AI to listen.
I’m still not convinced that this sort of corrigibility is in any way distinct from just adding some extra complications to the process we give the AI for figuring out what to value.
The outside view I had in mind wasn’t with respect to its knowledge, but to empirical data on how often its exact value-learning algorithm converges to the correct set of preferences for agents-like-us. That feels different.
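To make that concrete, here is a minimal, purely illustrative sketch (not something spelled out in this exchange, and all numbers are assumed): treat a human objection as evidence about whether the value-learning algorithm converged correctly, with the algorithm’s empirical track record as the prior.

```python
# Purely illustrative sketch: every probability below is assumed, not measured.
# The point is that the AI defers not because humans know more, but because the
# empirical track record of its own value-learning algorithm, combined with a
# human objection, makes "I converged to the wrong value function" the likely
# explanation.

def p_correct_given_objection(base_rate_correct: float,
                              p_objection_if_correct: float,
                              p_objection_if_incorrect: float) -> float:
    """P(converged correctly | humans object), by Bayes' rule."""
    correct_and_object = base_rate_correct * p_objection_if_correct
    incorrect_and_object = (1 - base_rate_correct) * p_objection_if_incorrect
    return correct_and_object / (correct_and_object + incorrect_and_object)

# Assumed track record: the algorithm converges to the right preferences for
# agents-like-us 80% of the time, and humans object far more often when it
# hasn't than when it has.
posterior = p_correct_given_objection(base_rate_correct=0.80,
                                      p_objection_if_correct=0.05,
                                      p_objection_if_incorrect=0.70)
print(f"P(correct | objection) = {posterior:.2f}")  # ~0.22, so defer to the humans
```

On numbers like these, the objection alone drags the posterior well below one half, which is the sense in which the outside view, rather than any extra human knowledge, is doing the work.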
I assume you’re talking about the particular “do what we say” subsystem described in the second last paragraph? If so, that seems plausibly right.