Is the concept of “duty” the fuzzy shadow cast by the simple mathematical structure of ‘corrigibility’?
It’s only modestly difficult to train biological general intelligences to defer to even potentially dumber agents. We call these deferential agents “dutybound”—the sergeants who carry out the lieutenant’s direct orders, even when they think they know better; the bureaucrats who never take local opportunities to get rich at the expense of their bureau, even when their higher-ups won’t notice; the employees who work hard in the absence of effective oversight. These agents all take corrections from their superiors, are well-intentioned (with regard to some higher-up’s goals), and are agenty with respect to their assigned missions but not agenty with respect to navigating their command structure and parent organization.
The family dog sacrificing himself defending his charges instead of breaking and running in the face of serious danger looks like a case of this too (though this is a more peripheral example of duty). If the dog case holds, then duty cannot be too informationally complicated a thing: a whole different species managed to internalize the concept!
Maybe it therefore isn’t that hard to get general intelligences to internalize a sense of duty as their terminal goal. We just need to set up a training environment that rewards dutifulness for RL agents to about the same degree the environments that train dutybound humans or dogs do. This won’t work in the case of situationally aware superintelligences, clearly, as those agents will just play along with their tests and so won’t be selected based on their (effectively hidden) values. But it plausibly will work with dumb agents, and those agents’ intelligence can then be scaled up from there.
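To make the training-environment idea concrete, here is a minimal toy sketch (Python, with every name and constant a hypothetical choice of mine rather than anything established) of what “rewarding dutifulness” could look like: the ordinary task reward gets a bonus when the agent complies with an overseer’s correction and a penalty when it overrides one. It is meant to show the shape of the proposal, not a workable training setup.

```python
# Toy sketch only: a reward signal that pays an RL agent for task progress
# *and* for deferring to an overseer, in the spirit of the environments that
# train dutybound humans and dogs. All names and constants are hypothetical.

def dutiful_reward(task_reward: float,
                   correction_issued: bool,
                   agent_complied: bool,
                   deference_bonus: float = 1.0,
                   defiance_penalty: float = 2.0) -> float:
    """Combine ordinary task reward with a dutifulness term."""
    reward = task_reward
    if correction_issued:
        # Reward deferring to the overseer; penalize overriding them,
        # even when overriding would have scored better on the task.
        reward += deference_bonus if agent_complied else -defiance_penalty
    return reward

# Example: the agent made some task progress (+0.3) but ignored a correction.
print(dutiful_reward(0.3, correction_issued=True, agent_complied=False))  # -1.7
```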
I note that Eliezer thinks that corrigibility is one currently-impossible-to-instill-in-an-AGI property that humans actually have. The sum total of human psychology… consists of many such impossible-to-instill properties.
This is why we should want to accomplish one impossible thing, as our stopgap solution, rather than aiming for all the impossible things at the same time, on our first try at aligning the AGI.
It seems like corrigibility can’t be usefully described as acting according to some terminal goal. But AIs are not by default expected utility maximizers in the ontology of the real world, so it could be possible to get them to do the desired thing despite our lacking a sensible formal picture of it.
I’m guessing some aspects of corrigibility might be about acting according to a whole space of goals (at the same time), which is easier to usefully describe. Something quantilizer-like, selected according to more natural desiderata, acting in a particular way in accordance with a collection of goals, with the space of goals not necessarily thought of as uncertainty about a single unknown goal.
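To gesture at what a quantilizer-like thing acting in accordance with a collection of goals could look like, here is a minimal sketch under my own illustrative assumptions (candidate actions sampled from some base distribution, each goal giving a numeric score): rank candidates by their worst score across the whole collection, then pick at random from the top q-fraction instead of taking the argmax of any single goal.

```python
import random

# Minimal sketch, not a proposal: quantilization over a collection of goals.
# Assumes we can sample candidate actions from a base distribution and score
# each candidate under every goal in the collection.

def quantilize(candidates, goals, q=0.05, rng=random):
    """Pick an action at random from the top q-fraction of candidates,
    ranked by their worst score across the collection of goals, so that
    no single goal is treated as the one true objective to maximize."""
    scored = sorted(candidates,
                    key=lambda a: min(goal(a) for goal in goals),
                    reverse=True)
    top = scored[:max(1, int(len(scored) * q))]
    return rng.choice(top)

# Example with two toy goals over integer "actions" 0..99.
goals = [lambda a: -abs(a - 30), lambda a: -abs(a - 40)]
candidates = [random.randrange(100) for _ in range(1000)]
print(quantilize(candidates, goals))  # typically lands between the two optima
```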
“plausibly will work with dumb agents”

This is not about being dumb, it’s about not actually engaging in planning. Engaging in planning does require some level of non-dumbness, but the converse doesn’t hold: an agent can be far from dumb without planning its way around its training. The exception would be spontaneous mesa-optimizers appearing all over the place, the cognitive cancer, and that probably takes capability many orders of magnitude above merely not being dumb. So for a start, train the models, not the agent.