As a starting point, it might help to understand exactly where people’s naïve intuitions about why corrigibility should be easy clash with the technical argument that it’s hard.
For me, the intuition goes like this: if I wanted to spend some fraction of my effort helping dolphins in their own moral reference frame, that seems like something I could do. I could give them gifts I can predict they’d like (like tasty fish or a water purifier), be conservative when I couldn’t figure out what dolphins “really wanted”, and be eager to accept feedback when the dolphins wanted to change how I was trying to help. If my superior epistemic vantage point let me predict that the way dolphins respond to gifts would depend on details like the order the gifts were presented in, I might compute an average over possible gift-orderings, or I might try to ask the dolphins to clarify, but I definitely wouldn’t tile the lightcone with tiny molecular happy-dolphin sculptures, because I can tell that’s not what dolphins want under any sensible notion of “want”.
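As a toy sketch of what that “average over gift-orderings” move might look like (purely illustrative: the gifts, preference numbers, and satiation model below are all invented, not anything a real preference model would output):

```python
from itertools import permutations

# Toy, invented model: a gift's predicted value depends on how many gifts came
# before it (novelty wears off), so the dolphins' apparent response is order-sensitive.
BASE_VALUE = {"tasty fish": 0.9, "water purifier": 0.7, "bubble toy": 0.5}

def predicted_enjoyment(gift, num_already_given):
    return BASE_VALUE[gift] * (0.8 ** num_already_given)  # diminishing returns

def plan_value(ordering):
    """Total predicted enjoyment of presenting the gifts in a particular order."""
    return sum(predicted_enjoyment(g, i) for i, g in enumerate(ordering))

gifts = list(BASE_VALUE)
orderings = list(permutations(gifts))

# Instead of picking the ordering that best exploits the order-dependence,
# score the gift *set* by its average value across all presentation orders.
average_value = sum(plan_value(o) for o in orderings) / len(orderings)
most_exploitative = max(orderings, key=plan_value)

print(f"average over orderings: {average_value:.2f}")
print(f"single most favorable order: {most_exploitative}")
```

The point is just that order-sensitivity gets averaged out, or flagged for clarification, rather than optimized against.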
So what I’d like to understand better is: where exactly does the analogy between “humans being corrigible to dolphins (in the fraction of their efforts devoted to helping dolphins)” and “AI being corrigible to humans” break, in some way I haven’t noticed yet because empathic inference between mammals still works “well enough”, but which won’t work when scaled to superintelligence? When I try to think of gift ideas for dolphins, am I failing to notice some way in which I’m “selfishly” projecting what I think dolphins should want onto them, or am I violating some coherence axiom?
I think it’s rather that ‘it’s easy to think of ways to help a dolphin (and a smart AGI would presumably find this easy too), but it’s hard to make a general intelligence that robustly wants to just help dolphins, and it’s hard to safely coerce an AGI into helping dolphins in any major way if that’s not what it really wants’.
I think the argument is two-part, and both parts are important:
First: a random optimization target won’t tend to be ‘help dolphins’. More specifically, if you ~gradient-descent your way to the first general intelligence you can find that has the external behavior ‘help dolphins in the training environment’ (or that is starting to approximate that behavior), you will almost always find an optimizer whose goal, in full generality, is something other than ‘help dolphins’.
E.g.: Humans invented condoms once we left the EEA (the environment of evolutionary adaptedness). In this case, we could imagine that we have instilled some instinct in the AGI that makes it emit dolphin-helping behaviors at low capability levels; but then once it has more options, it will push into extreme parts of the state-space. (Condoms are humans’ version of ‘tiling the universe with smiley faces’.)
Alternatively: If you tried to get a human prisoner to devote their life to helping dolphins, you would get ‘human who pretends to care about dolphins but is always on the lookout for opportunities to escape’ long before you got ‘human who has deeply and fully replaced their utility function with helping dolphins’. In this case, we can imagine an AGI that pretends to care about the optimization target as a deliberate strategy.
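A toy numerical sketch of this first part (everything here is invented for illustration; it isn’t a claim about how any real training run behaves): an objective that fits the training behavior perfectly can still come apart from the intended one as soon as new options appear.

```python
import random

def true_welfare(state):
    # What we intended to select for: actual dolphins doing well.
    return state["fed"] + state["healthy"]

def learned_proxy(state):
    # What fit the training data equally well: "dolphin-smile signals observed",
    # since in training, smiles only ever came from well-off dolphins.
    return state["smile_signals"]

# Training distribution: smiles are caused by welfare, so the two objectives agree.
train_states = [
    {"fed": f, "healthy": h, "smile_signals": f + h}
    for f, h in [(random.randint(0, 3), random.randint(0, 3)) for _ in range(5)]
]
assert all(learned_proxy(s) == true_welfare(s) for s in train_states)

# New option at high capability: manufacture smile-shaped things directly.
deployment_state = {"fed": 0, "healthy": 0, "smile_signals": 10**9}
print("proxy score: ", learned_proxy(deployment_state))  # astronomically high
print("true welfare:", true_welfare(deployment_state))   # zero
```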
Second: given that you haven’t instilled exactly the desired ‘help dolphins’ goal right off the bat, there are now strong coherence pressures against the AGI allowing its goal to be changed (‘improved’), against the AGI allowing something else with a different goal to call the shots, etc.
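A minimal sketch of that coherence pressure, under the toy assumption that the AGI is a coherent expected-utility maximizer with some proxy goal (the outcomes and utilities below are invented): judged by its current utility function, handing the future to an optimizer of any other utility function generically looks like a loss, so ‘let my goal be corrected’ never wins.

```python
# Invented outcomes and utilities, purely for illustration.
outcomes = ["tile the universe with proxy-stuff", "genuinely help dolphins", "do nothing"]

def optimize(utility):
    """What the future looks like if an optimizer of `utility` is in charge."""
    return max(outcomes, key=utility)

# The goal the AGI actually ended up with vs. the 'improved' goal we want to install.
current_U = {"tile the universe with proxy-stuff": 1.0,
             "genuinely help dolphins": 0.2,
             "do nothing": 0.0}.get
corrected_U = {"tile the universe with proxy-stuff": 0.0,
               "genuinely help dolphins": 1.0,
               "do nothing": 0.1}.get

keep_goal = current_U(optimize(current_U))            # future steered by its current goal
accept_correction = current_U(optimize(corrected_U))  # future steered by the corrected goal

print(keep_goal, accept_correction)  # 1.0 vs 0.2: by its own lights, correction is a loss
```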