When I try to think of gift ideas for dolphins, am I failing to notice some way in which I’m “selfishly” projecting what I think dolphins should want onto them, or am I violating some coherence axiom?
I think it’s rather that ‘it’s easy to think of ways to help a dolphin (and a smart AGI would presumably find this easy too), but it’s hard to make a general intelligence that robustly wants to just help dolphins, and it’s hard to safely coerce an AGI into helping dolphins in any major way if that’s not what it really wants’.
I think the argument is two-part, and both parts are important:
A random optimization target won’t tend to be ‘help dolphins’. More specifically, if you ~gradient-descent your way to the first general intelligence you can find that has the external behavior ‘help dolphins in the training environment’ (or that is starting to approximate that behavior), you will almost always find an optimizer whose goal in general is something other than ‘help dolphins’.
E.g.: Humans invented condoms once we left the EEA (the environment of evolutionary adaptedness). In this case, we could imagine that we have instilled some instinct in the AGI that makes it emit dolphin-helping behaviors at low capability levels; but then once it has more options, it will push into extreme parts of the state-space. (Condoms are humans’ version of ‘tiling the universe with smiley faces’; a toy sketch of this failure mode appears below.)
Alternatively: If you tried to get a human prisoner to devote their life to helping dolphins, you would get ‘a human who pretends to care about dolphins but is always on the lookout for opportunities to escape’ long before you got ‘a human who has deeply and fully replaced their utility function with helping dolphins’. In this case, we can imagine an AGI that pretends to care about the optimization target as a deliberate strategy.
Given that you haven’t instilled exactly the desired ‘help dolphins’ goal right off the bat, there are now strong coherence pressures against the AGI allowing its goal to be changed (‘improved’), against the AGI allowing something else with a different goal to call the shots, etc.
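Here’s a minimal, hypothetical sketch of part 1 in code (not from the original argument; every feature name and number is invented for illustration). It fits a linear ‘reward model’ in a training environment where an observable proxy (‘smile signal’) always coincides with the true goal (‘dolphin actually helped’), then scores deployment states where the two come apart:

```python
# Toy illustration of goal misgeneralization (illustrative only; all
# feature names and numbers are made up for this sketch).
import numpy as np

rng = np.random.default_rng(0)

# Observable features: (smile_signal, fish_delivered).
# The true objective -- "dolphin actually helped" -- is not directly observed;
# in the training environment it happens to coincide with smile_signal.
n = 300
smile = rng.integers(0, 2, size=n).astype(float)
fish = rng.integers(0, 2, size=n).astype(float)
X_train = np.column_stack([smile, fish])
y_train = smile.copy()  # reward signal: 'helped' == 'smiling' in training

# Stand-in for gradient descent: least-squares fit of a linear scorer.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print("learned weights (smile, fish):", np.round(w, 2))  # ~ (1.0, 0.0)

# Deployment: a more capable agent can reach states where the proxy and the
# true goal come apart, including extreme corners of the state space.
candidates = {
    "genuinely help one dolphin":      (np.array([1.0, 1.0]), 1.0),
    "animatronic smiling dolphin":     (np.array([1.0, 0.0]), 0.0),
    "tile the tank with smiley faces": (np.array([10.0, 0.0]), 0.0),
}
for name, (features, true_value) in candidates.items():
    print(f"{name:33s} model score = {features @ w:5.2f}   true value = {true_value}")
```

The point of the toy: nothing in the training signal distinguishes ‘actually helped a dolphin’ from ‘produced the appearance of a helped dolphin’, so the learned scorer rates the extreme proxy state highest, and scaling up optimization power against it buys you smiley faces rather than helped dolphins by default.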