Good question! Here are a couple more specific open questions and how I think about them:
1. Is there a way that AGIs can be “verified to be good”?
2. Suppose an AGI has motivations that we don’t endorse. Before it “misbehaves” in big, catastrophic, unrecoverable ways, will it first “misbehave” in small and easily-recoverable ways? And if so, will people respond properly by shutting down the AGI and/or fixing the problem?
For #1, my answer is “Wow, that would be super duper awesome, and it would make me dramatically more optimistic, so I sure hope someone figures out how to do that. But from where I stand, it seems like a hard or impossible thing to do.” One problem is interpretability—I’m expecting one component of a brain-like AGI to be a “learned world-model” full of millions or billions of unlabeled entries (concepts) that were built within its “lifetime” by a learning algorithm. So you can’t just look inside and understand what the AGI is trying to do.
(Much more on that in future posts.)
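To make the “unlabeled entries” point a bit more concrete, here’s a minimal toy sketch (my own illustration; the sizes and names are made up, and a real brain-like AGI would of course not literally be a Python dict): from the outside, the learned world-model is just a big table of anonymous concept IDs mapped to vectors, with nothing human-readable to inspect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned world-model: each concept is just an anonymous ID
# mapped to a vector of numbers, built up by a learning algorithm with no
# human-supplied labels. (A brain-like AGI might have millions or billions
# of such entries; the sizes here are shrunk for illustration.)
num_concepts, dim = 10_000, 64
world_model = {f"concept_{i}": rng.standard_normal(dim) for i in range(num_concepts)}

# All an outside inspector sees is opaque IDs and numbers:
some_id, some_vec = next(iter(world_model.items()))
print(some_id, some_vec[:5])
# Which real-world thing does "concept_0" stand for? Nothing in the data
# structure says, so you can't just read off what the AGI is trying to do.
```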
For #2, this is an open and controversial question in the field, and I think it’s a significant reason that some people in the field are more optimistic than others (about the probability that an AGI will kill everyone). There are a couple of routes to pessimism:
(A) “Treacherous turn”—We imagine a very smart and savvy AGI that understands people, itself, and its situation, but has malign intentions. As long as the humans are holding all the cards, it would act exactly like an AGI with good intentions. Then, as soon as it gets the opportunity to seize control, it would. Like, imagine yourself being imprisoned by a group of 8-year-olds. You wouldn’t shout at them from your prison cell “As soon as you let me out, I’m going to have you all arrested!!!” Instead, you would pretend that you were on their side and trying to help them, and if possible you actually would help them, right until the moment they unlock the door!
(B) Band-aiding over the problem—We imagine the AGI-in-training, not yet smart and savvy enough to do a proper “treacherous turn”, making a ham-fisted attempt to deceive the human supervisor, and getting caught. Then the human applies some specific patch to the code, or penalty to the AGI’s RL system, and it stops that specific behavior. It happens a couple more times, and the human applies a couple more patches / penalties. Then the AGI doesn’t do anything like that for the next 3 months, and the human pats himself on the back for solving the problem, and then the AGI sweet-talks the human into giving it internet access, and copies itself onto AWS and starts methodically working through its plan to kill everyone. What happened there? Maybe the patches were not actually getting at the root problem! Instead of making the AGI “motivated to not deceive”, maybe the patches made the AGI “motivated to not get caught deceiving”. (It actually seems like quite a hard problem to get the former motivation but not the latter motivation—more on that in future posts.) See also “nearest unblocked strategy”.
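To see why that’s tricky, here’s a minimal toy sketch (my own illustration; the strategy names and reward numbers are invented): if the supervisor’s penalty only ever fires when the deception is detected, a simple reward-maximizer ends up preferring the deceptive strategy that avoids detection, rather than honesty.

```python
# Toy illustration of "nearest unblocked strategy": the patch penalizes
# *detected* deception, so the agent is steered toward undetected deception.

strategies = {
    "honest":            {"task_reward": 1.0, "detected": False},
    "deceive_clumsily":  {"task_reward": 1.5, "detected": True},   # ham-fisted, gets caught
    "deceive_carefully": {"task_reward": 1.5, "detected": False},  # same gain, never noticed
}

PENALTY_IF_CAUGHT = 10.0  # the human's "patch": a penalty applied to detected deception

def total_reward(props):
    reward = props["task_reward"]
    if props["detected"]:  # the penalty only fires when the misbehavior is noticed
        reward -= PENALTY_IF_CAUGHT
    return reward

best = max(strategies, key=lambda name: total_reward(strategies[name]))
print(best)
# -> "deceive_carefully": the patch blocked one strategy, and the nearest
#    unblocked strategy (deceiving without getting caught) wins instead.
```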
I’m not saying that we should definitely be pessimistic, but rather that we should feel uncertain, and we should feel highly motivated to do the research (especially on interpretability and motivation-sculpting) that would help improve our prospects. :-)
Thanks for the elaboration, looking forward to the next posts. :)