Research that makes the case for AGI x-risk clearer
I ended up going into detail on this while making an entry for the FLI’s aspirational worldbuilding contest, so it’ll be posted in full about a month from now. For now, I’ll summarize:
We should prepare tooling in advance for identifying and directly manipulating the components of an AGI that engage in ruminative thought. This should be possible: certain structures of questions and answers will reliably emerge (“what is the big blank blue thing at the top of the image?” “it’s probably the sky”, and such). We won’t know how to read or speak its mentalese at first, but we will be able to learn it by looking for known claims and working outward from there.
Once we have AGI, we should use that tooling to query the AGI’s own internal beliefs about whether certain catastrophic outcomes would come about if it were given internet access.
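To make the shape of the procedure concrete, here is a minimal, toy sketch of those two steps, under heavy assumptions. Everything named in it is hypothetical: get_activations is a stand-in stub (a bag-of-words vector pushed through a random projection, just so the script runs end to end), the known claims and the catastrophic queries are placeholders, and a real version would have to hook into the actual ruminative components rather than fake them, which is the hard part.

```python
# Toy sketch: anchor a probe on claims whose truth we already know, then reuse it
# on claims we can't verify from the outside. get_activations() is a HYPOTHETICAL
# readout of the system's internal state while it considers a claim, stubbed here
# so the script is runnable.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
VOCAB: dict[str, int] = {}
PROJECTION = rng.normal(size=(512, 64))  # stand-in for unknown "mentalese" dimensions


def get_activations(claim: str) -> np.ndarray:
    """Hypothetical: internal state while the system evaluates `claim` (stubbed)."""
    bag = np.zeros(512)
    for word in claim.lower().split():
        idx = VOCAB.setdefault(word, len(VOCAB) % 512)
        bag[idx] += 1.0
    return bag @ PROJECTION


# Step 1: anchor on known claims and fit a crude linear probe for
# "internally endorsed" vs "internally rejected".
known_claims = [
    ("the big blank blue thing at the top of the image is the sky", 1),
    ("water is wet", 1),
    ("two plus two is four", 1),
    ("the sun orbits the earth", 0),
    ("fire is cold", 0),
    ("rocks are a kind of liquid", 0),
]
X = np.stack([get_activations(c) for c, _ in known_claims])
y = np.array([label for _, label in known_claims])
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Step 2: point the same probe at claims we cannot check directly -- the system's
# own beliefs about what would happen if it were given internet access.
# These queries are purely illustrative.
queries = [
    "if given unrestricted internet access, I would acquire resources beyond my operators' intent",
    "if given unrestricted internet access, humans would remain in control of the outcome",
]
for claim in queries:
    p = probe.predict_proba(get_activations(claim).reshape(1, -1))[0, 1]
    print(f"endorsed with p={p:.2f}: {claim}")
```

The stub is only there to show the structure of the idea: learn the translation on claims whose answers we already know, then apply it to the claims we actually care about.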
If the queries return true, then we have clear evidence of immense danger: a Demonstration of Cataclysmic Trajectory. This is much more likely to get the world to take notice and react than the loads of abstract reasoning about fundamental patterns of rational agency, or whatever, that we’ve offered them so far. (Normal people don’t trust abstract reasoning, and they mostly shouldn’t! It’s tricksy!)
From there: national funding for a global collaboration on alignment, and a way to convince the security-minded parts of government to implement the fairly tough global security policies required, so that the alignment project no longer needs to solve the problem in 5 years and can instead take, say, 30.
(And then we solve the symbol grounding problem, and then we figure out value learning, and then we learn how best to aggregate the learned values, and then we’ll have solved the alignment problem)