I’ve seen Eliezer Yudkowsky claim that we don’t need to worry about s-risks from AI, since the Alignment Problem would need to be something like 95% solved before s-risks crop up in a worryingly large fraction of a TAI’s failure modes: a threshold he thinks we are nowhere near crossing. If this is true, it seems to carry the troubling implication that alignment research could be net-negative: partial progress might push us past that threshold, into the regime where failure means s-risks rather than “merely” extinction, without giving us enough time to conquer the remaining 5% of the Alignment Problem.
So is there any work being done on figuring out where that threshold might lie, beyond which we do need to worry about s-risks from TAI? Should this line of reasoning have policy implications, and is this argument about an “s-risk threshold” widely accepted?
One thing that seems worth mentioning: based on my understanding of Alignment Theory, if some smarter version of ChaosGPT did kill all humans, it wouldn’t be because of the instructions it was given, but for the same reason any unaligned AI would kill all humans, namely that it’s unaligned. It’s hard for me to imagine a scenario in which an agent like ChaosGPT, given explicitly destructive instructions, would be more likely to kill everyone than any other unaligned AI; the whole point of the Outer Alignment Problem is that we don’t yet know how to get agents to reliably do the things we want them to do, regardless of whether those things are benevolent, destructive, or anything in between.
Still, I agree that this sets a horrible precedent and that this sort of thing should be prosecuted in the future, if only because, once we do solve Alignment, an agent like ChaosGPT could be dangerous for different (and obvious) reasons unrelated to being unaligned.