In a larger picture, you should also factor in the probability that the oversight (over the breeding of misaligned tendencies) will always be vigilant. The entire history of safety science tells us that this is unlikely, or downright impossible. Mistakes will happen, “obligatory checks” do get skipped, and entirely unanticipated failure modes do emerge. And we should convince ourselves that nothing of this will happen, with decent probability, from the moment of the first AGI deployment, until “the end of time” (or practically, we should show, theoretically, that the ensuing recursive self-improvement or quasi-self-improvement sociotechnical dynamics will only coverage to more resilience rather than less resilience). This is very hard to demonstrate, but it must be done to justify AGI deployment. I didn’t see evidence in the post that Anthropic appreciates this angle of looking at the problem enough.
In a larger picture, you should also factor in the probability that the oversight (over the breeding of misaligned tendencies) will always be vigilant. The entire history of safety science tells us that this is unlikely, or downright impossible. Mistakes will happen, “obligatory checks” do get skipped, and entirely unanticipated failure modes do emerge. And we should convince ourselves that nothing of this will happen, with decent probability, from the moment of the first AGI deployment, until “the end of time” (or practically, we should show, theoretically, that the ensuing recursive self-improvement or quasi-self-improvement sociotechnical dynamics will only coverage to more resilience rather than less resilience). This is very hard to demonstrate, but it must be done to justify AGI deployment. I didn’t see evidence in the post that Anthropic appreciates this angle of looking at the problem enough.