Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this, which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?
We certainly think that abrupt changes in safety properties are very possible! See the discussion in this post of how the most pessimistic scenarios may seem optimistic until very powerful systems are created, and also our paper on Predictability and Surprise.
With that said, I think we tend to expect a bit of continuity. Empirically, even the “abrupt changes” we observe with respect to model size tend to take place over order-of-magnitude changes in compute. (There are examples, such as the formation of induction heads, where qualitative changes in model properties can happen quite fast over the course of training.)
But we certainly wouldn’t claim to know this with any confidence, and wouldn’t take the possibility of extremely abrupt changes off the table!
Looking at the bigger picture, you should also factor in the probability that oversight (over the breeding of misaligned tendencies) will always remain vigilant. The entire history of safety science tells us that this is unlikely, if not downright impossible. Mistakes happen, “obligatory checks” do get skipped, and entirely unanticipated failure modes do emerge. Yet we would need to convince ourselves, with decent probability, that none of this will happen from the moment of the first AGI deployment until “the end of time” (or, practically, we would need to show theoretically that the ensuing recursive self-improvement or quasi-self-improvement sociotechnical dynamics will only converge toward more resilience rather than less). This is very hard to demonstrate, but it must be done to justify AGI deployment. I didn’t see evidence in the post that Anthropic sufficiently appreciates this angle on the problem.