Note: While I think the argument in this post is important and is evidence, I overall expect phase changes to be a big deal. I consider my grokking work pretty compelling evidence that phase changes are tied to the formation of circuits, and I’m excited about doing more research in this direction. That said, it’s not at all obvious to me whether sophisticated behaviors like deception will be a single circuit or many.
The picture of phase changes from your post and the theoretical analysis here both suggest that you may be able to observe capabilities as they form if you know what to look for. That seems broadly similar to the situation suggested in the OP, though I think with a different underlying mechanism (in the OP I think there probably is no similar phase change in the model itself for any of these tasks) and a different set of things to measure.
In general, if we get taken by surprise by a capability, it seems fairly likely that the story in retrospect would be that we just didn’t know what to measure. So for people worried about rapid emergence, it seems natural to try to get really good at predicting these kinds of abrupt changes, whether they come from phase changes in the model, non-convexities from RL exploration (which exhibit hyperbolic phase changes), or performance measurements that elide progress.
(My guess would be that deception is significantly smoother than any of the trends discussed here, which are themselves significantly smoother than bona fide phase changes, just because larger and larger combinations tend to get smoother and smoother. But it still seems possible to get taken by surprise, especially if you aren’t being very careful.)
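To make "performance measurements that elide progress" concrete, here is a toy sketch rather than anything taken from the OP or from real training runs. It assumes a task whose answer is 20 tokens long, independent per-token errors, and a made-up series of per-token accuracies that improve steadily across checkpoints. The function names and all numbers are hypothetical.

```python
import math

# Hypothetical setup: every number below is illustrative, not measured.
ANSWER_LENGTH = 20  # assumed number of tokens that must all be correct
CHECKPOINTS = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]  # assumed per-token accuracy


def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length


def per_token_log_loss(per_token_accuracy: float) -> float:
    """Negative log-probability assigned to the correct token."""
    return -math.log(per_token_accuracy)


# One row per checkpoint: (per-token accuracy, log loss, exact-match accuracy).
rows = [
    (p, per_token_log_loss(p), exact_match_accuracy(p, ANSWER_LENGTH))
    for p in CHECKPOINTS
]

for p, loss, exact in rows:
    print(f"per-token acc {p:.2f}   log loss {loss:.3f}   exact match {exact:.4f}")
```

Under these assumptions, the log loss falls steadily at every checkpoint, while exact match stays near zero for most of the run and then climbs steeply near the end. Someone tracking only exact match would see an abrupt jump even though nothing abrupt happened in the underlying quantity.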