If it’s true that models are “starting to become barely capable of noticing that they are falling for this pattern” then I agree it’s a good sign (assuming that we want the models to become capable of “general intelligence”, of course, which we might not). I hadn’t noticed any such change, but if you tell me you’ve seen it I’ll believe you and accordingly reduce my level of belief that there’s a really fundamental hole here.
It’s necessary to point it out to the model to see whether it might be able to understand; it doesn’t visibly happen on its own, and it’s hard to judge how well the model understands what’s happening with its behavior unless you start discussing it in detail (to a different extent for different models). The process I’m following to learn about this is to start discussing the general reasoning skills the model is failing at when it repeatedly can’t make progress on some object-level problem (instead of discussing details of the object-level problem itself). Then I observe how the model fails to understand and apply the general reasoning skills I’m explaining.
I’d say the current best models are not yet at the stage where they can understand such issues well when I try to explain them, so I don’t expect the next generation to become autonomously agentic yet (with any post-training). But they keep getting slightly better at this, with the first glimpses of understanding appearing in the original GPT-4.