I have no clue how that works in a stable manner, but I don’t think that current architectures can learn this even if you scale them up.
I definitely agree with this if “stable” also implies “the thing we actually want.”
I would worry that the System 1 → System 2 push is a low-level convergent property across a wide range of possible architectures that have something like goals. Even as the optimization target diverges from what we're really trying to make it learn, I could see the system still picking up more deliberate thought just because it helps with so many different things.
That said, I would agree that current token predictors don't seem to do this naturally. We can elicit a simulation of it by changing how we use the predictor, but the optimizer doesn't operate across multiple steps and can't directly push for it. (I'm actually hoping we can make use of this property somehow to make some stronger claims about a corrigible architecture, though I'm far from certain that scaled-up versions of current token-predictor architectures can't do well enough via simulation.)
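To make the "changing how we use the predictor" point concrete, here's a minimal toy sketch (names like `toy_predictor` and `deliberate` are made up for illustration, and the predictor is just a stub standing in for a real model): the deliberation loop lives entirely outside the model, so the training optimizer only ever shapes single-step predictions and never directly optimizes across the steps.

```python
def toy_predictor(prompt: str) -> str:
    """Stand-in for a single-step token predictor (one call = one continuation)."""
    # A real system would call a trained language model here; this stub only
    # illustrates the interface: prompt in, single continuation out.
    return f"[continuation of: ...{prompt[-30:]}]"


def deliberate(question: str, steps: int = 3) -> str:
    """Simulate System-2-style deliberation by chaining single-step predictions."""
    scratchpad = f"Question: {question}\n"
    for i in range(steps):
        # Each call is an independent single-step prediction; no gradient or
        # optimization pressure spans these iterations -- the "deliberation"
        # is imposed by this outer loop, not learned end to end.
        thought = toy_predictor(scratchpad + f"Step {i + 1}:")
        scratchpad += f"Step {i + 1}: {thought}\n"
    return toy_predictor(scratchpad + "Final answer:")


if __name__ == "__main__":
    print(deliberate("Is this behavior learned by the optimizer, or scaffolded?"))
```

The point of the sketch is just that the multi-step structure is a property of how we drive the predictor, not something the training objective itself rewards.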
Only half a joke! :P