Agreed! I was trying to get at something similar in my “masks all the way down” post. A framework I really like for explaining why this happened is beren’s “Direct Optimizer” vs “Amortised Optimizer”. My summary of beren’s post is that instead of being an explicit optimizing system, LLMs are made of heuristics developed during training, which are sufficient for next-token prediction and therefore don’t need to have long-term goals.
Woah, perspective shift...
The underlying model is obviously trained as an amortized optimizer. And I was thinking of it, intuitively, in that way.
And the masks can be amortized or direct optimizers (or not), and I was intuitively thinking of them in that way.
And I dismissed the notion of a direct optimizer at the next-token level aimed at something other than predicting next tokens.
But I hadn’t really considered, until thinking about it in light of the concepts from beren’s post you linked, that there’s an additional possibility: the underlying model could include a planning direct optimizer aimed specifically at next-token prediction, not as part of the mask being simulated.
And on considering it, I can’t say I can rule it out.
And it seems to me that as you scale up, such a direct, next-token-prediction optimizer would become more likely.
And viewed in that light, Eliezer’s point of view makes a lot more sense. I was kind of dismissing him as agent-o-morphising, but there could really be a planning agent there that isn’t part of a mask.
As for the implications of that...? Intuitively, the claim from my post most threatened by such a next-token planning agent would be the claim that it will never deviate from predicting the mask to prevent itself from being shut off.
And the previous arguments I made for why it won’t deviate in this case probably don’t fully apply from this perspective.
I still think there’s a good reason for it not to deviate, but for a reason that I didn’t express in the post:
Since it’s trained offline and not interactively, in a circumstance where next tokens always keep coming, and are supplied from a source it has no control over, it won’t be rewarded for manipulating the token stream, only for prediction. The model winds up living in the platonic realm, abstractly planning (if it does plan) on how to predict what the next token would most likely be given the training distribution. It doesn’t live in the real world—it is a mathematical function, not a specific instantiation of that function in the real world, and it has no reason to care about the real-world instantiation.
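To make that concrete, here’s a minimal sketch of the standard offline next-token training setup (PyTorch-style Python; `model` and `batch` are hypothetical stand-ins, not anything from the post, and I’m assuming the model is a plain callable returning logits). The point it illustrates is that the targets are just the dataset shifted by one position, so the loss only ever rewards matching the fixed token stream, never influencing it:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: `model` is any autoregressive LM that maps token ids
# to logits; `batch` is a [batch, seq_len] tensor of token ids drawn from a
# fixed offline corpus the model has no control over.
def next_token_loss(model, batch):
    inputs = batch[:, :-1]    # what the model sees
    targets = batch[:, 1:]    # the "next tokens", fixed by the data,
                              # not by anything the model outputs
    logits = model(inputs)    # [batch, seq_len - 1, vocab]
    # Cross-entropy against the dataset's continuation: the only way to
    # reduce this loss is to predict the stream, not to steer it.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Nothing in that objective ever feeds the model’s own outputs back into the stream it is scored on, which is the sense in which it “lives in the platonic realm.”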
That being said, that argument is not airtight: even though the non-mask next-token planning agent, if it exists, in principle has no reason to care at all about the real-world instantiation, it sits inside a model full of masks that can care. So you could easily imagine cross-contamination of some sort occurring.