We think this occurs because, in general, there are groups of belief states that are degenerate in the sense that they share the same next-token distribution. In that case, the formalism presented in this post says that even though the distinction between those states must be represented in the transformer's internals, the transformer is free to discard it for the purpose of predicting the next token (in the local sense), which happens most directly right before the unembedding.
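To make that degeneracy concrete, here is a minimal toy sketch (the HMM, its numbers, and the Moore-style emission convention are made up for illustration, not taken from the post): two distinct belief states can induce identical next-token distributions while still predicting different futures one step further out.

```python
import numpy as np

# Hypothetical 3-state HMM over a binary token alphabet.
# States 1 and 2 emit tokens with identical probabilities,
# but their transition dynamics differ.
emit = np.array([
    [0.9, 0.1],   # state 0
    [0.2, 0.8],   # state 1
    [0.2, 0.8],   # state 2 (same emissions as state 1)
])
trans = np.array([
    [0.1, 0.9, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Two distinct belief states (distributions over hidden states)...
b1 = np.array([0.5, 0.5, 0.0])
b2 = np.array([0.5, 0.0, 0.5])

# ...with identical next-token distributions:
print(b1 @ emit)  # [0.55 0.45]
print(b2 @ emit)  # [0.55 0.45]

# But the distinction still matters beyond the next token:
print((b1 @ trans) @ emit)  # [0.585 0.415]
print((b2 @ trans) @ emit)  # [0.235 0.765]
```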
I wonder if you could force the Mixed-State Presentation to be “conserved” in later layers by training the model with different objectives. For instance, training on next-token prediction and next-token-after-that prediction might force the model to be a lot more “rigorous” about its MSP.
Papers from Google have shown that you can get more predictable results from LLMs if you train them on both next-token prediction and "fill-in-the-blanks" tasks where random tokens are removed from the middle of a text. I suspect the same would apply here.
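For the two-horizon idea above, here is a rough sketch of what the combined objective might look like. The two-head setup and the function name are hypothetical, assuming a standard PyTorch decoder with an extra output head for the token two steps ahead:

```python
import torch
import torch.nn.functional as F

def multi_horizon_loss(logits_t1, logits_t2, tokens):
    """Hypothetical combined objective: predict the next token and the one after it.

    logits_t1, logits_t2: (batch, seq_len, vocab) heads for horizon 1 and horizon 2
    tokens:               (batch, seq_len) input token ids
    """
    vocab = logits_t1.size(-1)
    # Horizon 1: position i predicts token i+1.
    loss_1 = F.cross_entropy(
        logits_t1[:, :-1].reshape(-1, vocab),
        tokens[:, 1:].reshape(-1),
    )
    # Horizon 2: position i predicts token i+2.
    loss_2 = F.cross_entropy(
        logits_t2[:, :-2].reshape(-1, vocab),
        tokens[:, 2:].reshape(-1),
    )
    return loss_1 + loss_2
```

The hope would be that the horizon-2 term penalizes collapsing belief states that agree on the next token but disagree further out, so the MSP distinctions have to survive into later layers.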