Q: Why isn’t this optimised against during training? And how is having this dynamic helpful for predicting what a human scientist will say?
The structure of the post might already be too convoluted and confusing, so I’m answering this in a comment.
A: The model doesn’t generate text during training, so feedback-loop dynamics are not directly penalized.

Being able to predict how parts of humans work, when humans notice something weird, or when human authors want their characters to break the fourth wall, and understanding how agents operate in general and how humans differ from that: all of this seems useful, and it’s something I expect LLMs to learn. If you train a model to predict a single token of text from the training dataset, the systematic differences its cognition introduces don’t matter; taken together, they are the best predictor we could find for that dataset, and it works well. The LLM doesn’t “try” to predict what a scientist says; it is some learned process that performs well at predicting the next tokens during training, and the same process is used outside training. Once you add a for loop, some of those differences accumulate and get reinforced, and the text generated by the LLM becomes noticeably different from anything in the dataset; heuristics that are useful for predicting what a smart scientist says in the training distribution (like thinking about the various parts human cognition is made of) make it go off the rails. Going off the rails once it’s run in a for loop isn’t optimised against during training, and it’s a natural thing for the model to do.
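To make the “for loop” point concrete, here is a minimal sketch of the asymmetry. The model here is a dummy uniform predictor, not a real LLM API, and all the names are illustrative: during training, loss is computed on prefixes of dataset text (teacher forcing), while at inference the model’s own samples are appended to its context, which is where the feedback loop enters.

```python
# Minimal sketch of the training/inference asymmetry described above;
# `model_next_token_probs` is a dummy stand-in, not a real LLM API.
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def model_next_token_probs(context):
    """Stand-in for the LLM: a distribution over the next token.
    A real model would condition on `context`; this dummy ignores it."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

# Training-style evaluation (teacher forcing): the model only ever sees
# prefixes of dataset text, so its own systematic quirks never feed back.
def teacher_forced_loss(dataset_tokens):
    loss = 0.0
    for i in range(1, len(dataset_tokens)):
        probs = model_next_token_probs(dataset_tokens[:i])
        loss -= math.log(probs[dataset_tokens[i]])
    return loss / (len(dataset_tokens) - 1)

# Inference (the "for loop"): every sampled token becomes context for the
# next prediction, so any systematic deviation from the training
# distribution is fed back in and can accumulate.
def generate(prompt_tokens, n_steps):
    context = list(prompt_tokens)
    for _ in range(n_steps):
        probs = model_next_token_probs(context)
        tokens, weights = zip(*probs.items())
        context.append(random.choices(tokens, weights=weights)[0])
    return context

print(teacher_forced_loss(["the", "cat", "sat", "on", "the", "mat", "."]))
print(generate(["the", "cat"], 5))
```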
In stupider models, that might look like text exponentially deteriorating: the model predicts the next token, the text with that token added is slightly worse, it then predicts the next token of a slightly stupider text, and the degradation accumulates directly; the model might also notice that the text keeps getting worse, which additionally reinforces the dynamic. I’m claiming that a similar thing might happen with agency. If some parts of the activations feeding into the circuits that think about characters describe the kinds of characters who are slightly better than others at getting the model to give similar future characters more weight, then characters like that will naturally gain weight; and this seems correlated with being context-aware, agentic, and smart. The distributional shift produced by generating text with an LLM is enough for some characters to notice the difference, maybe infer something about how the system works, and for some of them to try to exploit it. But I’m guessing the actual dynamic appears even before that, because if some characters are slightly better at exploring this, the relevant kinds of characters will be naturally selected.
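A toy way to see why even a small edge gets selected: treat the model’s implicit mixture over characters as a weight vector that gets slightly reweighted at each generated token. This is purely illustrative; the amplification numbers are made up, and nothing here is meant to describe actual circuits.

```python
# Toy illustration (not a claim about real LLM internals) of the selection
# pressure: "characters" whose style slightly reinforces itself gain weight
# in the mixture over a long rollout, even from a tiny per-token edge.
import numpy as np

n_characters = 5
weights = np.full(n_characters, 1.0 / n_characters)   # initial mixture

# Hypothetical per-step self-reinforcement factors: the last character is
# only ~2% better per token at making future tokens resemble itself.
amplification = np.array([1.00, 1.00, 1.00, 1.01, 1.02])

for step in range(500):                    # 500 tokens of generation
    weights = weights * amplification      # tiny per-step advantage
    weights = weights / weights.sum()      # renormalize the mixture

print(np.round(weights, 3))
# -> roughly [0. 0. 0. 0.007 0.993]: the most self-reinforcing character
#    dominates, despite starting with the same weight as the others.
```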