From footnote 6 above:

> the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).
Why is this supposed to be true? Intuitively, it seems to clash with the author's view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation becomes more stringent, and dataset cleaning has grown more stringent over time. For a simple example, see my post on dataset deduplication and situational awareness.
Some ways in which situational awareness could improve performance on next-token prediction include:

- modeling the data curation process (a toy sketch of this follows the list);
- predicting other AIs, via the model introspecting on its own structure;
- likewise, predicting the contents of ML papers;
- predicting the actions of AI labs by understanding how their AIs work;
- the model predicting its own output wherever such output shows up in the training data (e.g. via RLHF);
- and so on.
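To make the first mechanism concrete, here is a minimal, hypothetical sketch of an exact-match deduplication pass of the kind used in dataset curation. The function name, n-gram size, and hashing scheme are illustrative assumptions, not taken from any real pipeline (real pipelines use, e.g., suffix-array or MinHash methods). The point is only that curation leaves a statistical signature: a model that has learned its corpus was filtered this way can, in principle, exploit it, e.g. by assigning near-zero probability to a long verbatim repeat of an earlier training document.

```python
import hashlib

def dedupe_exact(documents, ngram_size=50):
    """Toy exact-match deduplication: drop any document that shares a
    hashed n-gram (a span of `ngram_size` tokens) with an earlier
    document. Illustrative only, not a real curation pipeline."""
    seen_spans = set()
    kept = []
    for doc in documents:
        tokens = doc.split()  # stand-in for a real tokenizer
        spans = {
            hashlib.sha1(" ".join(tokens[i:i + ngram_size]).encode()).hexdigest()
            for i in range(max(1, len(tokens) - ngram_size + 1))
        }
        if spans & seen_spans:
            continue  # shares a span with an earlier doc: dropped
        seen_spans |= spans
        kept.append(doc)
    return kept

# A model that has internalized this curation rule "knows" that long
# verbatim repeats across training documents are systematically absent,
# a (small) predictive edge unavailable to a model with no picture of
# how its own training data was produced.
corpus = ["the cat sat on the mat " * 20,
          "the cat sat on the mat " * 20,  # exact duplicate: removed
          "an entirely different document"]
print(len(dedupe_exact(corpus, ngram_size=10)))  # -> 2
```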
To be clear, I think situational awareness is relevant in pre-training, just less so than in many other settings (e.g. basically any RL setup, including RLHF) where the model acts directly in the world. And exactly when in the model's development it comes to understand the training process matters a lot for deceptive alignment.