Also, now that I think of it, it's different for the model to:

1. represent hidden subtle information which is natural for the task, like in that example you gave with "written by"
2. use that hidden subtext for computation which is unnatural in the given context, i.e. covertly do some task other than the one it is asked for
While 1. definitely happens, I see no reason why 2. would be learned in pre-training. In principle, 2. could be learned in RL fine-tuning if there's some incentive to use hidden reasoning, although I imagine that would be hard.
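To make "some incentive to use hidden reasoning" concrete, here's a minimal toy sketch (all names, tokens, and numbers are hypothetical illustrations, not anything from the discussion above): a reward that pays for task success but penalizes any visible trace of a banned reasoning style, so the only way to collect full reward is to do the computation without showing it.

```python
# Toy sketch of an RL reward that creates pressure toward hidden reasoning.
# FORBIDDEN_TOKENS and the penalty value are hypothetical: they stand in for
# a "CoT monitor" that flags visible traces of some banned reasoning style.

FORBIDDEN_TOKENS = {"carry", "borrow"}  # hypothetical banned arithmetic steps


def monitor_flags(chain_of_thought: str) -> bool:
    """Toy 'CoT monitor': flags any visible trace of the banned reasoning."""
    words = chain_of_thought.lower().split()
    return any(tok in words for tok in FORBIDDEN_TOKENS)


def reward(answer: str, correct_answer: str, chain_of_thought: str) -> float:
    """Task reward minus a penalty when the monitor fires.

    Under this objective, a policy that solves the task while keeping the
    computation out of its visible text strictly dominates one that reasons
    openly, so the incentive gradient points toward hidden reasoning.
    """
    task_reward = 1.0 if answer == correct_answer else 0.0
    penalty = 0.5 if monitor_flags(chain_of_thought) else 0.0
    return task_reward - penalty


# Open reasoning earns 0.5; hidden-but-correct reasoning earns the full 1.0.
print(reward("379", "379", "add the ones, carry the 1, then the tens"))  # 0.5
print(reward("379", "379", "the answer is 379"))                         # 1.0
```

Whether gradient-based training could actually discover the covert strategy under such a reward is exactly the open question; the sketch only shows that the incentive can exist.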