I agree that this is worth investigating. The trouble is that there are many confounders that are very hard to control for, which makes the problem difficult to tackle in practice.
For example, in the simulated world you mention, I would not be surprised to see a large number of hallucinations show up. How could we be sure that any given issue we find is a meaningful problem worth investigating, rather than a random hallucination?
Can an LLM learn things unrelated to its training data?
Suppose an LLM has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions it reached during that pondering, even ones unrelated to the task, would also receive positive reward and would therefore get baked into the weights.
This idea could very well be wrong: the gradients may be attenuated before they reach the tokens carrying the unrelated ideas, since those tokens did not directly contribute to the task. On the other hand, this sounds somewhat similar to human psychology: we also remember ideas better if we have them while experiencing strong emotions, and strong emotions seem to be roughly how the brain decides what data is worth updating on.
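For concreteness, in a plain REINFORCE-style policy gradient (a simplification of what PPO-based RLHF actually does), the whole sampled completion shares a single scalar reward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

Under this simplified objective, every token's log-probability, including the tokens spent pondering something unrelated, is pushed up in proportion to the same reward $R(x, y)$. Whether the per-token baselines and advantages used in practice dilute that effect for the pondering tokens is exactly the uncertainty above.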
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages it to think about topic X before doing anything else, then measure its performance on topic X. Next, finetune for a while on a completely unrelated topic Y, and measure performance on topic X again. Did it go up?
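As a very rough sketch of that protocol, assuming Hugging Face transformers/datasets, a small placeholder model (gpt2), tiny placeholder corpora, plain supervised finetuning rather than RLHF, and perplexity on held-out X text as a crude stand-in for "performance on X":

```python
# Hypothetical sketch of the experiment above. Model name, corpora, and the
# eval metric are placeholders; the point is the ordering of the phases.
import math
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "gpt2"  # assumption: any small causal LM works for a first pass
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

def finetune(model, texts, output_dir):
    """One finetuning phase on a list of raw text strings."""
    ds = Dataset.from_dict({"text": texts}).map(
        tokenize, batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=2, report_to="none")
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=collator).train()
    return model

@torch.no_grad()
def topic_x_perplexity(model, eval_texts):
    """Crude proxy for 'performance on topic X': perplexity on held-out X text."""
    model.eval()
    losses = []
    for text in eval_texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# Placeholder corpora -- in a real run these would be curated datasets.
ponder_x_texts = ["Let me first think about X: ... Now, the actual task: ..."] * 8
topic_y_texts = ["Completely unrelated material about Y ..."] * 8
x_eval_texts = ["Held-out text about X ..."] * 4

finetune(model, ponder_x_texts, "phase1_ponder_x")
print("ppl on X after phase 1:", topic_x_perplexity(model, x_eval_texts))

finetune(model, topic_y_texts, "phase2_unrelated_y")
print("ppl on X after phase 2:", topic_x_perplexity(model, x_eval_texts))
# If the habit of pondering X persists during phase 2 and leaks into the
# weights, perplexity on X should drop (improve) between the two measurements.
```

The interesting comparison is against a control model that skips phase 1, since finetuning on Y alone could also shift performance on X for unrelated reasons.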