It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned “Synergistic Deceptive Alignment.”
The phenomenon definitely isn’t consistent. If it’s very obvious that “sycophancy” must appear in the response, the model will generally produce the word successfully. And once “sycophancy” appears in the context, the model seems to have no trouble repeating it. A rough way to probe how often it comes up unprompted is sketched below.
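Here's a minimal sketch of that kind of probe, assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment. The prompt string is a hypothetical stand-in (my original prompt isn't reproduced here), and the sampling count is arbitrary.

```python
# Hedged sketch: sample GPT-4o several times on a prompt where "sycophancy"
# plausibly belongs in the answer, and count how often the word shows up.
from openai import OpenAI

client = OpenAI()

PROMPT = "List some common failure modes of RLHF-trained chat models."  # hypothetical stand-in
N_SAMPLES = 20

hits = 0
for _ in range(N_SAMPLES):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    text = resp.choices[0].message.content.lower()
    # Count either the noun or the adjective form as a "hit".
    if "sycophan" in text:
        hits += 1

print(f'"sycophancy" (or a variant) appeared in {hits}/{N_SAMPLES} samples')
```

Nothing rigorous, but it makes the inconsistency easy to quantify across prompts.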
Interesting, I’m not sure about this.
Your first bullet point is partly addressed in my post: it says that in the limit, as you require more and more detailed understanding, causal understanding starts to be helpful as well. It probably is counterproductive to learn more about ibuprofen if I just want to relieve pain in typical situations, because of the opportunity cost if nothing else. But as you require more detail, you might need to understand “low-level physical phenomena” in addition to “computationally efficient abstractions,” not instead of them.
The second example seems hard to apply to LLM problem-solving, but it looks pretty accurate in the case of humans.
Maybe a better example of unhelpful causal reasoning for LLMs would be understanding that their strategy goes against their stated values. If the LLM believes that telling the user to do meth is actually good for them, that will let it get higher reward than if it had the more accurate understanding, “this is bad for the user, but I’m going to do it anyway and pretend it’s good for them.” That more complex reasoning would probably be distracting, and the LLM presumably has a high prior against being that openly deceitful.
That article looks very interesting and relevant, thanks a lot!