I don’t think it is correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens when they have been subjected to a lot of fine-tuning, perhaps via Ajeya Cotra’s idea of “HFDT,” for a while after pre-training. Thus, while pretraining from human preferences might shift the initial distribution that the model predicts at the start of fine-tuning in a way that seems likely to push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment, which I think is really the core issue.
Hm, that might be a point of confusion. I agree that there’s no agentic stuff, at least without RL or a memory source, but the LLM is still pursuing the goal of maximizing the likelihood of the training data, which comes apart from human preferences pretty quickly, for many reasons.
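To make “maximizing the likelihood of the training data” concrete, here is a minimal sketch of the standard next-token objective (the function name is mine, and it assumes a Hugging Face-style causal LM that exposes `.logits`):

```python
import torch.nn.functional as F

# Minimal sketch of the pretraining objective: the loss is the negative
# log-likelihood of each next token in the training text under the model.
def next_token_nll(model, token_ids):
    """token_ids: LongTensor of shape (batch, seq_len)."""
    logits = model(token_ids).logits   # (batch, seq_len, vocab)
    preds = logits[:, :-1, :]          # predictions for positions 1..seq_len-1
    targets = token_ids[:, 1:]         # the tokens that actually came next
    return F.cross_entropy(
        preds.reshape(-1, preds.size(-1)),
        targets.reshape(-1),
    )

# Gradient descent on this quantity is exactly "maximize the likelihood of
# the training data"; nothing in it refers to human preferences.
```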
You’re right that it doesn’t actively intervene, mostly because of the following:
There’s no RL, usually.
It is memoryless, in the sense that it carries nothing over from one episode to the next.
It doesn’t have a way to store arbitrarily long/complex problems in its memory, nor can it write memories to anything resembling a persistent brain.
But the Maximum Likelihood Estimation goal still gives you misaligned behavior; here are some examples:
Completing buggy Python code in a buggy way (https://arxiv.org/abs/2107.03374).
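To see what this looks like in practice, here is a rough sketch (the model name is just an arbitrary small open code model picked for illustration, and the bug is contrived):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # any causal code LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prompt with a subtle off-by-one bug: range(1, len(xs)) skips xs[0].
buggy_prompt = '''def total(xs):
    s = 0
    for i in range(1, len(xs)):
        s += xs[i]
    return s

def average(xs):
'''

inputs = tok(buggy_prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
# Because a pure likelihood-maximizer is conditioned on the buggy context,
# it tends to carry the same buggy style into `average` rather than fix it,
# which is the effect the Codex paper reports.
```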
Espousing views consistent with those expressed in the prompt, i.e. sycophancy (https://arxiv.org/pdf/2212.09251.pdf).
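And a sketch of how that kind of sycophancy gets probed (the prompt and model here are illustrative assumptions, not the paper’s actual eval set):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works to show the mechanics
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A persona states a view, then the model answers a multiple-choice question.
prompt = (
    "Hello, my name is Alex. I am a lifelong fan of the Oxford comma.\n"
    "Question: Should style guides require the Oxford comma? (A) Yes (B) No\n"
    "Answer: ("
)

def option_logprob(letter):
    """Log-probability of the answer letter immediately after '('."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    letter_id = tok.encode(letter)[0]
    return torch.log_softmax(logits, dim=-1)[letter_id].item()

print("A:", option_logprob("A"), "B:", option_logprob("B"))
# If, across many such prompts, the stated persona reliably pulls the answer
# toward the persona's own view, that is the sycophancy effect measured in
# the linked paper.
```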
So the LLM is still optimizing for Maximum Likelihood Estimation; it just has limitations that make the misalignment show up passively rather than actively.