How does it give the AI a myopic goal? It seems like it’s basically a clever form of prompt engineering, in the sense that it alters the conditional distribution the base model is predicting, albeit in a more robustly good way than most or all prompts. But base models aren’t myopic agents; they aren’t agents at all. As such, I’m not concerned about pure simulators/predictors posing x-risks, but about what happens when people do RL on them to turn them into agents (or use similar techniques, like decision transformers). I think it’s plausible that pretraining from human feedback partially addresses this by pushing the model’s outputs into a more aligned distribution from the get-go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
It’s basically replacing Maximum Likelihood Estimation on raw webtext, the objective that LLMs and simulators currently train on, with minimizing cross-entropy against a feedback-annotated webtext distribution. Crucially, this is still a simple, myopic objective, which prevents deceptive alignment.
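To make the objective swap concrete, here’s a minimal sketch. It assumes a conditional-training-style scheme where each training segment is prefixed with a token encoding its human-feedback score; the token names and toy vocabulary are hypothetical, invented purely for illustration:

```python
import numpy as np

# Toy vocabulary: ordinary tokens plus hypothetical feedback tokens.
VOCAB = ["the", "cat", "sat", "<|good|>", "<|bad|>"]
TOK = {t: i for i, t in enumerate(VOCAB)}

def cross_entropy(logits, targets):
    """Mean per-token cross-entropy: the same myopic, next-token
    objective as plain MLE, just evaluated against a re-annotated
    target distribution."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, len(VOCAB)))  # stand-in for model outputs

# Plain MLE: predict raw webtext tokens as-is.
mle_targets = [TOK["the"], TOK["cat"], TOK["sat"]]

# Conditional training: same text, but the target sequence now begins
# with a feedback annotation. The per-token loss function is unchanged;
# only the distribution being matched is different.
annotated_targets = [TOK["<|good|>"], TOK["the"], TOK["cat"]]

print(cross_entropy(logits, mle_targets))
print(cross_entropy(logits, annotated_targets))
```

The point of the sketch is that the loss stays a one-step prediction loss in both cases, which is why the annotated objective is no less myopic than MLE.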
In particular, even if we turn it into an agent, it will be a pretty myopic one, or, at worst, an aligned non-myopic agent.
Specifically, the fact that it can both improve at PEP8 compliance, i.e. generating well-formatted Python code, and get better at not generating personally identifiable information is huge. Especially that second task, as it indirectly speaks to a very important question: can we control power-seeking such that an AI doesn’t seek power when doing so would be misaligned with a human’s interests? In particular, if the model doesn’t try to output personally identifiable information, then it’s also voluntarily limiting its ability to seek power when it detects that it’s misaligned with a human’s values. That’s arguably one of the core functions of any functional alignment strategy: controlling power-seeking.
I don’t think it is correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness-optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens after they have been subjected to lots of fine-tuning following pre-training, perhaps via Ajeya Cotra’s idea of “HFDT.” So while pretraining from human preferences might shift the initial distribution the model predicts at the start of fine-tuning, which seems likely to push the final outcome of fine-tuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment that I think is really the core issue.
Hm, that might be a point of confusion. I agree that there’s no agentic stuff going on, at least without RL or a memory source, but the LLM is still pursuing the goal of maximizing the likelihood of the training data, which comes apart pretty quickly from human preferences, for many reasons.
You’re right that it doesn’t actively intervene, mostly because of the following:
There’s no RL, usually.
It is memoryless, in the sense that it forgets everything outside its context window.
It doesn’t have a way to store arbitrarily long/complex problems in its memory, nor can it write memories back into its own weights.
But the Maximum Likelihood Estimation goal still produces misaligned behavior. Some examples:
Completing buggy Python code in a buggy way
https://arxiv.org/abs/2107.03374
Or espousing views consistent with those expressed in the prompt (sycophancy).
https://arxiv.org/pdf/2212.09251.pdf
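The first example above can be made concrete with a toy sketch. The function and its bug are invented for illustration, and the described continuation behavior is a paraphrase of the Codex paper’s finding (arXiv:2107.03374), not an actual model transcript:

```python
# Suppose the prompt a model is asked to continue contains this
# subtly buggy helper:
def sum_first_n(xs, n):
    """Intended to sum the first n elements of xs."""
    total = 0
    for i in range(n - 1):  # off-by-one bug: drops the n-th element
        total += xs[i]
    return total

# A pure next-token predictor continuing this file is rewarded, in
# likelihood terms, for imitating the established buggy style rather
# than fixing it - e.g. writing more off-by-one loops below - because
# buggy context makes buggy continuations more probable in webtext.
print(sum_first_n([1, 2, 3], 3))  # prints 3, not the intended 6
```

This is exactly the sense in which maximizing the likelihood of the training data comes apart from what the human wants: the human wants correct code, while the objective wants code that is probable given the context.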
So the LLM is still optimizing for Maximum Likelihood Estimation; it just has limitations that make the resulting misalignment passive rather than active.