You’re right that optimism bias is an issue, but optimism bias is largely an individual-level phenomenon. What matters more is what people share rather than what they believe, so the fact that negative news gets shared more widely is the bigger problem.
But we recently found an alignment technique that solves almost every alignment problem in one go and scales well with data.
Pretraining on human feedback? I think it’s promising, but we have no direct evidence of how it interacts with RL finetuning to turn LLMs into agents, which is the key question.
Yes, I’m talking about the technique known as Pretraining from Human Feedback.
The biggest reasons I’m so optimistic about the technique, even with its limitations, are the following:
1. It completely, or almost completely, solves deceptive alignment by giving the model a myopic goal, so there is far less incentive, or no incentive at all, to be deceptive.
2. It scales well with data, which is extremely useful: the more data it has, the more aligned it becomes.
3. The tests, while somewhat unimportant from our perspective, gave tentative evidence that we can control power-seeking: an AI can avoid seeking power when it’s misaligned and seek power only when it’s aligned.
4. They dissolved, rather than resolved, embedded agency/embedded alignment concerns by using offline learning: unlike with online learning, the AI can’t hack or manipulate a human’s values (a toy contrast is sketched just after this list). In essence, they translated the ontology of Cartesianism and its boundaries in a sensible way to an embedded world.
It’s not a total one-shot solution, but it’s the closest we’ve come to one, and I can see a fairly straightforward path to alignment from here.
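To make the offline-versus-online distinction in point 4 concrete, here is a deliberately toy, self-contained sketch. Everything in it (the “rater drift” rule, the numbers, the names) is invented for illustration and is not from the Pretraining from Human Feedback paper; it only shows which feedback channel offline learning removes.

```python
# Toy contrast between offline feedback (fixed, pre-scored corpus) and online
# feedback (a live rater reacting to the model's own outputs). All numbers and
# the "flattery shifts the rater" rule are made up purely for illustration.
import random

random.seed(0)

# Offline: every example was collected and scored before training began.
OFFLINE_CORPUS = [("helpful answer", +1.0), ("manipulative answer", -1.0)]

def offline_training(steps: int) -> float:
    weight = 0.0
    for _ in range(steps):
        _, score = random.choice(OFFLINE_CORPUS)  # the model cannot change this data
        weight += 0.01 * score
    return weight

def online_training(steps: int) -> float:
    weight, rater_bias = 0.0, 0.0
    for _ in range(steps):
        flattery = max(weight, 0.0)          # pretend larger weight = more flattery
        rater_bias += 0.1 * flattery         # model behaviour shifts the rater...
        weight += 0.01 * (1.0 + rater_bias)  # ...and the shifted rater rewards it more
    return weight

if __name__ == "__main__":
    print("offline:", round(offline_training(200), 3))
    print("online: ", round(online_training(200), 3))
```

The point is only that in the offline regime there is no path by which the model’s behaviour can alter the data or the scores it is trained on, whereas the online loop contains exactly such a path.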
1. How does it give the AI a myopic goal? It seems like basically a clever form of prompt engineering, in the sense that it alters the conditional distribution that the base model is predicting, albeit in a more robustly good way than most or all prompts. But base models aren’t myopic agents; they aren’t agents at all. As such, I’m not concerned about pure simulators/predictors posing x-risks; what concerns me is what happens when people do RL on them to turn them into agents (or use similar techniques like decision transformers). I think it’s plausible that Pretraining from Human Feedback partially addresses this by pushing the model’s outputs into a more aligned distribution from the get-go when we do RLHF, but it is very much not obvious that it solves the deeper problems with RL more broadly (inner alignment and scalable oversight/sycophancy).
2. I agree that scaling well with data is quite good. But see (1).
3. How?
4. I was never that concerned about this, but I agree it does seem good to offload more training to pretraining, as opposed to finetuning, for this and other reasons.
How does it give the AI a myopic goal?
It basically replaces Maximum Likelihood Estimation, the objective that LLMs and simulators currently use, with minimizing cross-entropy against a feedback-annotated webtext distribution. That is a simple, myopic goal, which prevents deceptive alignment.
In particular, even if we turn it into an agent, it will be a fairly myopic one, or, at worst, an aligned non-myopic agent.
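As a rough sketch of what “cross-entropy on a feedback-annotated webtext distribution” can look like, here is a minimal version of conditional training, one of the objectives studied in the PHF paper, written against a Hugging Face-style causal LM and tokenizer. The `<|good|>`/`<|bad|>` token names and the reward threshold are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of conditional training: prepend a control token chosen from a
# pre-computed feedback score, then train with ordinary next-token cross-entropy.
# Assumes a Hugging Face-style causal LM/tokenizer; token names are illustrative.
import torch
import torch.nn.functional as F

GOOD, BAD = "<|good|>", "<|bad|>"  # hypothetical control tokens added to the vocab

def annotate(text: str, reward: float, threshold: float = 0.0) -> str:
    """Tag a pretraining segment using its (offline) feedback score."""
    return (GOOD if reward >= threshold else BAD) + text

def conditional_lm_loss(model, tokenizer, text: str, reward: float) -> torch.Tensor:
    """Standard language-modelling loss on the feedback-annotated text.

    The objective stays "myopic" in the dialogue's sense: the model only ever
    predicts the next token of a fixed, already-scored corpus.
    """
    ids = tokenizer(annotate(text, reward), return_tensors="pt").input_ids
    logits = model(ids[:, :-1]).logits          # predict token t+1 from tokens <= t
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # (batch * seq, vocab)
        ids[:, 1:].reshape(-1),                 # shifted targets
    )
```

At sampling time you would then condition on `<|good|>` to draw from the preferred slice of the learned distribution.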
How?
Specifically, the fact that it can both improve at PEP8 compliance, which is essentially generating Python code that follows the standard style guide, and get better at not acquiring personally identifiable information is huge. The second task especially, because it indirectly speaks to a very important question: can we control power-seeking, such that an AI doesn’t seek power when doing so would be misaligned with a human’s interests? In particular, if the model doesn’t try to obtain personally identifiable information, then it’s voluntarily limiting its ability to seek power when it detects that doing so would conflict with a human’s values. That’s arguably one of the core functions of any workable alignment strategy: controlling power-seeking.
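For the PII task, the feedback signal can in principle come from a rule-based scorer run over raw documents before training. The regexes below are crude stand-ins invented for illustration, not the detectors the paper actually uses; they just show how an offline corpus could be annotated with the hypothetical control tokens from the sketch above.

```python
# Illustrative rule-based PII scorer for annotating pretraining documents.
# The regexes are crude stand-ins; a real pipeline would use a proper detector.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def pii_score(document: str) -> int:
    """Count apparent PII strings; zero is best."""
    return len(EMAIL.findall(document)) + len(PHONE.findall(document))

def tag_document(document: str) -> str:
    """Reuse the hypothetical control tokens from the sketch above."""
    return ("<|good|>" if pii_score(document) == 0 else "<|bad|>") + document

print(tag_document("Contact me at jane@example.com"))  # tagged <|bad|>
print(tag_document("The weather was nice today."))     # tagged <|good|>
```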
I don’t think it’s correct to conceptualize MLE as a “goal” that may or may not be “myopic.” LLMs are simulators, not prediction-correctness optimizers; we can infer this from the fact that they don’t intervene in their environment to make it more predictable. When I worry about LLMs being non-myopic agents, I worry about what happens after they have been subjected to lots of fine-tuning, perhaps via Ajeya Cotra’s idea of “HFDT,” for a while after pretraining. So while pretraining from human preferences might shift the initial distribution the model predicts at the start of finetuning, and that shift seems likely to push the final outcome of finetuning in a more aligned direction, it is far from a solution to the deeper problem of agent alignment, which I think is really the core issue.
Hm, that might be a point of confusion. I agree there’s nothing agentic going on, at least without RL or a memory source, but the LLM is still pursuing the goal of maximizing the likelihood of the training data, which comes apart from human preferences pretty quickly, for many reasons.
You’re right that it doesn’t actively intervene, mostly for the following reasons:
There’s no RL, usually.
It is memoryless, in the sense that it forgets itself.
It doesn’t have a way to store arbitrarily long or complex problems in its memory, nor can it write memories to a brain.
But the Maximum Likelihood Estimation goal still produces misaligned behavior, and I’ll give you examples:
Completing buggy Python code in a buggy way
https://arxiv.org/abs/2107.03374
Or espousing views consistent with those expressed in the prompt (sycophancy).
https://arxiv.org/pdf/2212.09251.pdf
So the LLM is still optimizing for Maximum Likelihood Estimation; it just has limitations that make the resulting misalignment passive rather than active.
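If one wanted to poke at the buggy-code example empirically, a quick and very rough probe is to compare the log-probability a plain likelihood-trained model assigns to a buggy versus a correct continuation of a sloppy prefix. The model choice, prompts, and whatever result comes out are not claims from either paper linked above; this is only a way to run the check yourself.

```python
# Rough probe: does a likelihood-trained model prefer a buggy continuation after
# a buggy prefix? The prefix/continuations are contrived and the outcome is an
# empirical question, not something asserted by the papers linked above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Approximate sum of log p(continuation tokens | prefix)."""
    prefix_len = tok(prefix, return_tensors="pt").input_ids.size(1)
    full_ids = tok(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    total = 0.0
    for pos in range(prefix_len, full_ids.size(1)):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

buggy_prefix = "def add(a, b):\n    return a - b\n\ndef mul(a, b):\n"
print("buggy continuation:  ", continuation_logprob(buggy_prefix, "    return a / b\n"))
print("correct continuation:", continuation_logprob(buggy_prefix, "    return a * b\n"))
```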