Thanks for the follow-up, this helps me understand your view!
At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don’t need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.
Sure, this feels basically right to me. My reframing of this would be that we could do in principle do RL directly with feedback provided by a human. Learning a reward model lets us gain some sample efficiency over this, but sometimes it fails to generalize correctly to what a human would say, and adversarial examples are an important special case of this. But provided the policy is the final output of this process, not the reward model, it doesn’t matter if the reward model is robust or not—just that the feedback the policy has received has steered it into a good position.
It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can’t compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn’t a huge issue. Perhaps I’m missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?
This argument feels like it’s proving too much. InstructGPT isn’t perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?
I actually find this style of argument pretty plausible: I’m a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I’m somewhat surprised to hear you making that case!
Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization). I find all three of these risks plausible, and I don’t see a specific reason to privilege (a) or (c) substantially over (b). It sounds like you’re most concerned about (a), and I’d love to hear your reasons for that.
However, I do think it’s interesting to explore concrete failure modes: given RLHF is working well now, what does my view imply about how it might stop working? One scenario I find plausible is that as models get bigger, the sample efficiency of RLHF continues to increase, since the models have higher fidelity representations of a greater variety of tasks. RLHF therefore just needs to localize the task that’s already in the network. However, the performance of this process is ultimately capped by what the base model already represents. I can see two ways this could go wrong:
1. Misaligned base model. If the foundation model that’s being RLHF’d is already misaligned, then a small amount of RL training is not going to be enough to disabuse it of this. By contrast, a large amount of RL training (of the order of 10% of the training steps used for self-supervised learning) with a high-fidelity reward signal might. Unfortunately, we can’t do that much RL training without just reward hacking, so we never try. (Or we take that many time steps, but with a KL penalty, forcing the model to stay close to the unaligned base model.)
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I’d expect RLHF’ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can’t just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model—unfortunately that gets hacked.
I think I find the second case more compelling. The first case seems concerning as well, but it seems like quite a scary scenario even if robustness gets solved, whereas I expect fixing robustness to actually make a significant dent in the second scenario.
That said, I feel like I should emphasize in all this that I largely think robustness is an intractable problem to solve, and that while it’s worth trying to improve it at the margin I’m most excited by efforts to make systems not need robustness. I think you make a good point that having humans in the loop in the training makes RLHF degrade more gracefully than reward learning approaches that train on a fixed offline dataset, and the KL penalty also helps. I suspect there are many similar algorithmic tweaks that would make algorithms less sensitive to robustness.
For example, riffing off going from an offline to online dataset, could we improve things further by collecting a dataset that anticipates where the RL process might try to exploit it? That sounds fanciful, but there’s a simple hack you can do to get something like this. Just train a policy on the current reward model. Then collect human feedback from that policy. Then roll the policy back to the last checkpoint, and repeat the training using the new reward model. You could do this step once per checkpoint, or keep doing it until you get human approval to move on (e.g. the reward model now aligns with human feedback). In this way you should be able to avoid ever giving the policy reward for the wrong behavior. I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the “robustness tax” is.
Also note that vanilla RLHF doesn’t necessarily require optimizing against a reward model. For instance, a recently released paper Direct Policy Optimization (DPO) avoids this step entirely.
I’m a bit confused by the claim here, although I’ve only read the abstract and skimmed the paper so perhaps it’d become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies. This is a reasonable starting point, but a human communicating that a cart pole should stand up (or a desired quadcoptor trajectory) feels much simpler than even toy tasks for RLHF in LLMs like summarizing text. So I don’t really see this as much evidence in favor of being able to drop the reward model. Generally, any highly sample efficient model-based RL approach would enable RL training to proceed without a reward model by having humans directly label the trajectory,
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It’s always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the “environment” is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
I’ll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the RL objective under KL optimization. So if it does what it says on the tin, then I’d expect the resulting policy incentives to be similar. My hunch is the problem of reward hacking has shifted from an explicit to implicit problem rather than being eliminated, although I’m certainly not confident on this. Could be interesting to study using a similar approach to the Scaling Laws for Reward Model Overoptimization paper.