Thanks for flagging this disagreement, Ryan. I enjoyed our earlier conversation (on LessWrong and in person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I’ll try to summarize my current take and our key disagreements for the benefit of other readers.
I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifically, I’m referring to the case where we perform RL on a learned reward model; that reward model is trained based on human feedback from an earlier version of the RL policy; and this process iterates. In this case, if the RL algorithm learns to exploit the reward model (which it will, in contemporary systems, without some regularization like a KL penalty) then the reward model will receive corrective feedback from the human. At worst, this process will just not converge, and the policy will just bounce from one adversarial example to another—useless, but probably not that dangerous. In practice, it’ll probably work fine given enough human data and after tuning parameters.
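To make the setup I have in mind concrete, here’s a toy sketch of that iterated loop. Everything in it is made up for illustration (the “human” reward function, the deliberately weak cubic reward model, and the quadratic stand-in for a KL penalty are placeholders, not anyone’s actual pipeline): the policy optimizes the learned reward model, the human then labels the policy’s own outputs, and the reward model is refit on that corrective feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # Stand-in for human preferences: peaked at x = 1.
    return np.exp(-(x - 1.0) ** 2)

def fit_reward_model(xs, ys):
    # Deliberately weak proxy: a cubic fit, which extrapolates badly and is
    # therefore easy to exploit outside the labeled region.
    coeffs = np.polyfit(xs, ys, deg=3)
    return lambda x: np.polyval(coeffs, x)

# Initial human labels near the reference behavior (x ~ 0).
xs = rng.normal(0.0, 1.0, size=20)
ys = true_reward(xs)

for rnd in range(5):
    reward_model = fit_reward_model(xs, ys)
    # "Policy optimization": pick whatever scores best under the reward model,
    # minus a crude quadratic stand-in for a KL penalty toward the reference.
    candidates = rng.normal(0.0, 3.0, size=2000)
    scores = reward_model(candidates) - 0.1 * candidates ** 2
    best = candidates[np.argmax(scores)]
    print(f"round {rnd}: policy picks x={best:+.2f}, "
          f"reward model says {reward_model(best):+.2f}, "
          f"human says {true_reward(best):+.2f}")
    # Corrective feedback: the human labels samples near the current exploit,
    # and the reward model is refit on the next iteration.
    new_xs = best + rng.normal(0.0, 0.3, size=20)
    xs = np.concatenate([xs, new_xs])
    ys = np.concatenate([ys, true_reward(new_xs)])
```

Run it and you can see the dynamic described above: the policy tends to wander to wherever the weak reward model over-extrapolates, and the fresh human labels pull the reward model back each round, so the process bounces around rather than silently diverging.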
However, I think sample efficiency could be a really big deal! I expect that resolving this issue of overseers being exploited could change the asymptotic sample complexity (e.g. from exponential to linear) rather than just the constant factor. My understanding is that your take is that sample efficiency is unlikely to be a problem because RLHF works fine now, is fairly sample efficient, and improves with model scale—so why should we expect it to get worse?
I’d argue first that sample efficiency now may actually be quite bad. We don’t exactly have any contemporary model that I’d call aligned. GPT-4 and Claude are a lot better than what I’d expect from base models their size—but “better than just imitating internet text” is a low bar. I expect that if we had ~infinite high-quality data to do RLHF on, these models would be much more aligned. (I’m not sure whether ~infinite data of the same quality we have now would help; I tend to assume you can compensate for less quantity with higher quality, but there are obviously some limits here.)
I’m additionally concerned that sample efficiency may be highly task-dependent. RLHF is a pretty finicky method, so we tend to see only its success cases. What if there are certain tasks that it’s really hard to use RLHF for (perhaps because the base model doesn’t already have a good representation of them)? There’ll be a strong economic pressure to develop systems that do those tasks anyway, just using less reliable proxies for the task objective.
(A similar argument will apply to various recursive oversight schemes and to debate.)
This might be the most interesting disagreement, and I’d love to dig into this more. With RLHF I can see how you can avoid the problem given sufficient samples, since the human won’t be fooled by the adversarial example (AdvEx). But this stops working in domains where you need scalable oversight: the inputs are too complex for a human to judge, so the human can’t provide any corrective input.
The strongest argument I can see for your view is that scalable oversight procedures already have to deal with a human that says “I don’t know” for a lot of inputs. So perhaps you can make a base model that perfectly mimics what the human would say on a large subset of inputs, and says “I don’t know” on AdvExes (as well as some other inputs). This is still a hard problem—my impression is that adversarial example detection is still far from solved—but it’s plausibly a fair bit easier than full robustness (which I suspect isn’t possible). Then you can just use your scalable oversight procedure to make the “I don’t knows” go away.
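Here’s a minimal sketch of how I’m picturing that, with everything (the nearest-neighbor judge, the distance-based OOD score, the oversight placeholder) invented for illustration rather than taken from any existing system: the learned judge imitates the human where it has coverage, abstains on suspected AdvExes / off-distribution inputs, and only the abstentions get escalated to the expensive scalable oversight procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def human_judgment(x):
    # Stand-in for the (slow, expensive) human evaluation.
    return float(np.sin(x))

# Inputs the human has already judged.
train_xs = rng.uniform(-2.0, 2.0, size=200)
train_ys = np.array([human_judgment(x) for x in train_xs])

def learned_judge(x, ood_threshold=0.2):
    # Nearest-neighbor imitation of the human, with distance to the nearest
    # labeled input as a very crude out-of-distribution / AdvEx score.
    dists = np.abs(train_xs - x)
    i = int(np.argmin(dists))
    if dists[i] > ood_threshold:
        return None  # "I don't know"
    return float(train_ys[i])

def scalable_oversight(x):
    # Placeholder for whatever procedure (debate, recursive decomposition, ...)
    # resolves the cases the judge abstains on; here it just asks the human.
    return human_judgment(x)

for x in [0.3, 1.7, 5.0]:  # 5.0 is far outside the labeled region
    verdict = learned_judge(x)
    if verdict is None:
        print(f"x={x}: judge says 'I don't know', escalating "
              f"-> {scalable_oversight(x):+.2f}")
    else:
        print(f"x={x}: judge answers {verdict:+.2f}")
```

The hard part, of course, is the abstention rule itself: a real system would need something much better than a distance threshold, which is exactly the (unsolved) AdvEx-detection problem mentioned above.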
Alternatively, if you think the issue is that periodically being incentivized to adversarially attack the reward model has seriously problematic effects on the inductive biases of RL, it seems relevant to argue for why this would be the case. I don’t really see why this would be important. It seems like periodically being somewhat trained to find different AdvExes shouldn’t have much effect on how the AI generalizes?
I think this is an area where we disagree, but it doesn’t feel central to my view—I can see it going either way, and I think I’d still be concerned about whether the oversight process is robust even if the process weren’t path-dependent (e.g. if we just randomly restarted the policy every time we update the reward model).