Many proposed solutions to the alignment problem involve one “helper AI” providing a feedback signal that steers the main AI system towards desirable behavior. Unfortunately, if the helper AI is vulnerable to adversarial attack, then the main AI system will receive a higher rating from the helper by exploiting it than by achieving the desired task.
…
To address this, we have proposed a new research program of fault-tolerant alignment strategies.
I remain skeptical that this is a serious problem, as opposed to a reason that sample efficiency is somewhat lower than it otherwise would be. This also changes the inductive biases of various RL schemes by changing the intermediate incentives of the policy, but these changes seem mostly innocuous to me.
(To be clear, I think AIs not being adversarially robust is generally notable and has various implications, I just disagree that it causes any issues for these sorts of training schemes which use helper AIs.)
Another way to put my claim is “RLHF is already very fault tolerant and degrades well with the policy learning to adversarially attack the reward model”. (A similar argument will apply for various recursive oversight schemes or debate.)
In my language, I’d articulate this sample efficiency argument as “because the policy AI will keep learning to attack the non-robust helper AI between rounds of human preference data that fix the latest attack, human preference data will be required more frequently than it otherwise would be. This lack of sample efficiency will be a serious problem because [INSERT PROBLEM HERE]”.
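To make the shape of this argument concrete, here’s a toy simulation of the loop I’m describing. Everything in it is made up for illustration (a hypothetical finite pool of reward-model exploits, a fixed human-labelling cost to patch each one); the point is just that the cost of the non-robust helper shows up as extra human preference data, not as anything qualitatively different.

```python
# Toy sketch of the attack-then-patch loop described above; all names and
# numbers are hypothetical and chosen purely for illustration.
import random

random.seed(0)

NUM_POSSIBLE_ATTACKS = 20   # assumed number of distinct reward-model exploits
LABELS_PER_PATCH = 100      # assumed human preference labels needed per fix


def policy_finds_attack(patched):
    """The policy optimizes against the reward model and surfaces some
    not-yet-patched adversarial input, if any remain."""
    remaining = [a for a in range(NUM_POSSIBLE_ATTACKS) if a not in patched]
    return random.choice(remaining) if remaining else None


def run_training_loop():
    patched = set()
    human_labels_used = 0
    while True:
        attack = policy_finds_attack(patched)
        if attack is None:
            break  # no exploit found; the policy optimizes the intended task
        # Fresh human preference data identifies the exploit and corrects the
        # reward model, "patching" this particular attack.
        patched.add(attack)
        human_labels_used += LABELS_PER_PATCH
    return human_labels_used


print("human labels spent patching exploits:", run_training_loop())
```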
I personally don’t really see strong arguments for why improvements to sample efficiency are super leveraged at the margin, given commercial incentives and some other factors. Beyond this, I think just working directly on the sample efficiency of RL seems like a pretty reasonable way to improve it, if you do think it’s very important and net positive. It’s a straightforward empirical problem, and I don’t think there are any serious changes to the problem between GPT-4 and AGI which would mean that normal empirical iteration isn’t viable.
Alternatively, if you think the issue is that periodically being incentivized to adversarially attack the reward model has seriously problematic effects on the inductive biases of RL, it seems relevant to argue for why this would be the case. I don’t really see why this would be important. It seems like periodically being somewhat trained to find different advexes (adversarial examples) shouldn’t have much effect on how the AI generalizes?
(See also this comment thread here; Adam and I also talked about this some in person afterwards.)
Thanks for flagging this disagreement, Ryan. I enjoyed our earlier conversation (on LessWrong and in person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. I’d love to catch up again sometime and see if we can converge more on this. I’ll try to summarize my current take and our key disagreements for the benefit of other readers.
I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifically, I’m referring to the case where we perform RL against a learned reward model; that reward model is trained on human feedback about outputs from an earlier version of the RL policy; and this process iterates. In this case, if the RL algorithm learns to exploit the reward model (which it will, in contemporary systems, without some regularization like a KL penalty), then the reward model will receive corrective feedback from the human. At worst, this process will simply not converge, and the policy will just bounce from one adversarial example to another: useless, but probably not that dangerous. In practice, it’ll probably work fine given enough human data and after tuning parameters.
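For readers less familiar with the setup: the inner objective here is usually something like the reward-model score minus a KL penalty toward a reference policy. Here’s a minimal numeric sketch (the coefficient and all scores are invented) of why a small penalty only partially discourages exploits:

```python
# Minimal numeric sketch of the KL-regularized objective mentioned above.
# The coefficient and every score below are invented for illustration.

BETA = 0.1  # assumed KL penalty coefficient


def kl_penalized_reward(rm_score, logprob_policy, logprob_reference):
    """Per-sample RLHF-style objective: reward-model score minus BETA times
    the log-probability ratio between the policy and a reference policy."""
    return rm_score - BETA * (logprob_policy - logprob_reference)


# On-distribution sample: modest reward, tiny drift from the reference.
print(kl_penalized_reward(rm_score=1.0, logprob_policy=-2.0, logprob_reference=-2.2))

# Reward-model exploit: large spurious reward, reachable only by drifting far
# from the reference. With a small BETA the exploit still scores higher, which
# is why insufficient regularization lets the policy learn the attack.
print(kl_penalized_reward(rm_score=5.0, logprob_policy=-0.5, logprob_reference=-9.0))
```

Turning BETA up suppresses the exploit, but it also limits how much the policy can improve on the intended task, which is part of why the human corrective-feedback loop still matters.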
However, I think sample efficiency could be a really big deal! I expect that resolving this issue of overseers being exploited could change the asymptotic sample complexity (e.g. from exponential to linear) rather than just changing the constant factor. My understanding is that your take is that sample efficiency is unlikely to be a problem because RLHF works fine now, is fairly sample efficient, and improves with model scale, so why should we expect it to get worse?
I’d argue first that sample efficiency now may actually be quite bad. We don’t exactly have any contemporary model that I’d call aligned. GPT-4 and Claude are a lot better than what I’d expect from base models of their size, but “better than just imitating internet text” is a low bar. I expect that if we had ~infinite high-quality data to do RLHF on, these models would be much more aligned. (I’m not sure whether having ~infinite data of the same quality we have now would help; I tend to assume you can make up for less quantity with increased quality, but there are obviously some limits here.)
I’m additionally concerned that sample efficiency may be highly task-dependent. RLHF is a pretty finicky method, so we tend to see only its success cases. What if there are certain tasks that it’s really hard to use RLHF for (perhaps because the base model doesn’t already have a good representation of the task)? There’ll be strong economic pressure to develop systems that do those tasks anyway, just using less reliable proxies for the task objective.
(A similar argument will apply for various recursive oversight schemes or debate.)
This might be the most interesting disagreement, and I’d love to dig into it more. With RLHF I can see how you avoid the problem given sufficient samples, since the human won’t be fooled by the advex. But this stops working in domains where you need scalable oversight: the inputs are too complex for a human to judge, so the human can’t provide any corrective input.
The strongest argument I can see for your view is that scalable oversight procedures already have to deal with a human who says “I don’t know” for a lot of inputs. So perhaps you can make a base model that perfectly mimics what the human would say on a large subset of inputs, and says “I don’t know” on advexes (as well as some other inputs). This is still a hard problem (my impression is that adversarial example detection is still far from solved), but it’s plausibly a fair bit easier than full robustness (which I suspect isn’t possible). Then you can just use your scalable oversight procedure to make the “I don’t knows” go away.
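Here’s a rough sketch of how I’m imagining that scheme (every name and the toy “detector” below are hypothetical stand-ins, not a real proposal): a judge model that abstains on inputs it flags as possible advexes, with abstentions escalated to the scalable oversight procedure rather than being scored directly.

```python
# Rough sketch of the "abstaining judge" idea above. All names and the toy
# detector are hypothetical stand-ins; real advex detection is unsolved.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Judgement:
    score: Optional[float]  # None means "I don't know"


def looks_adversarial(x: str) -> bool:
    """Placeholder detector; assumes we can flag likely adversarial inputs."""
    return "advex" in x


def base_judge(x: str) -> Judgement:
    """Stand-in for a model imitating the human judge, abstaining on inputs
    it flags as possible adversarial examples."""
    if looks_adversarial(x):
        return Judgement(score=None)
    return Judgement(score=float(len(x) % 5))  # placeholder scoring rule


def scalable_oversight(x: str) -> float:
    """Stub for a recursive-oversight / debate procedure that resolves the
    cases the base judge abstained on."""
    return 0.0  # assume the procedure scores the exploit attempt poorly


def reward(x: str) -> float:
    judgement = base_judge(x)
    if judgement.score is None:
        return scalable_oversight(x)  # escalate the "I don't know" cases
    return judgement.score


print(reward("an ordinary model output"))             # judged directly
print(reward("an advex-style exploit of the judge"))  # escalated to oversight
```

The hard part, of course, is the detector: this sketch simply assumes abstention on advexes is achievable, which is exactly the open problem.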
Alternatively, if you think the issue is that periodically being incentivized to adversarially attack the reward model has seriously problematic effects on the inductive biases of RL, it seems relevant to argue for why this would be the case. I don’t really see why this would be important. It seems like periodically being somewhat trained to find different advexes shouldn’t have much effect on how the AI generalizes?
I think this is an area where we disagree, but it doesn’t feel central to my view. I can see it going either way, and I think I’d still be concerned about whether the oversight process is robust even if the process weren’t path-dependent (e.g. if we randomly restarted the policy every time we updated the reward model).