Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable.
Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default).
In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.
Alignment Agenda | Main System’s Ability to Exploit Helper | Main System’s Incentive to Exploit Helper | Risk of Exploit |
---|---|---|---|
RL on learned reward model (e.g. RLHF, IRL) | Medium | High | High |
Scalable oversight (e.g. recursive reward modeling, AI safety via debate) | Medium | High | High |
Imitation learning (e.g. behavioral cloning, supervised fine-tuning) | Medium | Low | Low-Medium |
Imitative Iterated Distillation and Amplification | High | Low | Medium |
Auditing Tool (e.g. Adversarial Testing, Transparency) | Low | Medium | Low-Medium |
Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed.
We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another).
We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach. By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1] but with only modest progress (low to medium tractability).
In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by.
Alignment agendas need robustness
Most alignment schemes implicitly depend on some aligned helper ML systems that can assist us in training or validating a main ML system, like a proto-AGI. In this section we assume the helper is exploitable, and explore the problems this causes for these alignment schemes. We justify this assumption in the next section.
A helper is more likely to be exploited if the main system is incentivized to exploit it, such as by the helper directly providing a training signal the main system is being optimized for, or if the main system has gained situational awareness. Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization. In the best case, we might even be able to (as in the case of IDA) control the optimization pressure to explicitly avoid inputs that might be adversarial.
The likelihood of an exploit also depends on the ability of the main system. This includes the computational resources available to develop an exploit, which could range from just forward-passes of the main system’s network to the main system having access to a large compute cluster. It also depends on how much information the main system has about the helper, which could range from merely a high-level knowledge of possible evaluation techniques (black box), through to query access to the helper (gray box), through to full access to its weights and source code (white box). We can’t rely on security by obscurity: many zero-shot black-box adversarial attacks have been conducted by human researchers, and superintelligent ML systems might be even more creative. Still, even superintelligent ML systems are not omnipotent, and in practice the “adversary” might be a relatively stupid optimization process like SGD.
Referring back to Table 1, one class of alignment agendas that is particularly sensitive to a lack of robustness consists of those where the helper is an overseer providing feedback to the main system, such as in RLHF and most scalable oversight proposals. In this case, exploiting the helper is part of the training objective of the main system, which explains the ubiquity of reward hacking.
Corrective feedback from a human overseer can help mitigate reward hacking when the reward model is trained online, but there remains a significant risk that the main system hacks the helper more rapidly than a human is able to provide corrective feedback. Alternatively, the main system may simply learn to trick the human. Scalable oversight methods are designed to guard against tricking the human, but cannot protect against scenarios that are adversarial to both the human and scalable oversight model.
Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. Unfortunately, it also maxes out at around human performance.
Imitative Iterated Distillation and Amplification (IDA) is less sensitive to exploitability. IDA is a proposal to take an imitation learned model and amplify it beyond human-level performance. Casting it in our framework, the helper is a base imitation-learned model that we presume is safe and aligned. The main system is the model produced by IDA on the helper. At the first step, the helper is amplified by giving it the ability to ask questions to copies of itself. Critically, the helper has an incentive to avoid asking questions that are likely to cause it to fail. In this sense, IDA channels the optimization pressure away from adversarial exploitation, unlike e.g. vanilla RL.
The flipside is that the helper also likely has a high ability to exploit itself, owing to its introspective knowledge of its own vulnerabilities. [2] This could result in some instability in the training scheme: if at any point a sub-part of the question tree is adversarial, the response could propagate to “infect” the rest of the question tree. This is related to the problem of security amplification.
One of the less concerning cases is for tools used only to evaluate, and not train a model, as is often the case for adversarial testing or transparency. Here, the main system has no direct incentive to fool the helper as it does not provide a training signal. However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters. This would result in problems similar to those observed today with researchers overfitting to the test set.
Even if we only used these helper methods as a one-off final check on the main system, the main system might still be incentivised to fool these evaluation methods if it has situational awareness and long-term objectives. However, it will be a challenging exploit to pull off, as the main system may have only partial knowledge of the design of the helper (e.g. what tests the helper will perform) and cannot engage in direct trial-and-error exploitation of it. Additionally, transparency tools are typically applied to the weights of the main system, which the main system can at best only indirectly control (by gradient hacking). [3] Unfortunately, evaluation methods are not themselves solutions to alignment, but can provide an additional safeguard to avoid deployment of malign systems.
Contemporary ML systems are exploitable by default
Our best guess is that all contemporary ML systems solving non-trivial tasks are exploitable by a moderately resourced adversary. ℓp-norm adversarial examples in image classifiers were first described by Szegedy et al. in 2013, and nearly a decade later state-of-the-art image classifiers remain vulnerable despite intense research interest in adversarial defenses. These vulnerabilities can be exploited in real-world settings by physical adversarial attacks, and there are even naturally occurring images that are challenging for a wide variety of models. Moreover, analogous issues have been found in a diverse range of ML systems including language models, graph analysis, robotic policies and superhuman Go programs.
To the best of our knowledge, no ML system solving a non-trivial problem has ever withstood a well-resourced attack. [4] Adversarial defenses can be divided into those that are broken, and those that have not yet attracted concerted effort to break them. This should not be too surprising: the same could be said of most software systems in general.
One difference is that software security has notably improved over time. Although there almost certainly exist remote root exploits in most major operating systems, finding one is decidedly non-trivial, and is largely out of reach of most attackers. By contrast, exploiting ML systems is often alarmingly easy.
Figure 1: A typographic attack enables a no-code exploit of OpenAI CLIP.

This is not to say we haven’t made progress. There has been an immense amount of work defending against ℓp-norm adversarial examples, and this has made attacks harder: requiring more sophisticated methods, or a larger ℓp-norm perturbation. For example, the state-of-the-art (SOTA) method DensePure achieves 77.8% certified accuracy on ImageNet for perturbations up to 0.5/255 ℓ2-norm. However, this accuracy is still far behind the SOTA for clean images, which currently stands at 91.0% top-1 accuracy with CoCa. Moreover, the certified accuracy of DensePure drops to 54.6% at a 1.5/255 ℓ2-norm perturbation – which is visually imperceptible to humans. This is well below the 62% achieved by AlexNet back in 2012.
There is substantial evidence for a trade-off between accuracy and robustness. Tsipras et al (2019) demonstrate this trade-off theoretically in a simplified setting, and there is ample empirical evidence as well. For example, DensePure was SOTA in 2022 for certified accuracy on adversarial inputs but achieved only 84% accuracy on clean images, an accuracy that non-robust models such as AmoebaNet-A had already reached four years earlier, in 2018. There therefore appears to be a significant “robustness tax” to pay, analogous to the alignment tax.[5]
In addition to certified methods such as DensePure, there are also a variety of defense methods that provide empirical protection against adversarial attack but without provable guarantees. However, the protection they provide is partial at best. For example, the SOTA method DiffPure achieves 74% accuracy on clean images on ImageNet but only 43% accuracy under a 4/255 ℓ∞-norm perturbation. There is also a significant robustness tax here: Table 5 from the DiffPure paper shows that accuracy on clean images drops from 99.43% on CelebA-HQ to 94% with the diffusion defense.
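To make the threat model concrete, here is a minimal sketch of the kind of ℓ∞-bounded projected gradient descent (PGD) attack these defenses are typically evaluated against. The code is illustrative only: `model` is any PyTorch classifier, and the 4/255 budget and step counts are common choices rather than values from any particular paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, step_size=1/255, n_steps=20):
    """l-infinity PGD: random start, then signed-gradient ascent on the loss,
    projecting back into the eps-ball around x after every step."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(n_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()   # move uphill on the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # keep pixels valid
    return x_adv.detach()
```

Against an undefended classifier, even this simple attack typically reduces accuracy to near zero, which is why defenses are benchmarked against it.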
To make matters worse, real attackers have a much broader range of possible attacks available, as outlined by Gilmer et al (2018): rotating images, perturbing physical parameters in rendered images, adversarially selecting images from a real-world dataset, adversarial patches, single-pixel attacks and latent adversarial perturbations. We would like to be robust to all of these attacks, but there appear to be fundamental trade-offs between robustness to different attacks, with Tramer et al (2019) showing such a trade-off between different types of ℓp-bounded and spatial perturbations. Moreover, there are currently no effective methods to defend against unrestricted adversarial examples outside of toy settings.
Although the ubiquitous presence of adversarial examples in contemporary ML systems is concerning, there is one glimmer of hope. Perhaps these adversarial examples are merely an artifact of the ML systems being insufficiently capable? Once the system reaches or surpasses human-level performance, we might hope it would have learned a set of representations at least as good as that of a human, and be no more vulnerable to adversarial attack than we are.
Unfortunately, recent work casts doubt on this. In Wang et al (2022), we find adversarial policies that beat KataGo, a superhuman Go program. We trained our adversarial policy with less than 14% of the compute used to train KataGo, yet it wins against a superhuman version of KataGo 97% of the time. This is not specific to KataGo: our exploit transfers to ELF OpenGo and Leela Zero, and in concurrent work from DeepMind, Timbers et al (2022) were able to exploit an in-house replica of AlphaZero.
Of course, results in Go may not generalize to other settings, but we chose to study Go precisely because we expected these systems to be unusually hard to exploit. In particular, since Go is a zero-sum game, being robust to adversaries is the key design objective, rather than merely one desideratum amongst many. Additionally, KataGo and AlphaZero use Monte-Carlo Tree Search coupled with a neural network evaluation. In general, we would expect search (which is provably optimal in the limit) to be harder to exploit than neural networks alone. Although search does make the system harder to exploit, we are able to attack it even up to 10 million visits – far in excess of the threshold needed for superhuman performance, and well above the level used in most games.
There remains a possibility that although narrowly superhuman systems are vulnerable, more general systems might be robust. Large language models are the most general systems we have today, yet work by Ziegler et al (2022) finds they are still exploitable even after significant adversarial training. Moreover, the existence of apparently fundamental tradeoffs between accuracy and robustness suggests that the most capable AI systems at any given time may be particularly likely to be vulnerable (Tsipras et al, 2019; Tramer et al, 2019).
Of course, at some point systems might be developed that are adversarially robust. This could be by “overshooting” on capability and generality, and then paying a robustness tax to get a suitably capable or general but robust system. Alternatively, new techniques might be developed that reduce or eliminate the robustness tax. Most optimistically, it is possible that general, human-level systems are naturally robust even though generality or human-level performance on their own are insufficient. In the next section, we will consider different possibilities for when and if adversarially robust systems are developed, and the implications this has for safety.
Future trajectories for robustness
We will consider three possible cases:
We solve adversarial robustness before transformative AI is developed;
We solve it after transformative AI is developed;
It is never solved.
Although coarse-grained, we believe this case split captures the most important distinctions.
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes. This is intended to be a low bar. In particular, this definition tolerates bounded errors. For example, we would tolerate threat actors being able to permanently trick AI systems into giving them 10% more resources in a trade. We’d also tolerate threat actors being able to temporarily divert even the majority of the AI’s resources, so long as this did not lead to permanent negative effects and that attackers eventually run out of such exploits.
We summarize our subjective credence in each of the cases below, and explore the cases qualitatively in the following sections.
Before you read further, consider estimating your credence for the following question, corresponding to Case 1:
I’d like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs per se. (I agree that it requires that humans are adversarially robust to the AI, but this is a pretty different problem.)
In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher not lower sample efficiency with scale, so this seems on track.
This same reasoning also applies to recursive oversight schemes or debate.
Overall, my guess is that commercial incentives and normal scaling will cause the sample efficiencies of these processes to be quite high in practice. Additionally, my guess would be that improving adversarial robustness isn’t the best way to push on sample efficiency.
That said, I do think there are many important reasons to improve adversarial robustness.
To check my understanding, is your view something like:
1. If the reward model isn’t adversarially robust, then the RL component of RLHF will exploit it.
2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.
3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.
4. If we repeat this iterative process enough times, we’ll end up with a robust reward model.
Under this model, improving adversarial robustness just means we need fewer iterations, showing up as improved sample efficiency.
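Schematically, the loop I have in mind looks something like the sketch below. This is only to pin down the picture; `train_policy_step`, `collect_human_labels` and `fit_reward_model` are placeholder names, not any real codebase.

```python
def online_rlhf(policy, reward_model, prompts, n_rounds):
    """Sketch of the (1)-(4) loop: the policy exploits the current reward
    model, humans label the exploiting generations, and the refit reward
    model closes that particular hole. All helpers are placeholders."""
    labels = []
    for _ in range(n_rounds):
        # (1) RL against the current, imperfectly robust reward model.
        policy = train_policy_step(policy, reward_model, prompts)
        # (2) The resulting generations, exploits included, go to humans.
        samples = [policy.generate(p) for p in prompts]
        labels += collect_human_labels(samples)
        # (3) Corrective signal: refit the reward model on the new labels.
        reward_model = fit_reward_model(reward_model, labels)
    # (4) More robustness per round means fewer rounds, i.e. better sample efficiency.
    return policy, reward_model
```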
I agree with this view up to a point. It does seem likely that with sufficient iterations, you’d get an accurate reward model. However, I think the difference in sample efficiency could be profound: e.g. exponential (needing to get explicit corrective human feedback for most adversarial examples) vs linear (generalizing in the right way from human feedback). In that scenario, we may as well just ditch the reward model and provide training signal to the policy directly from human feedback.
In practice, we’ve seen that adversarial training (with practical amounts of compute) improves robustness but models are still very much vulnerable to attacks. I don’t see why RLHF’s implicit adversarial training would end up doing better than explicit adversarial training.
In general I tend to view sample efficiency discussions as tricky without some quantitative comparison. There’s a sense in which decision trees and a variety of other simple learning algorithms are a viable approach to AGI, they’re just very sample (and compute) inefficient.
The main reason I can see why RLHF may not need adversarial robustness is if the KL penalty from base model approach people currently use is actually enough.
This is pretty close to my understanding, with one important objection.
Thanks for responding and trying to engage with my perspective.
Objection
I don’t claim we’ll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward model which captured our local preferences about the actual current policy distribution sufficiently well.
At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don’t need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.
KL penalty is presumably an important part of the picture.
My overall claim is just that normal RLHF is pretty fault tolerant and degrades pretty gracefully with lack of robustness.
Sample efficiency seems fine now and progress continues
Beyond this, my view is considerably informed by ‘sample efficiency seems fine in practice now and is getting better, not worse, with more powerful models (from my limited understanding)’. It’s possible this trend could reverse with sufficiently big models, but I’d find this surprising and don’t see any particular reason to expect it. Technical advancement in RLHF is ongoing and will improve sample efficiency further.
It seems to me like your views imply either that sample efficiency is low enough now that high quality RLHF currently can’t compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcome-based feedback or a low quality human reward signal), or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn’t a huge issue. Perhaps I’m missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?
Also note that vanilla RLHF doesn’t necessarily require optimizing against a reward model. For instance, a recently released paper Direct Preference Optimization (DPO) avoids this step entirely. It’s unclear if DPO will win out for various reasons, but as far as I can tell, you should think that DPO should totally resolve the robustness concern with vanilla RLHF. Of course, recursive oversight schemes and other things which use AIs to help oversee AIs could still involve optimizing against AIs.
Edit: I originally linked to the wrong paper, oops
Thanks for the follow-up, this helps me understand your view!
Sure, this feels basically right to me. My reframing of this would be that we could in principle do RL directly with feedback provided by a human. Learning a reward model lets us gain some sample efficiency over this, but sometimes it fails to generalize correctly to what a human would say, and adversarial examples are an important special case of this. But provided the policy is the final output of this process, not the reward model, it doesn’t matter whether the reward model is robust; what matters is that the feedback the policy has received has steered it into a good position.
This argument feels like it’s proving too much. InstructGPT isn’t perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?
I actually find this style of argument pretty plausible: I’m a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I’m somewhat surprised to hear you making that case!
Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict the human labelers’ judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization). I find all three of these risks plausible, and I don’t see a specific reason to privilege (a) or (c) substantially over (b). It sounds like you’re most concerned about (a), and I’d love to hear your reasons for that.
However, I do think it’s interesting to explore concrete failure modes: given RLHF is working well now, what does my view imply about how it might stop working? One scenario I find plausible is that as models get bigger, the sample efficiency of RLHF continues to increase, since the models have higher fidelity representations of a greater variety of tasks. RLHF therefore just needs to localize the task that’s already in the network. However, the performance of this process is ultimately capped by what the base model already represents. I can see two ways this could go wrong:
1. Misaligned base model. If the foundation model that’s being RLHF’d is already misaligned, then a small amount of RL training is not going to be enough to disabuse it of this. By contrast, a large amount of RL training (of the order of 10% of the training steps used for self-supervised learning) with a high-fidelity reward signal might. Unfortunately, we can’t do that much RL training without just reward hacking, so we never try. (Or we take that many time steps, but with a KL penalty, forcing the model to stay close to the unaligned base model.)
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I’d expect RLHF’ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can’t just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model—unfortunately that gets hacked.
I think I find the second case more compelling. The first case seems concerning as well, but it seems like quite a scary scenario even if robustness gets solved, whereas I expect fixing robustness to actually make a significant dent in the second scenario.
That said, I feel like I should emphasize in all this that I largely think robustness is an intractable problem to solve, and that while it’s worth trying to improve it at the margin I’m most excited by efforts to make systems not need robustness. I think you make a good point that having humans in the loop in the training makes RLHF degrade more gracefully than reward learning approaches that train on a fixed offline dataset, and the KL penalty also helps. I suspect there are many similar algorithmic tweaks that would make algorithms less sensitive to robustness.
For example, riffing on the move from an offline to an online dataset, could we improve things further by collecting a dataset that anticipates where the RL process might try to exploit it? That sounds fanciful, but there’s a simple hack you can do to get something like this. Just train a policy on the current reward model. Then collect human feedback from that policy. Then roll the policy back to the last checkpoint, and repeat the training using the new reward model. You could do this step once per checkpoint, or keep doing it until you get human approval to move on (e.g. the reward model now aligns with human feedback). In this way you should be able to avoid ever giving the policy reward for the wrong behavior. I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the “robustness tax” is.
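A rough sketch of that procedure, in the “keep going until human approval” variant (placeholder helper names again; this is the idea, not a tested recipe):

```python
def rollback_rlhf_round(checkpoint, reward_model, prompts):
    """One round: probe the current reward model for exploits, patch it with
    human feedback, and only accept a policy update once humans agree with
    the reward model on the probe's behavior. All helpers are placeholders."""
    while True:
        # Optimize a probe policy from the checkpoint against the current RM.
        probe = train_policy_step(checkpoint.clone(), reward_model, prompts)
        samples = [probe.generate(p) for p in prompts]
        labels = collect_human_labels(samples)
        if reward_model_agrees_with(reward_model, labels):
            # Humans endorse what the probe was rewarded for: accept the step.
            return probe, reward_model
        # Otherwise, patch the reward model and roll back to the checkpoint,
        # so the accepted policy is never actually rewarded for the exploit.
        reward_model = fit_reward_model(reward_model, labels)
```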
I’m a bit confused by the claim here, although I’ve only read the abstract and skimmed the paper, so perhaps it’d become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies. This is a reasonable starting point, but a human communicating that a cart-pole should stand up (or a desired quadcopter trajectory) feels much simpler than even toy tasks for RLHF in LLMs like summarizing text. So I don’t really see this as much evidence in favor of being able to drop the reward model. Generally, any highly sample-efficient model-based RL approach would enable RL training to proceed without a reward model by having humans directly label the trajectory.
It’s unclear to me if this is true: modelling the humans that generated the training data sufficiently well probably requires the model to be smarter than the humans it is modelling. So I expect the current regime, where RLHF just elicits a particular persona from the model rather than teaching it any new abilities, to be sufficient to reach superhuman capabilities.
(Here’s a possibly arcane remark. It’s worth noting that I think always correct reward models are sufficient for high stakes alignment via runtime filtering (technically you just need to never give a very bad output decent reward). So, always correct reward models would be great if you could get them. Note that always correct could look pretty different than current adversarial robustness; in particular, you also need to worry about collusion and stuff like seizing physical control of the reward channel. I also think that avoiding high stakes failure via adversarial training and/or other robustness style tech seems generally good. I’ve been assuming we’re talking about the low stakes alignment problem (aka outer alignment))
I think this trend extrapolation argument seems fine in the absence of a specific defeater. In the case of AI takeover, there is a clear defeater to ‘models have been getting more aligned over time’.
I was trying to get at which specific defeater you thought overcame the trend extrapolation argument. Here’s an attempt at an exhaustive list of defeaters:
Thinking sample efficiency would get worse in the future breaking the current trend
Thinking that the current ‘trend extrapolated’ sample efficiency would be insufficient or otherwise worth improving on the margin
Thinking that there would be important negative consequences of reward model non-robustness which aren’t well described by sample efficiency (e.g., teaching the model to lie via first practicing on the reward model, or having an easier time exploring into bad behavior or something)
It’s also worth noting that if I was centrally interested in (2) I would push on that directly. (But you have other applications of robustness in mind, so this might not be that interesting.)
I still don’t understand which of (1), (2), or (3) you’re most worried about. (Maybe (3), based on some argument I don’t yet understand? I also don’t see why hacking the reward model is dangerous, which is maybe an important crux here.)
Aside on alignment trend extrapolation:
I’m also not really sure how to measure ‘aligned’ in a way that makes sense given that models have also been getting smarter. It seems plausible that alignment has been notably improving over time? Beyond this, the more natural trend extrapolation might be takeover risk. My guess is that the ex-ante takeover risk from GPT4 should have been like 0.1% and then the future/ongoing risk is more like 0.001% or something. And, the trend extrapolation doesn’t look good here : ).
Sample efficiency isn’t the main way I think about this topic so it’s a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it’d be (3).
I tend to view ML training as a model taking a path through a space of possible programs. There are some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don’t do anything particularly useful. Assuming we start with a model that is aligned (where “aligned” could include “model cannot do anything useful so does not cause any harm”) and we only reward positive behavior, I find it plausible that we can hill-climb to more capable models while preserving alignment.
However, suppose we at some point err and reward undesirable behavior. (This could occur due to incorrect human feedback, or a reward model that is not robust, or some other issues.) At this point, we’re training a sub-component of the system that is actively opposed to our interests. Hopefully, we eventually discover this sub-component, and can then disincentivize it in the training process. But at that point, there is some uncertainty in my mind: will the training process remove the sub-component, or simply train the sub-component into being better able to fool the training process?
Now, we don’t need the reward model to be perfectly robust to avoid this (as you quite rightly point out), just robust in the region of policy space around the current policy where the RL algorithm is likely to explore. But empirically current reward model robustness falls short of even this.
In response to:
you write:
No, I do expect online data collection to take place, I just don’t expect to be able to do that data collection fast enough or in large enough volumes to kick in before hacking takes place. I think in your taxonomy, this is defeater (2): I think we’ll need substantially more samples to train superhuman models than we do human models, as the demands from RLHF switch from localizing a task that a network already knows how to perform, to teaching a model to perform a new capability (safely). (I will note online data collection is a pain and people seem to try and do as little as possible of it.)
Are you assuming that we can’t collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.
I don’t really see why (b) leads to dangerous failures. It seems like failures should be totally benign and just result in somewhat lower production?
Beyond this, it seems like this failure should happen early as it doesn’t require clever models to occur, so by default there will be strong commercial incentives to resolve this.
I agree it’s an alignment failure in some sense which could be addressed by alignment technology. I just think it isn’t very important to reduce from an X-risk/AI takeover perspective.
Related to this. You say:
What specific safety properties are you thinking about?
As far as I can tell sample efficiency is the only safety property which seems important here.
There could be other important considerations beyond sample efficiency for avoiding your policy doing some hacking of the reward model, but none of the ones I can think of seem importantly dangerous to me.
Naively, I’d guess it’s slightly bad on top of the sample efficiency issues, but mostly unimportant.
Minor point, feel free to ignore.
FWIW, I typically use ‘reward hacking’ to refer to just (a) here. I’d just call (b) ‘poor reward model sample efficiency’. That said, I more centrally use ‘reward hacking’ to describe hacking a reward process based on outcomes via stuff like ‘sensor tampering’, but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.
Oh, we’re using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model’s learning process). I don’t especially care about what definitions we use here, but do wonder if this means we’re speaking past each other in other areas as well.
I originally linked to the wrong paper! : (
Here is the actual Direct Preference Optimization paper. (I guess I just googled something like ‘DPO RL’ and then didn’t actually check that it was the right paper)
Yikes, sorry for wasting your time.
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It’s always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the “environment” is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
I’ll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the RL objective under KL optimization. So if it does what it says on the tin, then I’d expect the resulting policy incentives to be similar. My hunch is that the problem of reward hacking has shifted from an explicit to an implicit problem rather than being eliminated, although I’m certainly not confident in this. Could be interesting to study using a similar approach to the Scaling Laws for Reward Model Overoptimization paper.
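For reference, my reading of the paper (so treat this as a sketch rather than gospel) is that the DPO objective on preference pairs (x, y_w, y_l) is the classification-style loss below, where π_ref is the frozen reference policy and β the KL-penalty coefficient; the reward is implicit in the log-ratio to π_ref rather than a separate model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```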
comment TLDR: Adversarial examples are a weapon against the AIs we can use for good and solving adversarial robustness would let the AIs harden themselves.
I haven’t read this yet (I will later : ) ), so it’s possible this is mentioned, but I’d note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they already hacked the ssh server.
Ofc, there are also clownier versions like trying to fight the robot armies with adversarial examples (naively seems implausible, but who knows). I’d guess that reasonably active militaries probably already should be thinking about making their camo also be a multi-system adversarial example if possible. (People are already using vision models for targeting etc.)
I don’t think this lack of adversarial robustness gets you particularly strong alignment properties, but it’s worth noting upsides as well as downsides.
Perhaps we’d ideally train our AIs to be robust to most classes of things but just ensure they remain non-robust in a few (ideally secret) edge cases. However, the most important use cases for AI exploits are in situations after AIs have already started to be able to apply training procedures to themselves without our consent.
My overall guess is that the upsides of solving adversarial robustness outweigh the downsides.
This is a good point: adversarial examples in what I called in the post the “main” ML system can be desirable, even though we typically don’t want them in the “helper” ML systems used to align the main system.
One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don’t know how they were constructed.
I generally expect that if adversarial robustness can be practically solved, then transformative AI systems will eventually self-improve to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples, but not capable enough to have overcome this, might be quite short. It could still be useful as an early-warning sign, though.
Regarding the predictions, I want to make the following quibble: According to the definition above, one way of “solving” adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)
So, a lot with this definition rests on how do you distinguish between “cannot be exploited” and “will not be exploited”.
And on reflection, I think that for some people, this is close to being a crux regarding the importance of all this research.
For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22
Thanks, I’d missed that!
Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we’ll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you’d like to see happen in this space?
Also could you clarify whether the setting was for adversarial training or just a vanilla model? “During training, adversarial examples for training are constructed by PGD attacker of 30 iterations” makes me think it’s adversarial training but I could imagine this just being used for evals.
The setting was adversarial training and adversarial evaluation. During training, a 30-iteration PGD attacker is used to construct the adversarial examples used for training. During testing, the evaluation set is an adversarial test set constructed via a 20-iteration PGD attacker.
The y-axis data are taken from Table 7 of https://arxiv.org/abs/1906.03787; the x-axis data are taken from Figure 7 of https://arxiv.org/abs/1906.03787.
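Concretely, that setup corresponds to standard PGD adversarial training, roughly as sketched below (illustrative only: `pgd_attack` stands in for a 30-step ℓ∞ PGD routine and is a hypothetical helper here; the loader and optimizer are ordinary PyTorch objects).

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer):
    """One epoch of PGD adversarial training: perturb each batch against the
    current model, then take a gradient step on the perturbed batch."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, n_steps=30)  # attack the current model
        loss = F.cross_entropy(model(x_adv), y)      # train on the adversarial batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```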
I also want to add that I really like the use of the prediction UI in this post.
Presumably in a set-up where the agent can meaningfully act in the world while still doing imitation learning, one way to reward hack would be to take a short-term imitation hit to install tons of surveillance of humans and/or make them easier to predict, so that in the long-term it will be easier to accurately imitate them, right? That said, I think I probably agree with the comparative claim you make.
Right: if the agent has learned an inner objective of “do things similar to what humans do in the world at the moment I am currently acting”, then it’d definitely be incentivized to do that. It’s not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (it’s not something humans do), and changing human behavior wouldn’t help, as BC would still be run on the dataset of pre-manipulation demos. That might be little comfort if you’re worried about inner optimization, but most of the other failures described happen even in the outer alignment case.
That said, if you had a different imitation learning setup that was something like doing RL on a reward of “do the same thing one of our human labelers chooses given the same state”, then the outer objective would reward the behavior you describe. It’d be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.
Maybe I’m missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won’t checking for the presence of mesa-optimization be relevant to the main system’s goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?
Ah, you make this point here:
I would like to point out one aspect of the “Vulnerable ML systems” scenario that the post doesn’t discuss much: the effect of adversarial vulnerability on widespread-automation worlds.
Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).
But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:
Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require stronger-than-existing level of capability to exploit. More realistically, the actual bar for fixing security holes will be “there might be many people who can exploit this, but it is not worth their opportunity cost”. And then we will also not-fix all the holes that we are unaware of, or where the exploitation goes undetected.
These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.
There are several potential threats that have particularly interesting interactions with this setting:
(A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
(B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
(C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
(D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
(E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)
Connection to Cases 1-3: All of this seems very related to how you distinguish between adversarial robustness getting solved before transformative AI, after TAI, or never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is “the first time we get an AI capable of forming a singleton”. This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C, D, E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?
Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]
An idea for increasing the impact of this research: Mitigating the “goalpost moving” effect for “but surely a bit more progress on capabilities will solve this”.
I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds—but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough “coordination point” that would allow us to take needle-shifting actions.
I also suspect there are ways of making this go better. I am not quite sure what they are, but here are some ideas: Making and publishing surveys. Operationalizing all of this better, in particular with respect to the “how much does this actually matter?” aspect. Formulating some memorable “hypothesis” that makes it easier to refer to this in conversations and papers (cf “orthogonality thesis”). Perhaps making some proponents of “the opposing view” make some testable predictions, ideally some that can be tested with systems whose failures won’t be catastrophic yet?