Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recentwork has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable.
Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default).
In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.
Alignment Agenda
Main System’s Ability to Exploit Helper
Main System’s Incentive to Exploit Helper
Risk of Exploit
RL on learned reward model (e.g. RLHF, IRL)
Medium
High
High
Scalable oversight (e.g. recursive reward modeling, AI safety via debate)
Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed.
We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another).
We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach. By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1]but with only modest progress (low to medium tractability).
In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by.
Alignment agendas need robustness
Most alignment schemes implicitly depend on some aligned helper ML systems that can assist us in training or validating a main ML system, like a proto-AGI. In this section we assume the helper is exploitable, and explore the problems this causes for these alignment schemes. We justify this assumption this assumption in the next section.
A helper is more likely to be exploited if the main system is incentivized to exploit it, such as by the helper directly providing a training signal the main system is being optimized for, or if the main system has gained situational awareness. Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization. In the best case, we might even be able to (as in the case of IDA) control the optimization pressure to explicitly avoid inputs that might be adversarial.
The likelihood of an exploit also depends on the ability of the main system. This includes the computational resources available to develop an exploit, which could range from just forward-passes of the main system’s network to the main system having access to a large compute cluster. It also depends on how much information the main system has about the helper, which could range from merely a high-level knowledge of possible evaluation techniques (black box), through to query access to the helper (gray box), through to full access to its weights and source code (white box). We can’t rely on security by obscurity: many zero-shot black-box adversarial attacks have been conducted by human researchers, and superintelligent ML systems might be even more creative. Still, even superintelligent ML systems are not omnipotent, and in practice the “adversary” might be a relatively stupid optimization process like SGD.
Referring back to Table 1, one class of alignment agendas that is particularly sensitive to lack of robustness is those where the helper is an overseer providing feedback to the main system, such as in RLHF and most scalable oversight proposals. In this case, exploiting the helper is part of the training objective of the main system, explaining the ubiquity of rewardhacking.
Corrective feedback from a human overseer can help mitigate reward hacking when the reward model is trained online, but there remains a significant risk that the main system hacks the helper more rapidly than a human is able to provide corrective feedback. Alternatively, the main system may simply learn to trick the human. Scalable oversight methods are designed to guard against tricking the human, but cannot protect against scenarios that are adversarial to both the human and scalable oversight model.
Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. Unfortunately, it also maxes out at around human performance.
Imitative Iterated Distillation and Amplification (IDA) is less sensitive to exploitability. IDA is a proposal to take an imitation learned model and amplify it beyond human-level performance. Casting it in our framework, the helper is a base imitation-learned model that we presume is safe and aligned. The main system is the model produced by IDA on the helper. At the first step, the helper is amplified by giving it the ability to ask questions to copies of itself. Critically, the helper has an incentive to avoid asking questions that are likely to cause it to fail. In this sense, IDA channels the optimization pressure away from adversarial exploitation, unlike e.g. vanilla RL.
The flipside is that the helper also likely has a high ability to exploit itself owing to this introspective knowledge of its own vulnerabilities. [2] This could result in some instability in the training scheme: if at any point a sub-part of the question tree is adversarial, the response could propagate to “infect” the rest of the question tree. This is related to the problem of security amplification.
One of the less concerning cases is for tools used only to evaluate, and not train a model, as is often the case for adversarial testing or transparency. Here, the main system has no direct incentive to fool the helper as it does not provide a training signal. However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters. This would result in problems similar to those observed today with researchers overfitting to the test set.
Even if we only used these helper methods as a one-off final check on the main system, the main system might still be incentivised to fool these evaluation methods if it has situational awareness and long-term objectives. However, it will be a challenging exploit to pull off, as the main system may have only partial knowledge of the design of the helper (e.g. what tests the helper will perform) and cannot engage in direct trial-and-error exploitation of it. Additionally, transparency tools are typically applied to the weights of the main system, which the main system can at best only indirectly control (by gradient hacking). [3] Unfortunately, evaluation methods are not themselves solutions to alignment, but can provide an additional safeguard to avoid deployment of malign systems.
Contemporary ML systems are exploitable by default
Our best guess is that all contemporary ML systems solving non-trivial tasks are exploitable by a moderately resourced adversary. ℓp-norm adversarial examples in image classifiers were first described by Szegedy et al. in 2013, and nearly a decade later state-of-the-art image classifiers remain vulnerable despite intense research interest in adversarial defenses. These vulnerabilities can be exploited in real-world settings by physical adversarial attacks, and there are even naturally occurring images that are challenging for a wide variety of models. Moreover, analogous issues have been found in a diverse range of ML systems including languagemodels, graph analysis, robotic policies and superhuman Go programs.
To the best of our knowledge, no ML system solving a non-trivial problem has ever withstood a well-resourced attack.[4] Adversarial defenses can be divided into those that are broken, and those that have not yet attracted concerted effort to break them. This should not be too surprising: the same could be said of most software systems in general.
One difference is that software security has notably improved over time. Although there almost certainly exist remote root exploits in most major operating systems, finding one is decidedly non-trivial, and is largely out of reach of most attackers. By contrast, exploiting ML systems is often alarmingly easy.
This is not to say we haven’t made progress. There has been an immense amount of work defending against ℓp-norm adversarial examples, and this has made attacks harder: requiring more sophisticated methods, or a larger ℓp-norm perturbation. For example, a state-of-the-art (SOTA) method DensePure achieves 77.8% certified accuracy on ImageNet for perturbations up to 0.5/255 ℓ2-norm. However, this accuracy is still far behind the SOTA for clean images, which currently stands at 91.0% top-1 accuracy with CoCa. Moreover, the certified accuracy of DensePure drops to 54.6% at a 1.5/255 ℓ2-norm perturbation – which is visually imperceptible to humans. This is well below the 62% achieved by AlexNet back in 2012.
There is substantial evidence for a trade-off between accuracy and robustness. Tsipras et al (2019) demonstrate this trade-off theoretically in a simplified setting. Moreover, there is ample empirical evidence for this. For example, DensePure was SOTA in 2022 for certified accuracy on adversarial inputs but achieved only 84% accuracy on clean images. By contrast, non-robust models achieved this accuracy 4 years earlier such as AmoebaNetA in 2018. There appears to therefore be a significant “robustness tax” to pay, analogous to the alignment tax.[5]
In addition to certified methods such as DensePure, there are also a variety of defense methods that provide empirical protection against adversarial attack but without provable guarantees. However, the protection they provide is partial at best. For example, a SOTA method DiffPure achieves 74% accuracy on clean images in ImageNet but only 43% accuracy under a 4⁄255 ℓ∞-norm perturbation. There is also a significant robustness tax here: Table 5 from the DiffPure paper shows that accuracy on clean images drops from 99.43% on CelebA-HQ to 94% with the diffusion defense.
Although the ubiquitous presence of adversarial examples in contemporary ML systems is concerning, there is one glimmer of hope. Perhaps these adversarial examples are merely an artifact of the ML systems being insufficiently capable? Once the system reaches or surpasses human-level performance, we might hope it would have learned a set of representations at least as good as that of a human, and be no more vulnerable to adversarial attack than we are.
Unfortunately, recent work casts doubt on this. In Wang et al (2022), we find adversarial policies that beat KataGo, a superhuman Go program. We trained our adversarial policy with less than 14% of the compute that KataGo was trained with, but wins against a superhuman version of KataGo 97% of the time. This is not specific to KataGo: our exploit transfers to ELF OpenGo and Leela Zero, and in concurrent work from DeepMind Timbers et al (2022) were able to exploit an in-house replica of AlphaZero.
Of course, results in Go may not generalize to other settings, but we chose to study Go because we expected the systems to be unusually hard to exploit. In particular, since Go is a zero-sum game, being robust to adversaries is the key design objective, rather than merely one desiderata amongst many. Additionally, KataGo and AlphaZero use Monte-Carlo Tree Search coupled with a neural network evaluation. In general, we would expect search (which is provably optimal in the limit) to be harder to exploit than neural networks alone, and although search does make the system harder to exploit we are able to attack it even up to 10 million visits – far in excess of the threshold needed for superhuman performance, and well above the level used in most games.
There remains a possibility that although narrowly superhuman systems are vulnerable, more general systems might be robust. Large language models are the most general systems we have today, yet work by Ziegler et al (2022) find they are still exploitable even after significant adversarial training. Moreover, the existence of apparently fundamental tradeoffs between accuracy and robustness suggests that the most capable AI systems at any given time may be particularly likely to be vulnerable (Tsipras et al, 2019; Tramer et al, 2019).
Of course, at some point systems might be developed that are adversarially robust. This could be by “overshooting” on capability and generality, and then paying a robustness tax to get a suitably capable or general but robust system. Alternatively, new techniques might be developed that reduce or eliminate the robustness tax. Most optimistically, it is possible that general, human-level systems are naturally robust even though generality or human-level performance on their own are insufficient. In the next section, we will consider different possibilities for when and if adversarially robust systems are developed, and the implications this has for safety.
Future trajectories for robustness
We will consider three possible cases:
We solve adversarial robustness beforetransformative AI is developed;
We solve it after transformative AI is developed;
It is never solved.
Although coarse-grained, we believe this case split captures the most important distinctions.
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes. This is intended to be a low bar. In particular, this definition tolerates bounded errors. For example, we would tolerate threat actors being able to permanently trick AI systems into giving them 10% more resources in a trade. We’d also tolerate threat actors being able to temporarily divert even the majority of the AI’s resources, so long as this did not lead to permanent negative effects and that attackers eventually run out of such exploits.
We summarize our subjective credence in each of the cases below, and explore the cases qualitatively in the following sections.
Before you read further, consider estimating your credence for the following question, corresponding to Case 1:
AI Safety in a World of Vulnerable Machine Learning Systems
Link post
Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable.
Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default).
In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.
Exploit Helper
AI safety via debate)
Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed.
We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another).
We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach. By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1]but with only modest progress (low to medium tractability).
In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by.
Alignment agendas need robustness
Most alignment schemes implicitly depend on some aligned helper ML systems that can assist us in training or validating a main ML system, like a proto-AGI. In this section we assume the helper is exploitable, and explore the problems this causes for these alignment schemes. We justify this assumption this assumption in the next section.
A helper is more likely to be exploited if the main system is incentivized to exploit it, such as by the helper directly providing a training signal the main system is being optimized for, or if the main system has gained situational awareness. Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization. In the best case, we might even be able to (as in the case of IDA) control the optimization pressure to explicitly avoid inputs that might be adversarial.
The likelihood of an exploit also depends on the ability of the main system. This includes the computational resources available to develop an exploit, which could range from just forward-passes of the main system’s network to the main system having access to a large compute cluster. It also depends on how much information the main system has about the helper, which could range from merely a high-level knowledge of possible evaluation techniques (black box), through to query access to the helper (gray box), through to full access to its weights and source code (white box). We can’t rely on security by obscurity: many zero-shot black-box adversarial attacks have been conducted by human researchers, and superintelligent ML systems might be even more creative. Still, even superintelligent ML systems are not omnipotent, and in practice the “adversary” might be a relatively stupid optimization process like SGD.
Referring back to Table 1, one class of alignment agendas that is particularly sensitive to lack of robustness is those where the helper is an overseer providing feedback to the main system, such as in RLHF and most scalable oversight proposals. In this case, exploiting the helper is part of the training objective of the main system, explaining the ubiquity of reward hacking.
Corrective feedback from a human overseer can help mitigate reward hacking when the reward model is trained online, but there remains a significant risk that the main system hacks the helper more rapidly than a human is able to provide corrective feedback. Alternatively, the main system may simply learn to trick the human. Scalable oversight methods are designed to guard against tricking the human, but cannot protect against scenarios that are adversarial to both the human and scalable oversight model.
Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. Unfortunately, it also maxes out at around human performance.
Imitative Iterated Distillation and Amplification (IDA) is less sensitive to exploitability. IDA is a proposal to take an imitation learned model and amplify it beyond human-level performance. Casting it in our framework, the helper is a base imitation-learned model that we presume is safe and aligned. The main system is the model produced by IDA on the helper. At the first step, the helper is amplified by giving it the ability to ask questions to copies of itself. Critically, the helper has an incentive to avoid asking questions that are likely to cause it to fail. In this sense, IDA channels the optimization pressure away from adversarial exploitation, unlike e.g. vanilla RL.
The flipside is that the helper also likely has a high ability to exploit itself owing to this introspective knowledge of its own vulnerabilities. [2] This could result in some instability in the training scheme: if at any point a sub-part of the question tree is adversarial, the response could propagate to “infect” the rest of the question tree. This is related to the problem of security amplification.
One of the less concerning cases is for tools used only to evaluate, and not train a model, as is often the case for adversarial testing or transparency. Here, the main system has no direct incentive to fool the helper as it does not provide a training signal. However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters. This would result in problems similar to those observed today with researchers overfitting to the test set.
Even if we only used these helper methods as a one-off final check on the main system, the main system might still be incentivised to fool these evaluation methods if it has situational awareness and long-term objectives. However, it will be a challenging exploit to pull off, as the main system may have only partial knowledge of the design of the helper (e.g. what tests the helper will perform) and cannot engage in direct trial-and-error exploitation of it. Additionally, transparency tools are typically applied to the weights of the main system, which the main system can at best only indirectly control (by gradient hacking). [3] Unfortunately, evaluation methods are not themselves solutions to alignment, but can provide an additional safeguard to avoid deployment of malign systems.
Contemporary ML systems are exploitable by default
Our best guess is that all contemporary ML systems solving non-trivial tasks are exploitable by a moderately resourced adversary. ℓp-norm adversarial examples in image classifiers were first described by Szegedy et al. in 2013, and nearly a decade later state-of-the-art image classifiers remain vulnerable despite intense research interest in adversarial defenses. These vulnerabilities can be exploited in real-world settings by physical adversarial attacks, and there are even naturally occurring images that are challenging for a wide variety of models. Moreover, analogous issues have been found in a diverse range of ML systems including language models, graph analysis, robotic policies and superhuman Go programs.
To the best of our knowledge, no ML system solving a non-trivial problem has ever withstood a well-resourced attack. [4] Adversarial defenses can be divided into those that are broken, and those that have not yet attracted concerted effort to break them. This should not be too surprising: the same could be said of most software systems in general.
One difference is that software security has notably improved over time. Although there almost certainly exist remote root exploits in most major operating systems, finding one is decidedly non-trivial, and is largely out of reach of most attackers. By contrast, exploiting ML systems is often alarmingly easy.
Figure 1: A typographic attack enables a no-code exploit of OpenAI Clip. More examples.This is not to say we haven’t made progress. There has been an immense amount of work defending against ℓp-norm adversarial examples, and this has made attacks harder: requiring more sophisticated methods, or a larger ℓp-norm perturbation. For example, a state-of-the-art (SOTA) method DensePure achieves 77.8% certified accuracy on ImageNet for perturbations up to 0.5/255 ℓ2-norm. However, this accuracy is still far behind the SOTA for clean images, which currently stands at 91.0% top-1 accuracy with CoCa. Moreover, the certified accuracy of DensePure drops to 54.6% at a 1.5/255 ℓ2-norm perturbation – which is visually imperceptible to humans. This is well below the 62% achieved by AlexNet back in 2012.
There is substantial evidence for a trade-off between accuracy and robustness. Tsipras et al (2019) demonstrate this trade-off theoretically in a simplified setting. Moreover, there is ample empirical evidence for this. For example, DensePure was SOTA in 2022 for certified accuracy on adversarial inputs but achieved only 84% accuracy on clean images. By contrast, non-robust models achieved this accuracy 4 years earlier such as AmoebaNetA in 2018. There appears to therefore be a significant “robustness tax” to pay, analogous to the alignment tax.[5]
In addition to certified methods such as DensePure, there are also a variety of defense methods that provide empirical protection against adversarial attack but without provable guarantees. However, the protection they provide is partial at best. For example, a SOTA method DiffPure achieves 74% accuracy on clean images in ImageNet but only 43% accuracy under a 4⁄255 ℓ∞-norm perturbation. There is also a significant robustness tax here: Table 5 from the DiffPure paper shows that accuracy on clean images drops from 99.43% on CelebA-HQ to 94% with the diffusion defense.
To make matters worse, real attackers have a much broader range of possible attacks outlined by Gilmers et al (2018), such as rotating images, perturbing physical parameters in rendered images, adversarially selecting images from a real-world dataset, adversarial patches, single-pixel attacks and latent adversarial perturbations. We would like to be robust to all these attacks, but there appears to be fundamental trade-offs between robustness to different attacks, with Tramer et al (2019) showing such a trade-off between different types of ℓp-bounded and spatial perturbations. Moreover, there are currently no effective methods to defend against unrestricted adversarial examples outside of toy settings.
Although the ubiquitous presence of adversarial examples in contemporary ML systems is concerning, there is one glimmer of hope. Perhaps these adversarial examples are merely an artifact of the ML systems being insufficiently capable? Once the system reaches or surpasses human-level performance, we might hope it would have learned a set of representations at least as good as that of a human, and be no more vulnerable to adversarial attack than we are.
Unfortunately, recent work casts doubt on this. In Wang et al (2022), we find adversarial policies that beat KataGo, a superhuman Go program. We trained our adversarial policy with less than 14% of the compute that KataGo was trained with, but wins against a superhuman version of KataGo 97% of the time. This is not specific to KataGo: our exploit transfers to ELF OpenGo and Leela Zero, and in concurrent work from DeepMind Timbers et al (2022) were able to exploit an in-house replica of AlphaZero.
Of course, results in Go may not generalize to other settings, but we chose to study Go because we expected the systems to be unusually hard to exploit. In particular, since Go is a zero-sum game, being robust to adversaries is the key design objective, rather than merely one desiderata amongst many. Additionally, KataGo and AlphaZero use Monte-Carlo Tree Search coupled with a neural network evaluation. In general, we would expect search (which is provably optimal in the limit) to be harder to exploit than neural networks alone, and although search does make the system harder to exploit we are able to attack it even up to 10 million visits – far in excess of the threshold needed for superhuman performance, and well above the level used in most games.
There remains a possibility that although narrowly superhuman systems are vulnerable, more general systems might be robust. Large language models are the most general systems we have today, yet work by Ziegler et al (2022) find they are still exploitable even after significant adversarial training. Moreover, the existence of apparently fundamental tradeoffs between accuracy and robustness suggests that the most capable AI systems at any given time may be particularly likely to be vulnerable (Tsipras et al, 2019; Tramer et al, 2019).
Of course, at some point systems might be developed that are adversarially robust. This could be by “overshooting” on capability and generality, and then paying a robustness tax to get a suitably capable or general but robust system. Alternatively, new techniques might be developed that reduce or eliminate the robustness tax. Most optimistically, it is possible that general, human-level systems are naturally robust even though generality or human-level performance on their own are insufficient. In the next section, we will consider different possibilities for when and if adversarially robust systems are developed, and the implications this has for safety.
Future trajectories for robustness
We will consider three possible cases:
We solve adversarial robustness before transformative AI is developed;
We solve it after transformative AI is developed;
It is never solved.
Although coarse-grained, we believe this case split captures the most important distinctions.
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes. This is intended to be a low bar. In particular, this definition tolerates bounded errors. For example, we would tolerate threat actors being able to permanently trick AI systems into giving them 10% more resources in a trade. We’d also tolerate threat actors being able to temporarily divert even the majority of the AI’s resources, so long as this did not lead to permanent negative effects and that attackers eventually run out of such exploits.
We summarize our subjective credence in each of the cases below, and explore the cases qualitatively in the following sections.
Before you read further, consider estimating your credence for the following question, corresponding to Case 1: