AI infosec: first strikes, zero-day markets, hardware supply chains, adoption barriers
[This is part 1 of a 5-part sequence on security and cryptography areas relevant for AI safety, with parts published and linked here a few days apart.]
AI safety in practice
AI safety in practice relies on the AI system not only being aligned, but also not being able to discover internal computer security vulnerabilities itself, and not being stealable by attackers. The latter problem may be harder than one would assume at first sight. Here are a few reasons why, discussed at length by Mark S. Miller, Christine Peterson, and me in the Defend Against Cyber Threats chapter in Gaming the Future:
First Strike Instabilities
In Information security considerations for AI and the long term future, Jeffrey Ladish and Lennart Heim anticipate that intense competition surrounding the development of AGI will lead to considerable interest from state actors. They assign a high likelihood to advanced threat actors targeting organizations that are involved in AGI development, supply critical resources to AGI companies, or possess strategically important information, in order to gain an advantage in AGI development.
In Defend Against Cyber Threats in Gaming the Future, we suggest that this threat may be worse than often acknowledged, due to ‘first strike instabilities’: if an AGI takeover merely becomes credible and is believed to be imminent, this poses a significant risk in itself. If one nation-state actor becomes aware that another actor is about to develop an AGI capable of taking over the world, it may choose to preemptively destroy it. Critically, this risk exists even if creating an AGI is actually impossible, as long as it is believed to be possible.
It’s possible the attacker would merely try to steal the AGI capability or destroy the AGI-creating actor rather than launch a nation-wide attack. But the nation hosting the AGI-creating actor would likely retaliate, and the attacking nation knows this. Given how high the stakes may be believed to be at that point, the attacker may not want to take that risk. We live in a world where multiple militaries have nuclear weapon delivery capabilities, so public AGI arms races may re-create, even before any AGI exists, game-theoretic dynamics that resemble Cold War nuclear game theory.
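To make the belief-driven nature of this instability concrete, here is a toy expected-value comparison in Python. All payoff numbers and the retaliation probability are invented for illustration; the only point is that ‘strike’ can become the preferred option purely because of a belief about how imminent the rival’s AGI is, regardless of whether AGI is actually achievable.

```python
# Illustrative only: a toy expected-value comparison for a state weighing a
# preemptive strike against a suspected imminent rival AGI project. All payoff
# numbers (and therefore the threshold where the decision flips) are made up;
# the decision turns on *beliefs* about the rival's progress, not on whether
# AGI is actually achievable.

def expected_value_wait(p_rival_agi: float,
                        payoff_rival_wins: float = -100.0,
                        payoff_status_quo: float = 0.0) -> float:
    """Expected payoff of not striking, given belief p that the rival gets AGI."""
    return p_rival_agi * payoff_rival_wins + (1 - p_rival_agi) * payoff_status_quo

def expected_value_strike(p_retaliation: float = 0.5,
                          payoff_retaliation: float = -60.0,
                          payoff_clean_strike: float = -10.0) -> float:
    """Expected payoff of a preemptive strike, factoring in possible retaliation."""
    return p_retaliation * payoff_retaliation + (1 - p_retaliation) * payoff_clean_strike

if __name__ == "__main__":
    for p in [0.1, 0.3, 0.5, 0.7]:
        wait, strike = expected_value_wait(p), expected_value_strike()
        choice = "strike" if strike > wait else "wait"
        print(f"belief that rival AGI is imminent = {p:.1f}: "
              f"wait = {wait:.0f}, strike = {strike:.0f} -> {choice}")
```

With these invented numbers, the preferred option flips from ‘wait’ to ‘strike’ once the believed probability of an imminent rival AGI passes roughly one in three; the exact threshold matters far less than the fact that beliefs alone can drive it.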
If the attacker chooses a cyber-attack, it introduces several relatively new characteristics compared to a kinetic attack:
severe cyber attacks may already be within the capabilities of small nation state actors, while destructive kinetic weapons (especially nukes) are mostly in the hands of large nation state actors
it takes time to figure out where a cyber attack was launched from, while kinetic attacks (especially nukes) are easier to attribute
the attacking nation has an incentive to cover its tracks or to pin the attack on another actor, which is not as simple with standard kinetic weapons (especially nukes)
These differences could make the dynamics resulting from a cyber-attack to steal AGI capabilities potentially more globally destabilizing than those of standard kinetic weapons (including nukes) that we have at least some experience with.
It’s possible that nation state actors move too slowly to realize that the AGI threat is imminent. But it’s also possible that they realize relatively late, realize that they realized late, and gamble on a desperate attempt.
AI security in practice
Computer systems are vulnerable at multiple levels, including hardware, firmware, operating systems, and user behavior (I discuss user interface design in part 2). We have not, to date, figured out how to make these systems reliably secure in practice:
Zero-day exploit capabilities are proliferating via exploit markets
Zero-day exploits with potentially disastrous consequences are proliferating. In This Is How They Tell Me the World Ends, Nicole Perlroth provides a string of zero-day examples:
between 2005 and 2007, Stuxnet, a worm used by the US to target Iranian nuclear facilities, ended up escaping and spreading to places as far apart as Russia, India, Indonesia, Europe, and California.
in 2009, Google discovered it had been breached by elite Chinese hackers who stole information on satellite technology, missiles, aerospace, and nuclear propulsion, and who likely also sought backdoors that would enable China to monitor the Gmail accounts of political dissidents.
in 2014 and 2015, Russia’s Sandworm targeted General Electric software that controlled water treatment facilities, electric grids, and oil and gas pipelines.
in 2015, Chinese hackers were found within the US agency responsible for storing the personal information of all government employees.
in 2016, the Shadow Brokers group published a collection of NSA cyberweapons on the internet, including the EternalBlue zero-day exploit, which enabled North Korea to deploy ransomware that spread to 150 countries and targeted hospital systems.
in 2020, hackers believed to be associated with Russia’s SVR inserted malicious software into the foundational network infrastructure of 30,000 companies, including many Fortune 500 companies and critical parts of the US government, in an attack known as SolarWinds. The compromised targets included the Departments of Homeland Security, Treasury, Commerce, and State, as well as the Energy Department and the National Nuclear Security Administration, which is responsible for maintaining America’s nuclear stockpile.
Nicole Perlroth points out that a main driver of the increase in zero-days is the international market for zero-day exploits. The NSA purchases these exploits from hackers using taxpayer money, not to inform affected companies and prompt them to patch the vulnerabilities, but rather to exploit them for its own purposes. Companies are often unable to match the prices offered by governments, and other nations are increasingly entering these markets.
AI may worsen the short-term offense-defense dynamic
In The Malicious Use of Artificial Intelligence, Miles Brundage and co-authors point out that AI progress may ease existing tradeoffs between the scale and efficiency of attacks. They point to potentially escalating risks due to:
automation of labor-intensive cyberattacks like spear phishing
the emergence of new attack methods that capitalize on human weaknesses (for instance, employing speech synthesis for impersonation)
the exploitation of existing software vulnerabilities (for example, through automated hacking techniques)
the exploitation of vulnerabilities in AI systems themselves (for example, via adversarial examples and data poisoning; a toy sketch of an adversarial perturbation follows this list)
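As a concrete, deliberately toy illustration of the last point, the sketch below perturbs an input to a simple linear ‘detector’ along the sign of its weights, in the spirit of the fast gradient sign method. The synthetic data, the model, and the perturbation budget are all made up; in this low-dimensional toy the required perturbation is not small, whereas against high-dimensional models far smaller perturbations typically suffice.

```python
# Illustrative sketch of an evasion attack on a toy linear classifier, in the
# spirit of the fast gradient sign method. Everything here (the synthetic
# "malicious vs. benign" data, the detector, the perturbation budget) is made
# up to show the mechanics only.
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs in 20 dimensions standing in for benign/malicious samples.
X_benign = rng.normal(loc=-1.0, size=(200, 20))
X_malicious = rng.normal(loc=+1.0, size=(200, 20))
X = np.vstack([X_benign, X_malicious])
y = np.array([0] * 200 + [1] * 200)

# Train a logistic regression "detector" with plain gradient descent.
w, b = np.zeros(20), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))        # predicted probability of "malicious"
    w -= 0.5 * X.T @ (p - y) / len(y)
    b -= 0.5 * np.mean(p - y)

def score(v):
    """Detector's probability that v is malicious."""
    return 1 / (1 + np.exp(-(v @ w + b)))

# Evade the detector: move one malicious sample against the sign of the weights
# (the gradient of its score with respect to the input), just enough to flip
# the decision.
x = X_malicious[0]
eps = 1.1 * (x @ w + b) / np.sum(np.abs(w))   # smallest per-feature step that flips the sign, plus 10%
x_adv = x - eps * np.sign(w)

print(f"per-feature perturbation: {eps:.2f}")
print(f"score before: {score(x):.3f}, after: {score(x_adv):.3f}")
```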
In Cyber, Nano, AGI Risks, Christine Peterson and co-authors add that AI could potentially generate software capable of analyzing target software and detecting previously unknown zero-day vulnerabilities. By integrating cutting-edge vulnerability detection software into the deployed attack system, attackers may soon be able to identify and exploit vulnerabilities while in communication with the target, instead of relying solely on pre-existing attacks against known vulnerabilities.
Security exploits are likely to get worse with AI advances. While most AGI security risks arise from adversarial nation state actors, AI may also lower the barriers to entry for malicious amateurs. In Sparks of Artificial General Intelligence, Sebastien Bubeck and co-authors suggest that GPT-4’s increasing ability to generalize and interact can be harnessed for adversarial uses, such as computer security attacks. In I, Chatbot, the Insikt Group points out that even ChatGPT already has the potential to empower script kiddies with limited programming skills to develop malware; such actors increasingly share ChatGPT-generated exploit proofs-of-concept on the dark web.
There are also cases in which AI may help security, for instance in AI-supported fuzzing (discussed in part 2). Traditionally, fuzzing involves generating a multitude of diverse inputs to an application with the goal of inducing a crash. Because each application accepts inputs in distinct ways, significant manual setup is required, and testing every conceivable input through brute force would be extremely time-consuming. Current fuzzers therefore use randomized inputs and various heuristics to prioritize the most promising candidates.
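Here is a minimal sketch of that basic loop, assuming a toy parse_record() target with a planted bug; real fuzzers such as AFL add input mutation and coverage feedback on top of this:

```python
# A minimal sketch of traditional random ("dumb") fuzzing against a toy parser.
# parse_record() and its planted bug are made up for illustration.
import random
import string

def parse_record(data: str) -> dict:
    """Toy target: parses 'key=value;key=value' records, with a planted bug."""
    record = {}
    for field in data.split(";"):
        if "=" not in field:
            continue                      # tolerate malformed fields
        key, value = field.split("=", 1)
        if "\x00" in value:               # planted bug: unhandled control character
            raise RuntimeError("parser state corrupted")
        record[key] = value
    return record

def random_input(max_len: int = 40) -> str:
    alphabet = string.ascii_letters + string.digits + "=;%\x00"
    length = random.randint(0, max_len)
    return "".join(random.choice(alphabet) for _ in range(length))

crashing_inputs = []
for _ in range(10_000):
    data = random_input()
    try:
        parse_record(data)
    except Exception:
        crashing_inputs.append(data)

print(f"{len(crashing_inputs)} of 10,000 random inputs crashed the parser")
if crashing_inputs:
    print("example crashing input:", repr(crashing_inputs[0]))
```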
It’s possible that AI tools can aid in test case generation and be leveraged after fuzzing to assess whether the discovered crashes are exploitable. In theory, such tools could make it easier for companies to identify potential vulnerabilities in their systems and address them before malicious actors can exploit them. But it’s also possible that malicious actors will have access to the same technology and will soon be able to uncover zero-day vulnerabilities at scale.
If advanced, hard-to-attribute attacks can increasingly be carried out by smaller non-state actors, this would further destabilize the first strike dynamics mentioned earlier.
Hardware supply chain risk
Next to insecure software, hardware supply chain risks are a major, often neglected, factor in computer security. The assurance that a system design is secure only holds if the software is run on the intended hardware. This assumption may not always hold, as the hardware may contain a built-in trapdoor.
In Cyber, Nano, AGI Risks: Decentralized Approaches to Defense, Christine Peterson and co-authors highlight that each step in global supply chains presents an opportunity for compromise. State actors have actively engaged in efforts, like Project Bullrun, to weaken the global cryptography standards that form the foundation of the world’s economy and security. Even when end-to-end encryption is employed and remains resistant to attacks, the hardware performing the encryption can often be easily infiltrated. There are already demonstrations of exploitable trapdoors, built at the analog level, that are extremely hard to detect.
The disclosure of user information by software companies in response to national security letters served by the NSA raises concerns that the agency may have issued similar letters to hardware companies, such as Intel and AMD, demanding the installation of trapdoors in their hardware. These trapdoors could be activated at a later time.
It would be useful to have an open source processor design for which there is a proof of security comparable to that of the seL4 software. seL4 is an operating-system microkernel that not only has a formal end-to-end security proof but was also able to withstand a DARPA red team attack, a feat unmatched by any other software.
There are open source processor designs that, when run on a field-programmable gate array (FPGA), are fast enough to be practical for many applications. These designs could be combined with a layout algorithm that randomizes layout decisions for each hardware instance, making it virtually impossible for corruption of the FPGA hardware to go unnoticed under electron-microscope inspection. This approach could prevent most instances of the processor from being successfully corrupted.
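As a toy illustration of the per-instance randomization idea (not a real place-and-route flow), the sketch below shuffles the mapping from logical cells to physical sites with a per-device seed, so a tamper crafted against one instance’s layout lands on the wrong location in other instances:

```python
# Illustrative sketch only: each hardware instance gets its own seed, so the
# mapping from logical cells to physical FPGA sites differs per device. A
# tamper crafted against one fixed layout would then sit at the wrong physical
# location on almost every other instance. The netlist and fabric are toys.
import random

LOGICAL_CELLS = [f"cell_{i}" for i in range(16)]                # toy netlist
PHYSICAL_SITES = [(x, y) for x in range(8) for y in range(8)]   # toy 8x8 fabric

def randomized_placement(instance_seed: int) -> dict:
    """Assign each logical cell to a physical site, shuffled per instance."""
    rng = random.Random(instance_seed)
    sites = PHYSICAL_SITES.copy()
    rng.shuffle(sites)
    return dict(zip(LOGICAL_CELLS, sites))

# Two manufactured instances of the "same" design:
layout_a = randomized_placement(instance_seed=1)
layout_b = randomized_placement(instance_seed=2)

# A hypothetical tamper targets whatever sits at one fixed physical site in
# instance A; on instance B, a different cell (or nothing) occupies that site.
target_site = layout_a["cell_0"]
hit_on_b = [c for c, s in layout_b.items() if s == target_site]
print("tampered site on A holds cell_0; on B it holds:", hit_on_b or "nothing")
```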
In practice, the techniques currently known for building machines that are credibly correct, such as randomized FPGA layout, are significantly more expensive than simply building correct machines. It’s currently not feasible for any manufacturer to create hardware that is both credibly correct and competitive.
Security adoption barriers are high
In AGI Coordination: Coordination And Great Powers, my co-authors and I point out that a few potentially promising security approaches exist, such as:
Defense in Depth: This strategy is currently used by militaries to protect critical infrastructure, assuming that some systems will inevitably be compromised. It emphasizes learning from discovered compromises and maintaining security for the most vital, in-depth networks. This approach offers the advantage of allowing defenders to gather intelligence about attackers as they breach various firewalls.
Technical Solutions: Technical solutions for security vulnerabilities do exist. For instance, the seL4 microkernel (described above) is a prime example of a seemingly secure operating system. The U.S. Department of Defense has increased funding for seL4, which is a promising development. However, its security still relies on certain unverified assumptions, such as the accuracy of the formal model of the underlying hardware. I will discuss other technical solutions in part 3 of this sequence.
Responsible Disclosure: As a defense measure, responsible disclosure with timelines would involve discovering existing vulnerabilities in the wild and privately disclosing them to affected organizations. These organizations would then be given a specific timeframe to address the vulnerability before it is made public (a minimal sketch of such a timeline follows this list). Responsible disclosure is already standard practice within the cybersecurity community. Given the NSA’s dual mandate of offense and defense, the agency could act as a vulnerability collector and disclosure mechanism. As a start, the government could privately disclose the zero-day vulnerabilities it discovers to US companies, with a timeline for public disclosure and support to fix them.
Adversarial Red Teaming: Both Bitcoin and Ethereum are developing in an environment that is constantly under aggressive attack pressures, as insecure projects are the practical equivalent of a multimillion-dollar cryptocurrency ‘bug bounty.’ When security breaches result in losses, non-bulletproof systems are quickly and visibly eliminated, leaving only seemingly bulletproof systems within these ecosystems. The robust security of these systems is a crucial aspect of their value proposition. These projects undergo a level of adversarial testing that can be a good inspiration for systems capable of withstanding cyberattacks that would devastate conventional software.
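To make the ‘responsible disclosure with timelines’ idea from the list above concrete, here is a minimal sketch of tracking a disclosure embargo. The 90-day window and the record fields are illustrative assumptions, not an established standard:

```python
# A minimal sketch of tracking coordinated-disclosure timelines. The 90-day
# embargo and the record fields are illustrative assumptions.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class VulnerabilityReport:
    identifier: str                      # internal tracking ID
    vendor: str                          # organization privately notified
    reported_on: date                    # date of private disclosure
    embargo_days: int = 90               # time given to ship a fix
    fixed: bool = False

    @property
    def public_disclosure_date(self) -> date:
        return self.reported_on + timedelta(days=self.embargo_days)

    def should_publish(self, today: date) -> bool:
        """Publish once the vulnerability is fixed or the embargo has lapsed."""
        return self.fixed or today >= self.public_disclosure_date

report = VulnerabilityReport("VULN-2023-001", "ExampleCorp", date(2023, 1, 15))
print(report.public_disclosure_date)            # 2023-04-15
print(report.should_publish(date(2023, 3, 1)))  # False: still inside the embargo
```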
The main problem with security is not that there are no good approaches but that a multi-trillion dollar ecosystem is already built on the current foundations. It is very difficult to get adoption for something that needs to rebuild the entire ecosystem from scratch.
In the Defend Against Cyber Threats chapter in Gaming the Future, we propose that one potential approach to overcoming the significant barriers to adoption of a new, more secure system is to pursue a ‘genetic takeover’ strategy (a term borrowed from biology). This involves growing a new system within the existing one, without directly challenging it. The new system can coexist with the current one, functioning in a world dominated by it, and gradually become competitive until the old system becomes obsolete. For instance, the new personal computing ecosystem began as a complement to the old system before eventually displacing it.
For a genetic takeover to start, we may have to count on a non-catastrophic computer security event spooking us sufficiently to act. However, judging from how poorly we reacted to the existing cyber attacks mentioned above, this seems unlikely. It may be more likely that the panic following a major attack will be channeled into propping up entrenched techniques that amount to little more than window dressing, unless actually secure approaches can be scaled and made economically competitive in time.
The lack of computer security may already be a catastrophic risk in itself
The computer security problem is already very bad, even without AI. Civilization is currently built upon foundations that are not only insecure but arguably insecurable. This creates a severe risk to humanity since so much of our infrastructure, such as the electric grid and internet, relies on these insecure systems. Because vulnerabilities are networked, local vulnerabilities compound into regional vulnerabilities, which compound into international vulnerabilities, increasing the risk of large-scale attacks.
If these systems were to be severely compromised, this could lead to potentially catastrophic failures of critical infrastructure. If severe offense capabilities proliferate to non-state actors, making attribution of attacks even harder, it could also sharpen adversarial dynamics across nation-states in the dangerous ways described above.
Computer security as unmet necessary condition for AI safety
Some AGI takeover scenarios involve the AGI breaking out of its confinement by bribing, blackmailing, or otherwise manipulating a human or a group of humans into releasing it onto the internet. This framing implies it couldn’t effectively break out without human help. But if we are concerned about humans (aided by AI) exploiting the software and hardware vulnerabilities discussed throughout this section – and we should be – we should be even more concerned about what an AGI may be capable of on its own. Worrying mainly about a social manipulation breakout scenario assumes a level of computer security that we don’t actually have. We could consider computer security a necessary condition for AI safety that we currently don’t meet.
[This is part 1 of a 5-part sequence on security and cryptography areas relevant for AI safety. Continue to part 2 on parallels between AI safety and the security mindset and problems.]