(the “AI immune system”) The whole internet — including space satellites and the internet-of-things — becomes way more secure, and includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
Define “way more secure”. Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
Can you talk a bit about the world global dictatorship running the electromagnetic pulse emitters, and how they monitor every computer in the world? What sort of violence do you envision being inflicted on any countries who don’t want to submit their computers for monitoring? Is part of the plan to use AI drones to kill any political leaders who oppose this plan, so as to minimize civilian casualties? Who controls these AI drones, are we quite sure this world dictatorship stays friendly to its citizens? A lot of political processes leading to such a thing sound like they could potentially be scary.
I said “burn all GPUs” to be frank about these things being scary. It’s easy for things to sound less scary when they’re vague and the processes leading up to them are left vague. See also, George Orwell, “Politics and the English Language”. We can’t evaluate whether you have a less scary proposal until you make a less vague one.
I am only replying to the part of this post about hardware vulnerabilities.
> Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
There are dozens of hardware vulnerabilities that exist primarily to pad security researchers’ bibliographies.
Rowhammer, like all of these vulnerabilities, is viable if and only if the following conditions are met:
1. You know the exact target hardware.
2. You also know the exact target software, like the OS, running on that hardware.
3. You need to know what in RAM you’re trying to flip, and where to target to do so, like a page table or some type of bit for the user’s current access level.
4. The target needs to actually execute your code.
   - In attacks where security researchers pat themselves on the back for pulling off Rowhammer remotely, it’s because they use JavaScript or WebGL in a browser, and then pwn devices that use browsers. This is a flaw almost entirely reserved for general-purpose compute hardware, because embedded software or other application-specific hardware/software systems don’t need or have browsers in them.
   - In all other attacks, it involves downloading & executing a program on the target machine. Normally the example is given with cloud VMs whose entire purpose is to run code from an external source. Again, this is reserved for general-purpose compute hardware, because systems that execute code only out of read-only memory will not be able to execute an attacker’s code.
5. There is time on the target to run the Rowhammer attack uninterrupted. It relies on a continuous and degenerate set of instructions. This can be anywhere from minutes to days of time. This means that systems that don’t give uninterrupted time to external code are also not vulnerable.
6. The target OS, or other software, on the system needs to not perform any type of RAM scrubbing. There are papers claiming that variants of Rowhammer work for systems that use ECC + scrubbing, but those papers also assume that the scrubbing happens over hours. If a system has very little RAM, like an embedded processor, it is feasible for hardware to scrub RAM far faster than that.
   - These exploits also add the requirement that the attacker needs to know the exact target RAM and ECC algorithm.
7. The target hardware/software system needs to not have any hardware level redundancy. You can’t rowhammer a system that has 3 separate computers that compare & vote their state on a synchronized clock. Hardware vulnerabilities are probabilistic attacks. They don’t work anymore if the attack must occur simultaneously and identically on separate physical systems. This is another reason why we’re able to build systems that function despite hostile environments where bit flips are routine, e.g. high radiation.
8. The target needs to not crash. Semi-randomly flipping bits in roughly the right location in RAM is not something that most software is designed to handle, and in an overwhelming number of cases, trying to execute this attack will crash the system.
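To make the redundancy condition in the list above concrete, here is a minimal sketch of majority voting across three replicas. It is only an illustration: `vote` and `flip_bit` are hypothetical names, the replica "state" is just a byte string, and a real system would do this comparison in hardware on a synchronized clock rather than in Python.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(states: Sequence[bytes]) -> Optional[bytes]:
    """Return the state reported by a strict majority of replicas, else None."""
    if not states:
        return None
    value, count = Counter(states).most_common(1)[0]
    return value if count * 2 > len(states) else None


def flip_bit(state: bytes, bit_index: int) -> bytes:
    """Simulate a single bit flip (Rowhammer or radiation) in one replica's state."""
    corrupted = bytearray(state)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


good = b"\x00\x01\x02\x03"

# One replica is corrupted; the other two outvote it.
assert vote([good, flip_bit(good, 5), good]) == good

# Two replicas are corrupted in different places: no majority, so the fault is detected.
assert vote([flip_bit(good, 1), flip_bit(good, 9), good]) is None
```

The only way to get a corrupted state past the voter in this sketch is to produce the identical flip in at least two replicas within the same voting cycle, which is the "simultaneously and identically" requirement described above.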
It’s not that Rowhammer isn’t possible in the sense that it cannot be shown to work, but it’s like this paper showing that you can create WiFi signals in air-gapped computers. Or this fun paper for Nethammer showing novel attacks that don’t require code execution on the target machine, except they also don’t allow for controlling where bit flips occur, so the “attack” is isomorphic to an especially hostile radiation environment with a high likelihood of bit-flips, and it relies on the ability for the attacker to swarm the target system with a high volume (500 Mbps?) of network traffic that they control—a network switch that drops unexpected traffic or even just rate-limits it will defeat Nethammer. Note that rate-limiting network traffic is in fact standard practice for high stability systems, because it’s also a protection against much more mundane denial-of-service attacks.
Consumer systems are vulnerable to attacks, because consumer systems don’t care about stability. Consumers want to have a fast network connection to the internet. There’s no requirement, or need, for that to be true on a system designed for stability, like something in a satellite, or some other safety-critical role. It is possible to have systems that are effectively “not able to be hacked”—they don’t use general-purpose hardware, they don’t have code that can be modified, they have no capability for executing external code, they include hardware level fault tolerance and redundancy, and they have exceptionally limited I/O. It doesn’t require us presuming “superhuman-at-security AGIs” exist to design these systems.
Every few weeks researchers publish papers carefully documenting the latest side-channel attacks that result in EVERYTHING EVERYWHERE BEING VULNERABLE FOREVER, and every few weeks attackers continue to do the boring old thing of leaving USB drives lying around for a target to pwn themselves, or letting the target just download the malware directly to their machine. They’re almost all just remixes too—it’s “here’s how to do exploit A (the original), but on hardware systems that implemented mitigations for C, D, and E”. Except exploit A still has all of the preconditions I listed above, and now you’ve got whatever new preconditions the security researchers have in their latest paper.
This is an awesome comment, I think it would be great to make it a top-level post. There’s a Facebook group called “Information Security in Effective Altruism” that might also be interested.
Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable? (Because it has to take many inputs from the world, and has to be a general computer, not a tiny / hardwired thing.)
> Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable?
I can’t comment on this, because I have no idea what this system would even do, or what it would be looking for, or who would be running it.
What indicates a rogue AI? How is that differentiated from human hackers? How is that different from state-level actors like the NSA? How is it different from some white hat security researchers doing pen-testing in the wild for questionable reasons? What “behavior” is the system trying to look for?
How is it monitoring? What inputs is it taking? Is it only looking at internet traffic? How? Where is it tapping for those packets? There’s a lot of internet traffic. Is this even “a computer”, or is it “an entire facility of computers”? Why are there no humans involved in this process? Where do the logs, monitoring, or other telemetry go if this is done “without humans”? Wouldn’t this system be reporting some type of digest or summary for review? Why do you think that it would need to be a general computer? Do you think that high-performance SoCs do not exist, or that it would not be possible to have a distributed system containing multiple SoCs with the equivalent RAM and compute of some general-purpose machine?
What would it mean to hack this system? Just that it doesn’t report any rogue AI anymore? Or that it erroneously reports the presence of rogue AI? What would hacking it to not report rogue AI look like? Are we concerned that it would log the presence of a rogue AI, but then the rogue AI would be able to wipe that entry from some log file, like a movie? Or are we thinking that the rogue AI is going to disable this system before it is even noticed by the system? Do we count “turning off the power” as “hacking”?
The question here (in this subthread) is whether we could make a system that would shut down computers “appearing to be running rogue AI agents.”
There’s a question of what such a system could look like.
> Why are there no humans involved in this process?
Presumably humans would be involved. I was pretty imprecise above; what I meant to ask was:
Suppose there’s a system that can shut down computers running rogue AIs. If this actually stops rogue AI, that’s because there’s a vast amount of monitoring going on. The bulk of that vast amount of monitoring can’t be borne by humans. So there’s a lot of monitoring that’s being done by computers. Those computers are doing something seemingly pretty non-trivial, i.e. detecting AI / rogue AI / hacking (like, we don’t currently know how to do it). Because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker? Clearly with some work you could make a perfect tic-tac-toe player that simply is not hackable without physically tampering with it. It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable. But maybe that’s too vague an analysis… Thoughts?
> What would it mean to hack this system?
Yeah, hack it to not raise the alarm during the time period where the AI can be shut off by the EMP being activated. Then the AI disables some EMPs, or arranges or finds some computers to be not in range of EMPs or shielded from the EMPs, or creates or finds computers intrinsically not affected by EMPs (like nanotech mechanical computers or DNA computers, or something less crazy / easier that I didn’t think of), or gains control over the decision-making process for whether to activate the EMPs, or intercepts signals to the EMPs and spoofs the “successfully fired” signals. (Assuming that the AI detection scheme isn’t mooted by any of these already having happened while the AI still looked like what it was supposed to look like—just another AI project—but that already had enough resources for the AI to get a decisive advantage without spreading noticeably.)
> includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
This just seems like one of those things people say, in the same vein as “melt all of the GPUs”. I think that non-nuclear EMPs are still based on chemical warheads. I don’t know if a “pulse emitter” is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
> includes a distributed network of non-nuclear ICBMs that will be physically shot at any target believed to be running a rogue AI agent
and then we can put an asterisk on the word “ICBM” and say it’ll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, doesn’t it? It makes me wonder how the system defines “appears to be running rogue AI agents”, because now I wonder what the % chance of false positives is—since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they’re not already on board with this plan.
> because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker?
It’s not really about “tasks”, it’s about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they’re connected to the internet to check for updates, or they send telemetry, or they’re hosting a Minecraft server with log4j.
> It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable.
When you’re analyzing the security of a system, what you’re looking for is “what can the attacker control?”
If the attacker can’t control anything, the system isn’t vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It’s generally safe to assume that if an attacker has physical access to a machine, you’re compromised.[1] So first, we don’t want the attacker to have physical access to these computers. That means they’re in a secure facility, with guards, and badges, and access control on doors, just like you’d see in a tech company’s R&D lab.
That leaves remote attacks. These generally come in two forms:
1. The attacker tricks you into downloading and running some compromised software. For example, visiting a website with malicious JavaScript, or running some untrusted executable you downloaded because it was supposed to be a cheat engine for a video game but it was actually just a keylogger, or the attacker has a malicious payload in a seemingly innocent file type like a Word document or PDF file and it’s going to exploit a bug in the Word program or Adobe Acrobat program that tries to read that file.
2. The attacker sends network traffic to the machine which is able to compromise the machine in some way, generally by exploiting open ports or servers running on the target machine.
All of the attacks in (1) fall under the “when you run untrusted code, you will get pwned” umbrella. There are a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it’s just like rearranging deck chairs on the Titanic. It doesn’t matter what you do, someone else is going to find a side channel attack and ruin your day if you let them run code on your machine. So don’t do that. This is actually easier than you might think: plenty of systems are “secure” because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they’re effectively static. The software image is flashed to some SoC’s read-only memory (ROM) by an external debugger[2], and there’s no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There are no user accounts, or “default applications”, or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker’s traffic gets dropped, they can’t control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the “expected” traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.[3]
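As a rough sketch of the "expected traffic" path described above: a fixed-size binary frame, an authentication check that drops anything not produced with the shared key, and a token-bucket rate limiter. Everything here is assumed for illustration. The field layout, `seal`, `open_frame`, and `TokenBucket` are hypothetical names, and HMAC-SHA256 from the Python standard library stands in for authenticated encryption (it authenticates the frame but does not encrypt it). In a real system the key would live in a hardware security module and the filtering would happen in networking hardware, not in application code.

```python
import hashlib
import hmac
import struct
from typing import Optional, Tuple

# Hypothetical schema: sequence number (u32), message type (u16), value (u16).
BODY = struct.Struct(">IHH")
TAG_LEN = 32                       # HMAC-SHA256 digest size in bytes
FRAME_LEN = BODY.size + TAG_LEN    # every valid frame is exactly this long


def seal(key: bytes, seq: int, msg_type: int, value: int) -> bytes:
    """Pack the fields and append an authentication tag."""
    body = BODY.pack(seq, msg_type, value)
    return body + hmac.new(key, body, hashlib.sha256).digest()


def open_frame(key: bytes, frame: bytes) -> Optional[Tuple[int, int, int]]:
    """Return the decoded fields, or None if the frame must be dropped."""
    if len(frame) != FRAME_LEN:
        return None
    body, tag = frame[:BODY.size], frame[BODY.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return BODY.unpack(body)


class TokenBucket:
    """Allow at most `burst` frames at once, refilled at `rate` frames per second."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


key = b"k" * 32  # placeholder key for the example only
frame = seal(key, seq=1, msg_type=2, value=3)
assert open_frame(key, frame) == (1, 2, 3)
assert open_frame(key, frame[:-1] + b"\x00") is None   # bad tag: dropped
assert open_frame(key, frame + b"\x00") is None        # wrong length: dropped
```

The sketch leaves out replay protection; a receiver would also need to reject sequence numbers it has already seen.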
With that in mind, I hope it is clear that the problem isn’t about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we’re discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it’s evidence of a rogue AI.[4] We can’t just drop everything in hardware and ignore messages that don’t fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I’d probably approach it with the following thoughts:
I’d design the system so that the traffic to be inspected is distributed across a cluster of computers. This is both to prevent denial-of-service attacks and to prevent targeted attacks like Nethammer. If the system is receiving traffic in excess of what the cluster is designed to handle, it’ll be dropped—while this represents a degradation in monitoring capability, it can be fixed by adding additional machines to the cluster as needed. It’s often useful to gracefully degrade capabilities when a system is under load.
The computers in the cluster that receive traffic have a simple role: send an alert message to a central supervisor if suspicious traffic is detected.
The alert message uses authenticated encryption.
The central supervisor is only communicated with via authenticated encryption messages on a local network. It is not connected to the wider network and it is not possible for the attacker to send traffic to the supervisor machine, unless they first compromised one of the cluster machines.
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let’s fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We’ll define a binary schema for that information and we’ll stick a serial device or FPGA onto each cluster machine.
The cluster machines send the alert message by setting the relevant fields of the message via the serial device or FPGA registers, and it’s that device which sends the actual alert. Now if the cluster machine is compromised, all the attacker can do is change fields in the alert message, but they can’t send different messages entirely.[5]
We test the central supervisor to confirm that any permutation of the alert message is still safe to receive. For testing permutations, we normally use a method based on equivalence classes.
So now attackers can maybe compromise cluster machines because they control the traffic that we’re trying to monitor, but they can’t use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn’t let you change the traffic sent to a supervisor in a way that would be dangerous.
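A minimal sketch of what the fixed alert schema and the equivalence-class testing of the supervisor’s parser could look like. The schema (IPv4 source, region code, reason code), the reason codes, and the names `parse_alert` and `check_all_classes` are all assumptions made up for this example; the point is only that a fixed-size message with a small number of fields can be tested with one representative value per class of each field.

```python
import ipaddress
import itertools
import struct
from typing import NamedTuple, Optional

# Hypothetical alert schema: IPv4 source (4 bytes), region code (u16), reason code (u8).
ALERT = struct.Struct(">4sHB")
KNOWN_REASONS = {1, 2, 3}  # placeholder reason codes


class Alert(NamedTuple):
    source: ipaddress.IPv4Address
    region: int
    reason: int


def parse_alert(raw: bytes) -> Optional[Alert]:
    """Supervisor-side parser: returns an Alert or None, and never raises."""
    if len(raw) != ALERT.size:
        return None
    ip_bytes, region, reason = ALERT.unpack(raw)
    if reason not in KNOWN_REASONS:
        return None
    return Alert(ipaddress.IPv4Address(ip_bytes), region, reason)


# One representative per equivalence class of each field.
IP_CLASSES = [b"\x00\x00\x00\x00", b"\x7f\x00\x00\x01", b"\xc0\x00\x02\x01", b"\xff\xff\xff\xff"]
REGION_CLASSES = [0, 1, 0xFFFF]
REASON_CLASSES = [0, 1, 3, 4, 0xFF]  # below, inside, and above the valid range


def check_all_classes() -> int:
    """Feed every combination of representatives to the parser; return how many were accepted."""
    accepted = 0
    for ip, region, reason in itertools.product(IP_CLASSES, REGION_CLASSES, REASON_CLASSES):
        alert = parse_alert(ALERT.pack(ip, region, reason))
        # The parser must accept exactly the messages with a known reason code.
        assert (alert is not None) == (reason in KNOWN_REASONS)
        accepted += int(alert is not None)
    return accepted
```

Because the sending device can only emit messages of this one shape, the input space the supervisor has to be tested against is the product of a few small field classes rather than arbitrary byte streams.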
At this point we can ask what the attacker can do:
They can maybe send garbage alert messages.
They can maybe send no message at all—i.e. they crash the cluster machine so it fails silent.
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it’s sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
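A minimal sketch of the supervisor side of that heartbeat check, assuming the heartbeat messages have already been authenticated before `beat` is called. `HeartbeatMonitor`, the machine IDs, and the choice to tolerate three missed beats are hypothetical; timestamps are passed in explicitly so the logic is easy to test.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per cluster machine and reports the silent ones."""

    def __init__(self, period_s: float, missed_allowed: int = 3) -> None:
        # A machine is considered silent after this many seconds without a beat.
        self.timeout = period_s * missed_allowed
        self.last_seen: Dict[str, float] = {}

    def beat(self, machine_id: str, now: float) -> None:
        """Record an (already authenticated) heartbeat from a machine."""
        self.last_seen[machine_id] = now

    def silent_machines(self, now: float) -> List[str]:
        """Return machines whose last heartbeat is older than the timeout."""
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(period_s=0.01)  # 100 Hz heartbeat
monitor.beat("cluster-a", now=0.00)
monitor.beat("cluster-b", now=0.00)
monitor.beat("cluster-a", now=0.05)        # cluster-b has stopped sending
assert monitor.silent_machines(now=0.06) == ["cluster-b"]
```

Note that this only detects a machine that fails silent; it does nothing about a compromised machine that keeps sending heartbeats.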
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we’ll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine’s telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won’t be able to detect as erroneous, so whatever action gets taken on belief that a rogue AI exists would be sent to a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is “fire an ICBM”, it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it’s equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn’t need to be hacked in order to fail. If the algorithm we’re using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what’s the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm is more likely to produce a false positive on its own than our hypothetical attacker is to induce one, after all of our mitigations, then maybe this system isn’t even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn’t work reliably enough, hacked or not.
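To show why the detector’s own error rates can dominate, here is a small worked calculation using Bayes’ rule. The numbers are entirely made up for illustration (nobody knows the real base rate or error rates of a rogue AI detector), and `alert_precision` is a hypothetical helper name.

```python
def alert_precision(prior: float, true_positive_rate: float, false_positive_rate: float) -> float:
    """P(a rogue AI is actually present | the detector fired), by Bayes' rule."""
    hit = prior * true_positive_rate
    false_alarm = (1.0 - prior) * false_positive_rate
    return hit / (hit + false_alarm)


# Made-up inputs: 1 in a million inspected sites is actually running a rogue AI,
# the detector catches 99% of those, and it misfires on 0.1% of benign sites.
precision = alert_precision(prior=1e-6, true_positive_rate=0.99, false_positive_rate=1e-3)

# Under these assumptions only about 1 alert in 1000 points at a real rogue AI.
assert 0.0009 < precision < 0.0011
```

With inputs like these, almost every alert is a false positive before any attacker is involved, which is the sense in which the system can fail without being hacked.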
[1] This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.
[3] In order to get the encryption keys, the attacker needs physical access to the system. Attacks that would use a side-channel attack won’t work here, because we don’t accept arbitrary traffic, and in order to generate that traffic, you need the keys. It’s a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.
[5] This is assuming that we’ve verified the device that sends the alert message, which we’ll have high confidence in because of how simple the implementation will be.
Thank you! I appreciate your thoughtful reply, and learned stuff. (This could IMO be a top-level post, maybe, IDK. Like “So you want to build an AI off switch...”.)
> I hope it is clear that the problem isn’t about complexity of a system,
It’s clear that the complexity of a system provides an upper bound on hackability, but doesn’t provide a lower bound: a really simple system isn’t hackable, while a complex (big, Turing complete, etc.) system might or might not be hackable depending on what code is running on it and what inputs it gets. We get lower bounds on hackability by showing ways that the attacker can affect the machine.
> or if the system supports software updates.
Hm. It still seems like actually getting software updates is a big security hole (as opposed to merely hypothetically supporting updates). If you don’t update the software, and you already trust it to not be hackable, then it should stay unhackable. If you do update the software, you could introduce new bugs, or whoever sends the updates could have been compromised and the “update” could be malware. You’re increasing your surface area; the rogue AI could hack the software devs who push the code that’s supposed to use the latest new AI transparency tech to detect rogue AIs. As you point out, there’s a good chance that detecting rogue AIs is at best a very difficult task; this puts pressure to use adaptive methods, methods that you keep updating (e.g. you keep updating with a detector AI that’s been further improved / trained). So it seems like there’s a quadrilemma: either
1. your rogue AI detection methods are imprecise and are quick to pull the trigger, in which case you just immediately shut down all computers; or
2. your methods are imprecise and reluctant to pull the trigger, in which case a rogue AI can probably go unnoticed; or
3. your methods are precise, but you don’t update the watchdogs, meaning you somehow got this difficult task right with no further corrections after a point, without using the newest AI tech or getting to test your system on detecting the newest AI tech; or
4. your methods are precise, which you do by updating the watchdogs, in which case you’re greatly increasing the surface area of your system, and have to be hardened against ongoing attacks on the entire detection software pipeline.
> To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
Or in simpler terms for Eliezer, the TL;DR of anonymousaisafety’s comment is that hacking is not magic, and Hollywood hacking is not real, at least not in how easy it makes hacking look. Effectors do not exist, which is again why hacking human brains instantly isn’t possible.
I don’t think that this TL;DR is particularly helpful.
People think attacks like Rowhammer are viable because security researchers keep releasing papers that say the attacks are viable.
If I posted 1 sentence and said “Rowhammer has too many limitations for it to be usable by an attacker”, I’d be given 30 links to papers with different security researchers all making grandiose claims about how Rowhammer is totally a viable attack, which is why 8 years after the discovery of Rowhammer we’ve had dozens of security researchers reproduce the attack and 0 attacks in the wild[1] that make use of it.
If my other posts haven’t made this clear, I think almost all disagreements in AI x-risk come down to a debate over high-level vs low-level analysis. Many things sound true as a sound-bite or quick rebuttal in a forum post, but I’m arguing from my perspective and career spent working on hardware/software systems that we’ve accumulated enough low-level evidence (“the devil is in the details”) to falsify the high-level claim entirely.
We can argue that just because we don’t know that someone has used Rowhammer (or a similar probabilistic hardware vulnerability) doesn’t mean that someone hasn’t. I don’t know if that’s a useful tangent either. The problem is that people use these side-channel attacks as an “I win” button in arguments about secure software systems, by making it seem like the mere existence of side-channel exploits is proof that security is a lost cause. It isn’t. It isn’t about the intelligence of the adversary, it’s that the target basically needs to be sitting there, helping the attack happen. On any platform where part of the stack is running someone else’s code, yeah, you’re going to get pwned if you just accept arbitrary code, so maybe don’t do that? It is not rocket science.
Define “way more secure”. Like, superhuman-at-security AGIs rewrote the systems to be formally unhackable even taking into account hardware vulnerabilities like Rowhammer that violate the logical chip invariants?
Can you talk a bit about the world global dictatorship running the electromagnetic pulse emitters, and how they monitor every computer in the world? What sort of violence do you envision being inflicted on any countries who don’t want to submit their computers for monitoring? Is part of the plan to use AI drones to kill any political leaders who oppose this plan, so as to minimize civilian casualties? Who controls these AI drones, are we quite sure this world dictatorship stays friendly to its citizens? A lot of political processes leading to such a thing sound like they could potentially be scary.
I said “burn all GPUs” to be frank about these things being scary. It’s easy for things to sound less scary when they’re vague and the processes leading up to them are left vague. See also, George Orwell, “Politics and the English Language”. We can’t evaluate whether you have a less scary proposal until you make a less vague one.
An attempted paraphrase, to hopefully-disentangle some claims:
Eliezer, list of AGI lethalities: pivotal acts are (necessarily?) “outside of the Overton window, or something”[1].
Critch, preceding post: Strategies involving non-Overton elements are not worth it.
Critch, this post: there are pivotal outcomes you can reach via a strategy with no non-Overton elements.
Eliezer, this comment: the “AI immune system” example is not an example of a strategy with no non-Overton elements
Possible reading: Critch/the reader/Eliezer currently wouldn’t be able to name a strategy towards a pivotal outcome, with no non-Overton elements
Extreme version of this: Any practical-in-our-world strategy towards a pivotal outcome necessarily contains some non-Overton elements
[1] Substitute your better characterization of the undesirable property here. I will just use “non-Overton” for the purposes of this comment.
I am only replying to the part of this post about hardware vulnerabilities.
There are dozens of hardware vulnerabilities that exist primarily to pad security researcher’s bibliographies.
Rowhammer, like all of these vulnerabilities, is viable if and only if the following conditions are met:
You know the exact target hardware.
You also know the exact target software, like the OS, running on that hardware.
You need to know what in RAM you’re trying to flip, and where to target to do so, like a page table or some type of bit for the user’s current access level.
The target needs to actually execute your code.
In attacks where security researchers pad themselves for pulling off Rowhammer remotely, it’s because they use JavaScript or WebGL in a browser, and then pwn devices that use browsers. This is flaw almost entirely reserved for general-purpose compute hardware, because embedded software or other application-specific hardware/software systems don’t need or have browsers in them.
In all other attacks, it involves downloading & executing a program on the target machine. Normally the example is given with cloud VMs whose entire purpose is to run code from an external source. Again, this is reserved for general-purpose compute hardware, because systems that execute code only out of read-only memory will not be able to execute an attacker’s code.
There is time on the target to run the Rowhammer attack uninterrupted. It relies on a continuous and degenerate set of instructions. This can be anywhere from minutes to days of time. This means that systems that don’t give uninterrupted time to external code are also not vulnerable.
The target OS, or other software, on the system needs to not perform any type of RAM scrubbing. There are papers claiming that variants of Rowhammer work for systems that use ECC + scrubbing, but those papers also assume that the scrubbing happens over hours. If a system has very little RAM, like an embedded processor, it is feasible for hardware to scrub RAM far faster than that.
These exploits also add the requirement that the attacker needs to know the exact target RAM and ECC algorithm.
The target hardware/software system needs to not have any hardware level redundancy. You can’t rowhammer a system that has 3 separate computers that compare & vote their state on a synchronized clock. Hardware vulnerabilities are probabilistic attacks. They don’t work anymore if the attack must occur simultaneously, and identically on separate physical systems. This is another reason why we’re able to build systems that function despite hostile environments where bit flips are routine, i.e. high radiation.
The target needs to not crash. Semi-randomly flipping bits in roughly the right location in RAM is not something that most software is designed to handle, and in an overwhelming number of cases, trying to execute this attack will crash the system.
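The redundancy condition in the list above can be sketched as a majority vote across replicas. This is a minimal illustration in Python, assuming three replicas that each report one integer state word per clock tick; the names `vote` and `flip_bit` are hypothetical, and a real system would do this in hardware rather than software.

```python
from collections import Counter

def vote(replica_outputs):
    """Return the value reported by a strict majority of replicas, or None."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    # A strict majority is required; a three-way disagreement yields None.
    return value if count * 2 > len(replica_outputs) else None

def flip_bit(word, bit):
    """Simulate a single bit flip, e.g. from Rowhammer or radiation."""
    return word ^ (1 << bit)

good = 0b1010
# One replica suffers a bit flip; the other two still agree.
outputs = [good, flip_bit(good, 3), good]
result = vote(outputs)
```

For the vote to be fooled, the attacker has to land the same flip on at least two physically separate machines within the same tick, which is the "simultaneously and identically" requirement described above.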
It’s not that Rowhammer isn’t possible in the sense that it cannot be shown to work, but it’s like this paper showing that you can create WiFi signals in air-gapped computers. Or this fun paper for Nethammer showing novel attacks that don’t require code execution on the target machine, except they also don’t allow for controlling where bit flips occur, so the “attack” is isomorphic to an especially hostile radiation environment with a high likelihood of bit-flips, and it relies on the ability for the attacker to swarm the target system with a high volume (500 Mbps?) of network traffic that they control—a network switch that drops unexpected traffic or even just rate-limits it will defeat Nethammer. Note that rate-limiting network traffic is in fact standard practice for high stability systems, because it’s also a protection against much more mundane denial-of-service attacks.
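To make the rate-limiting point concrete, here is a minimal token-bucket sketch in Python. The class name, the byte-based accounting, and the numbers are assumptions for illustration only; they are not taken from the Nethammer paper or from any particular switch, and real rate limiting would be done in network hardware.

```python
class TokenBucket:
    """Allow at most `rate` bytes per second on average, with a bounded burst."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # timestamp of the previous packet, in seconds

    def allow(self, size, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # over budget: the packet is dropped

bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
first = bucket.allow(1500, now=0.0)   # fits in the initial burst
flood = bucket.allow(1500, now=0.1)   # only about 100 bytes refilled so far
later = bucket.allow(1500, now=2.0)   # bucket has refilled
```

An attack that depends on sustaining hundreds of megabits per second cannot get that volume through a limiter like this, whatever the payload is.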
Consumer systems are vulnerable to attacks, because consumer systems don’t care about stability. Consumers want to have a fast network connection to the internet. There’s no requirement, or need, for that to be true on a system designed for stability, like something in a satellite, or some other safety-critical role. It is possible to have systems that are effectively “not able to be hacked”—they don’t use general-purpose hardware, they don’t have code that can be modified, they have no capability for executing external code, they include hardware level fault tolerance and redundancy, and they have exceptionally limited I/O. It doesn’t require us presuming “superhuman-at-security AGIs” exist to design these systems.
Every few weeks researchers publish papers carefully documenting the latest side-channel attacks that result in EVERYTHING EVERYWHERE BEING VULNERABLE FOREVER, and every few weeks attackers continue to do the boring old thing of leaving USB drives lying around for a target to pwn themselves, or letting the target just download the malware directly to their machine. They’re almost all just remixes too—it’s “here’s how to do exploit A (the original), but on hardware systems that implemented mitigations for C, D, and E”. Except exploit A still has all of the preconditions I listed above, and now you’ve got whatever new preconditions the security researchers have in their latest paper.
This is an awesome comment, I think it would be great to make it a top-level post. There’s a Facebook group called “Information Security in Effective Altruism” that might also be interested
Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable? (Because it has to take many inputs from the world, and has to be a general computer, not a tiny / hardwired thing.)
I can’t comment on this, because I have no idea what this system would even do, or what it would be looking for, or who would be running it.
What indicates a rogue AI? How is that differentiated from human hackers? How is that different from state-level actors like the NSA? How is it different from some white hat security researchers doing pen-testing in the wild for questionable reasons? What “behavior” is the system trying to look for?
How is it monitoring? What inputs is it taking? Is it only looking at internet traffic? How? Where is it tapping for those packets? There’s a lot of internet traffic. Is this even “a computer”, or is it “an entire facility of computers”? Why are there no humans involved in this process? Where do the logs, monitoring, or other telemetry go if this is done “without humans”? Wouldn’t this system be reporting some type of digest or summary for review? Why do you think that it would need to be a general computer? Do you think that high-performance SoCs do not exist, or that it would not be possible to have a distributed system containing multiple SoCs with the equivalent RAM and compute of some general-purpose machine?
What would it mean to hack this system? Just that it doesn’t report any rogue AI anymore? Or that it erroneously reports the presence of rogue AI? What would hacking it to not report rogue AI look like? Are we concerned that it would log the presence of a rogue AI, but then the rogue AI would be able to wipe that entry from some log file, like a movie? Or are we thinking that the rogue AI is going to disable this system before it is even noticed by the system? Do we count “turning off the power” as “hacking”?
The question here (in this subthread) is whether we could make a system that would shut down computers
There’s a question of what such a system could look like.
Presumably humans would be involved. I was pretty imprecise above; what I meant to ask was:
Suppose there’s a system that can shut down computers running rogue AIs. If this actually stops rogue AI, that’s because there’s a vast amount of monitoring going on. The bulk of that vast amount of monitoring can’t be borne by humans. So there’s a lot of monitoring that’s being done by computers. Those computers are doing something seemingly pretty non-trivial, i.e. detecting AI / rogue AI / hacking (like, we don’t currently know how to do it). Because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker? Clearly with some work you could make a perfect tic-tac-toe player that simply is not hackable without physically tampering with it. It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable. But maybe that’s too vague an analysis… Thoughts?
Yeah, hack it to not raise the alarm during the time period where the AI can be shut off by the EMP being activated. Then the AI disables some EMPs, or arranges or finds some computers to be not in range of EMPs or shielded from the EMPs, or creates or finds computers intrinsically not affected by EMPs (like nanotech mechanical computers or DNA computers, or something less crazy / easier that I didn’t think of), or gains control over the decision-making process for whether to activate the EMPs, or intercepts signals to the EMPs and spoofs the “successfully fired” signals. (Assuming that the AI detection scheme isn’t mooted by any of these already having happened while the AI still looked like what it was supposed to look like—just another AI project—but that already had enough resources for the AI to get a decisive advantage without spreading noticeably.)
Oh, sorry, you’re referring to this:
This just seems like one of those things people say, in the same vein as “melt all of the GPUs”. I think that non-nuclear EMPs are still based on chemical warheads. I don’t know if a “pulse emitter” is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
and then we can put an asterisk on the word “ICBM” and say it’ll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, doesn’t it? It makes me wonder how the system defines “appears to be running rogue AI agents”, because now I wonder what the % chance of false positives is—since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they’re not already on board with this plan.
It’s not really about “tasks”, it’s about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they’re connected to the internet to check for updates, or they send telemetry, or they’re hosting a Minecraft server with log4j.
When you’re analyzing the security of a system, what you’re looking for is “what can the attacker control?”
If the attacker can’t control anything, the system isn’t vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It’s generally safe to assume that if an attacker has physical access to a machine, you’re compromised.[1] So first, we don’t want the attacker to have physical access to these computers. That means they’re in a secure facility, with guards, and badges, and access control on doors, just like you’d see in a tech company’s R&D lab.
That leaves remote attacks. These generally come in two forms:
1. The attacker tricks you into downloading and running some compromised software. For example, visiting a website with malicious JavaScript, or running some untrusted executable you downloaded because it was supposed to be a cheat engine for a video game but it was actually just a keylogger, or the attacker has a malicious payload in a seemingly innocent file type like a Word document or PDF file and it’s going to exploit a bug in the Word program or Adobe Acrobat program that tries to read that file.
2. The attacker sends network traffic to the machine which is able to compromise the machine in some way, generally by exploiting open ports or servers running on the target machine.
All of the attacks in (1) fall under the “when you run untrusted code, you will get pwned” umbrella. There’s a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it’s just like rearranging deck chairs on the Titanic. It doesn’t matter what you do, someone else is going to find a side channel attack and ruin your day if you let them run code on your machine. So don’t do that. This is actually easier than you might think: plenty of systems are “secure” because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they’re effectively static. The software image is flashed to some SoC’s read-only memory (ROM) by an external debugger[2], and there’s no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There’s no user accounts, or “default applications”, or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker’s traffic gets dropped, they can’t control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the “expected” traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.[3]
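A rough sketch of the "fixed-size binary schema plus authentication" idea, in Python. Loud assumption: this uses an HMAC-SHA256 tag from the standard library, which authenticates but does not encrypt, as a stand-in for real authenticated encryption (an AEAD mode with keys held in a hardware security module). The frame layout and the names `seal` and `open_frame` are hypothetical.

```python
import hashlib
import hmac
import struct

# Hypothetical schema: 4-byte message type, 8-byte sequence number, 16-byte payload.
FRAME = struct.Struct("!IQ16s")
TAG_LEN = 32                      # SHA-256 digest size
WIRE_LEN = FRAME.size + TAG_LEN   # every valid message has exactly this length

def seal(key, msg_type, seq, payload):
    """Pack the fixed-size frame and append an HMAC tag (authentication only)."""
    body = FRAME.pack(msg_type, seq, payload)
    return body + hmac.new(key, body, hashlib.sha256).digest()

def open_frame(key, wire):
    """Return (msg_type, seq, payload), or None if the frame must be dropped."""
    if len(wire) != WIRE_LEN:
        return None  # wrong size: drop without parsing anything
    body, tag = wire[:FRAME.size], wire[FRAME.size:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # sender did not hold the key, or the frame was modified
    return FRAME.unpack(body)

key = b"k" * 32  # placeholder; a real key would never appear in source code
wire = seal(key, 1, 7, b"hello")
tampered = bytes([wire[0] ^ 1]) + wire[1:]
```

The point is the order of the checks: length first, tag second, and only then does any field get interpreted. Traffic from someone without the key never reaches the parsing step.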
With that in mind, I hope it is clear that the problem isn’t about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we’re discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it’s evidence of a rogue AI.[4] We can’t just drop everything in hardware and ignore messages that don’t fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I’d probably approach it with the following thoughts:
I’d design the system so that the traffic to be inspected is distributed across a cluster of computers. This is both to prevent denial-of-service attacks and to prevent targeted attacks like Nethammer. If the system is receiving traffic in excess of what the cluster is designed to handle, it’ll be dropped—while this represents a degradation in monitoring capability, it can be fixed by adding additional machines to the cluster as needed. It’s often useful to gracefully degrade capabilities when a system is under load.
The computers in the cluster that receive traffic have a simple role: send an alert message to a central supervisor if suspicious traffic is detected.
The alert message uses authenticated encryption.
The central supervisor is only communicated with via authenticated encryption messages on a local network. It is not connected to the wider network and it is not possible for the attacker to send traffic to the supervisor machine, unless they first compromised one of the cluster machines.
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let’s fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We’ll define a binary schema for that information and we’ll stick a serial device or FPGA onto each cluster machine.
The cluster machines send the alert message by setting the relevant fields of the message via the serial device or FPGA registers, and it’s that device which sends the actual alert. Now if the cluster machine is compromised, all the attacker can do is change fields in the alert message, but they can’t send different messages entirely.[5]
We test the central supervisor to confirm that any permutation of the alert message is still safe to receive. For testing permutations, we normally use a method based on equivalence classes.
So now attackers can maybe compromise cluster machines because they control the traffic that we’re trying to monitor, but they can’t use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn’t let you change the traffic sent to a supervisor in a way that would be dangerous.
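As an illustration of what "define a binary schema" for the alert could look like, here is a small Python sketch. The field list (IPv4 source, latitude and longitude in microdegrees, one reason code) and the `Reason` values are assumptions made up for the example. The supervisor-side parser rejects anything outside the schema, so a compromised sender can only choose among valid field values.

```python
import ipaddress
import struct
from enum import IntEnum

class Reason(IntEnum):
    # Hypothetical reason codes; the real list would be fixed at design time.
    PORT_SCAN = 1
    ANOMALOUS_VOLUME = 2
    KNOWN_SIGNATURE = 3

# IPv4 address (4 bytes), latitude and longitude in microdegrees, reason code.
ALERT = struct.Struct("!4siiB")

def encode_alert(ip, lat_udeg, lon_udeg, reason):
    """What the serial device or FPGA would emit once its fields are set."""
    return ALERT.pack(ipaddress.IPv4Address(ip).packed, lat_udeg, lon_udeg, int(reason))

def parse_alert(wire):
    """Supervisor side: accept only well-formed alerts, otherwise return None."""
    if len(wire) != ALERT.size:
        return None
    ip_raw, lat, lon, reason = ALERT.unpack(wire)
    if not -90_000_000 <= lat <= 90_000_000:
        return None
    if not -180_000_000 <= lon <= 180_000_000:
        return None
    try:
        reason = Reason(reason)
    except ValueError:
        return None  # unknown reason code
    return str(ipaddress.IPv4Address(ip_raw)), lat, lon, reason

wire = encode_alert("192.0.2.7", 48_858_000, 2_294_000, Reason.PORT_SCAN)
```

Each field has a small number of equivalence classes (in range, below range, above range, unknown code), which is what makes testing the supervisor against every class of input tractable.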
At this point we can ask what the attacker can do:
They can maybe send garbage alert messages.
They can maybe send no message at all—i.e. they crash the cluster machine so it fails silent.
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it’s sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
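The supervisor side of the heartbeat check could look roughly like this in Python. Timestamps are passed in explicitly so the logic is easy to test; the class name, the 100 Hz period, and the "three missed beats" threshold are assumptions for the sketch, and authentication of the heartbeat itself is left out.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per machine and report the ones that went quiet."""

    def __init__(self, period_s=0.01, missed_allowed=3):
        # 100 Hz heartbeat; tolerate a few missed beats before flagging.
        self.timeout = period_s * missed_allowed
        self.last_seen = {}

    def beat(self, machine_id, now):
        """Record an (already authenticated) heartbeat received at time `now`."""
        self.last_seen[machine_id] = now

    def silent(self, now):
        """Return the machines whose last heartbeat is older than the timeout."""
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)

monitor = HeartbeatMonitor()
monitor.beat("cluster-a", now=0.00)
monitor.beat("cluster-b", now=0.00)
monitor.beat("cluster-a", now=0.05)   # cluster-b has stopped sending
quiet = monitor.silent(now=0.06)
```

This only catches the fail-silent case. A machine that keeps sending heartbeats while suppressing alerts is not detected by it.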
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we’ll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine’s telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won’t be able to detect as erroneous, so whatever action gets taken on belief that a rogue AI exists would be sent to a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is “fire an ICBM”, it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it’s equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn’t need to be hacked in order to fail. If the algorithm we’re using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what’s the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm has a higher likelihood of a false positive than our hypothetical attacker, after all of our mitigations, then maybe this system isn’t even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn’t work reliably enough, hacked or not.
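To put rough numbers on the margin-of-error question, here is a small Bayes-rule calculation in Python. Every number in it is invented for illustration; nobody knows the real base rate or error rates for a "rogue AI detector", which is part of the point.

```python
def alert_precision(prior, true_positive_rate, false_positive_rate):
    """Probability that an alert is a true detection, by Bayes' rule."""
    true_alerts = prior * true_positive_rate
    false_alerts = (1.0 - prior) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)

# Assumed, made-up numbers: one in a million monitored flows is actually a
# rogue AI, the detector catches 99% of those, and it misfires on 0.1% of
# benign flows.
precision = alert_precision(prior=1e-6, true_positive_rate=0.99, false_positive_rate=0.001)
```

Under these assumptions roughly one alert in a thousand is real, before any attacker has done anything. If the response to an alert is expensive or destructive, the detector's own error rate dominates the risk.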
[1] This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.
[2] When I say “external debugger”, this is implying physical access to the system.
[3] In order to get the encryption keys, the attacker needs physical access to the system. Attacks that would use a side-channel attack won’t work here, because we don’t accept arbitrary traffic, and in order to generate that traffic, you need the keys. It’s a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.
[4] To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
[5] This is assuming that we’ve verified the device that sends the alert message, which we’ll have high confidence in because of how simple the implementation will be.
Thank you! I appreciate your thoughtful reply, and learned stuff. (This could IMO be a top-level post, maybe, IDK. Like “So you want to build an AI off switch...”.)
> I hope it is clear that the problem isn’t about complexity of a system,
It’s clear that the complexity of a system provides an upper bound on hackability, but doesn’t provide a lower bound: a really simple system isn’t hackable, while a complex (big, Turing complete, etc.) system might or might not be hackable depending on what code is running on it and what inputs it gets. We get lower bounds on hackability by showing ways that the attacker can affect the machine.
> or if the system supports software updates.
Hm. It still seems like actually getting software updates is a big security hole (as opposed to merely hypothetically supporting updates). If you don’t update the software, and you already trust it to not be hackable, then it should stay unhackable. If you do update the software, you could introduce new bugs, or whoever sends the updates could have been compromised and the “update” could be malware. You’re increasing your surface area; the rogue AI could hack the software devs who push the code that’s supposed to use the latest new AI transparency tech to detect rogue AIs. As you point out, there’s a good chance that detecting rogue AIs is at best a very difficult task; this puts pressure to use adaptive methods, methods that you keep updating (e.g. you keep updating with a detector AI that’s been further improved / trained). So it seems like there’s a quadrilemma: either
1. your rogue AI detection methods are imprecise and are quick to pull the trigger, in which case you just immediately shut down all computers; or
2. your methods are imprecise and reluctant to pull the trigger, in which case a rogue AI can probably go unnoticed; or
3. your methods are precise, but you don’t update the watchdogs, meaning you somehow got this difficult task right with no further corrections after a point, without using the newest AI tech or getting to test your system on detecting the newest AI tech; or
4. your methods are precise, which you do by updating the watchdogs, in which case you’re greatly increasing the surface area of your system, and have to be hardened against ongoing attacks on the entire detection software pipeline.
> To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
Me neither.
Or in simpler terms for Eliezer, the TL;DR of anonymousaisafety’s comment is that hacking is not magic, and Hollywood hacking is not real insofar as its ease of hacking is concerned. Effectors do not exist, which is again why hacking human brains instantly isn’t possible.
I don’t think that this TL;DR is particularly helpful.
People think attacks like Rowhammer are viable because security researchers keep releasing papers that say the attacks are viable.
If I posted 1 sentence and said “Rowhammer has too many limitations for it to be usable by an attacker”, I’d be given 30 links to papers with different security researchers all making grandiose claims about how Rowhammer is totally a viable attack, which is why 8 years after the discovery of Rowhammer we’ve had dozens of security researchers reproduce the attack and 0 attacks in the wild[1] that make use of it.
If my other posts haven’t made this clear, I think almost all disagreements in AI x-risk come down to a debate over high-level vs low-level analysis. Many things sound true as a sound-bite or quick rebuttal in a forum post, but I’m arguing from my perspective and career spent working on hardware/software systems that we’ve accumulated enough low-level evidence (“the devil is in the details”) to falsify the high-level claim entirely.
We can argue that just because we don’t know that someone has used Rowhammer, or a similar probabilistic hardware vulnerability, doesn’t mean that someone hasn’t. I don’t know if that’s a useful tangent either. The problem is that people use these side-channel attacks as an “I win” button in arguments about secure software systems by making it seem like the existence of side channel exploits is therefore proof that security is a lost cause. It isn’t. It isn’t about the intelligence of the adversary, it’s that the target basically needs to be sitting there, helping the attack happen. On any platform where part of the stack is running someone else’s code, yeah, you’re going to get pwned if you just accept arbitrary code, so maybe don’t do that? It is not rocket science.
Thanks, I’ll retract that comment.
This is not only a bad summary, it’s extraordinarily toxic and uncharitable.