Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable? (Because it has to take many inputs from the world, and has to be a general computer, not a tiny / hardwired thing.)
> Would you agree that if there were a system that could automatically (without humans) monitor for rogue AI, then that system is probably hackable?
I can’t comment on this, because I have no idea what this system would even do, or what it would be looking for, or who would be running it.
What indicates a rogue AI? How is that differentiated from human hackers? How is that different from state-level actors like the NSA? How is it different from some white hat security researchers doing pen-testing in the wild for questionable reasons? What “behavior” is the system trying to look for?
How is it monitoring? What inputs is it taking? Is it only looking at internet traffic? How? Where is it tapping for those packets? There’s a lot of internet traffic. Is this even “a computer”, or is it “an entire facility of computers”? Why are there no humans involved in this process? Where do the logs, monitoring, or other telemetry go if this is done “without humans”? Wouldn’t this system be reporting some type of digest or summary for review? Why do you think that it would need to be a general computer? Do you think that high-performance SoCs do not exist, or that it would not be possible to have a distributed system containing multiple SoCs with the equivalent RAM and compute of some general-purpose machine?
What would it mean to hack this system? Just that it doesn’t report any rogue AI anymore? Or that it erroneously reports the presence of rogue AI? What would hacking it to not report rogue AI look like? Are we concerned that it would log the presence of a rogue AI, but then the rogue AI would be able to wipe that entry from some log file, like a movie? Or are we thinking that the rogue AI is going to disable this system before it is even noticed by the system? Do we count “turning off the power” as “hacking”?
The question here (in this subthread) is whether we could make a system that would shut down computers “appearing to be running rogue AI agents.”
There’s a question of what such a system could look like.
> Why are there no humans involved in this process?
Presumably humans would be involved. I was pretty imprecise above; what I meant to ask was:
Suppose there’s a system that can shut down computers running rogue AIs. If this actually stops rogue AI, that’s because there’s a vast amount of monitoring going on. The bulk of that vast amount of monitoring can’t be borne by humans. So there’s a lot of monitoring that’s being done by computers. Those computers are doing something seemingly pretty non-trivial, i.e. detecting AI / rogue AI / hacking (like, we don’t currently know how to do it). Because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker? Clearly with some work you could make a perfect tic-tac-toe player that simply is not hackable without physically tampering with it. It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable. But maybe that’s too vague an analysis… Thoughts?
> What would it mean to hack this system?
Yeah, hack it to not raise the alarm during the time period where the AI can be shut off by the EMP being activated. Then the AI disables some EMPs, or arranges or finds some computers to be not in range of EMPs or shielded from the EMPs, or creates or finds computers intrinsically not affected by EMPs (like nanotech mechanical computers or DNA computers, or something less crazy / easier that I didn’t think of), or gains control over the decision-making process for whether to activate the EMPs, or intercepts signals to the EMPs and spoofs the “successfully fired” signals. (Assuming that the AI detection scheme isn’t mooted by any of these already having happened while the AI still looked like what it was supposed to look like—just another AI project—but that already had enough resources for the AI to get a decisive advantage without spreading noticeably.)
Oh, sorry, you’re referring to this:

> includes a distributed network of non-nuclear electromagnetic pulse emitters that will physically shut down any tech infrastructure appearing to be running rogue AI agents.
This just seems like one of those things people say, in the same vein as “melt all of the GPUs”. I think that non-nuclear EMPs are still based on chemical warheads. I don’t know if a “pulse emitter” is a thing that someone could build. Like I think what this sentence actually says is equivalent to saying
> includes a distributed network of non-nuclear ICBMs that will be physically shot at any target believed to be running a rogue AI agent
and then we can put an asterisk on the word “ICBM” and say it’ll cause an EMP at the detonation site, and only a small explosion.
But you can see how this now has a different tone to it, can’t you? It makes me wonder how the system defines “appears to be running rogue AI agents”, because now I wonder what the % chance of false positives is—since on a false positive, the system launches a missile.
What happens if this hypothetical system is physically located in the United States, but the rogue AI is believed to be in China or Russia? Does this hypothetical system fire a missile into another country? That seems like it could be awkward if they’re not already on board with this plan.
> because they’re doing something pretty non-trivial, they probably have to be big complex systems. Because they’re big complex systems, they’re hackable. Does this sound right to you? I’m mostly asking you about the step “detecting rogue AI implies hackable”. Or to expand the question, for what tasks XYZ can you feasibly design a system that does XYZ, but is really seriously not hackable even by a significantly superhuman hacker?
It’s not really about “tasks”, it’s about how the hardware/software system is designed. Even a trivial task, if done on a general-purpose computer, with a normal network switch, the OS firewall turned off, etc, is going to be vulnerable to whatever exploits exist for applications or libraries running on that computer. Those applications or libraries expose vulnerabilities on a general-purpose computer because they’re connected to the internet to check for updates, or they send telemetry, or they’re hosting a Minecraft server with log4j.
> It seems like you could not feasibly make an unhackable system that takes a bunch of inputs from another (unsafe) system and processes them in a bunch of complex ways using software that someone is constantly updating, because having the ability to update to the latest Detect-O-Matic-v3.4 without knowing in advance what sort of thing the Detect-O-Matic is, beyond that it’s software, seems to imply being Turing-completely programmable, which seems to imply being hackable.
When you’re analyzing the security of a system, what you’re looking for is “what can the attacker control?”
If the attacker can’t control anything, the system isn’t vulnerable.
We normally distinguish between remote attacks (e.g. over a network) and physical attacks (e.g. due to social engineering or espionage or whatever). It’s generally safe to assume that if an attacker has physical access to a machine, you’re compromised.[1] So first, we don’t want the attacker to have physical access to these computers. That means they’re in a secure facility, with guards, and badges, and access control on doors, just like you’d see in a tech company’s R&D lab.
That leaves remote attacks. These generally come in two forms:
1. The attacker tricks you into downloading and running some compromised software. For example, visiting a website with malicious JavaScript, or running some untrusted executable you downloaded because it was supposed to be a cheat engine for a video game but it was actually just a keylogger, or the attacker has a malicious payload in a seemingly innocent file type like a Word document or PDF file and it’s going to exploit a bug in the Word program or Adobe Acrobat program that tries to read that file.
2. The attacker sends network traffic to the machine which is able to compromise the machine in some way, generally by exploiting open ports or servers running on the target machine.
All of the attacks in (1) fall under the “when you run untrusted code, you will get pwned” umbrella. There’s a bunch of software mitigations for trying to make this not terrible, like admin users vs non-admin users, file system permissions, VM sandboxing, etc, but ultimately it’s just like rearranging deck chairs on the Titanic. It doesn’t matter what you do, someone else is going to find a side channel attack and ruin your day if you let them run code on your machine. So don’t do that. This is actually easier than you might think: plenty of systems are “secure” because they run an incredibly minimal Linux OS (or some RTOS or even just bare metal) and they’re effectively static—the software image is flashed to some SoC’s read-only memory (ROM) by an external debugger[2], and there’s no capability from within the software to write or update that memory. The processor is not configured for running code outside of that ROM. There’s no user accounts, or “default applications”, or browsers, or anything else other than the code for performing the actual task required by the system.
For (2), in cases where the system is not supposed to be receiving arbitrary traffic, we solve this by using networking hardware downstream of the system that drops any unexpected traffic, and rate-limits everything. If the attacker’s traffic gets dropped, they can’t control anything. The rate-limiting is to prevent a denial-of-service based on traffic we expect. For the “expected” traffic, the ideal mechanism to prevent abuse is a binary schema for valid messages, a fixed MTU size on messages, and the use of authenticated encryption so that the attacker is unable to generate the traffic without access to encryption keys. Encryption keys can be stored in hardware security modules that are physically located on the senders & receivers within the system.[3]
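To make that concrete, here is a rough sketch of the kind of “expected traffic” handling I mean: a fixed-size binary schema plus authenticated encryption. This is a sketch, not a real implementation; the key is a stand-in for whatever the hardware security module would actually hold, and the field names and sizes are made up.

```python
# Sketch only: fixed-size binary schema plus authenticated encryption
# (ChaCha20-Poly1305 from the `cryptography` package).
import os
import struct
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

# Hypothetical schema: sender id, sequence number, status code, padded to 16 bytes.
SCHEMA = struct.Struct("!I Q H 2x")      # always exactly 16 bytes
KEY = os.urandom(32)                     # stand-in for a key held in the HSM

def encode(sender_id: int, seq: int, status: int) -> bytes:
    plaintext = SCHEMA.pack(sender_id, seq, status)
    nonce = os.urandom(12)
    # Without KEY, an attacker cannot produce a message that authenticates.
    return nonce + ChaCha20Poly1305(KEY).encrypt(nonce, plaintext, b"monitor-v1")

def decode(wire: bytes):
    # Wrong length, failed authentication, or a bad unpack all mean: drop it.
    if len(wire) != 12 + SCHEMA.size + 16:      # nonce + payload + auth tag
        return None
    nonce, ciphertext = wire[:12], wire[12:]
    try:
        plaintext = ChaCha20Poly1305(KEY).decrypt(nonce, ciphertext, b"monitor-v1")
    except Exception:
        return None
    return SCHEMA.unpack(plaintext)
```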
With that in mind, I hope it is clear that the problem isn’t about complexity of a system, or the number of inputs that the system has, or if the system supports software updates.
The actual issue with the system we’re discussing is that the attacker can control their network traffic, and in this hypothetical, we need to look at arbitrary network traffic to decide if it’s evidence of a rogue AI.[4] We can’t just drop everything in hardware and ignore messages that don’t fit an authenticated encryption scheme, because we want to read that traffic.
So we have to assume that the attacker can generate arbitrary payloads, send them to our system, and our system is going to need to not get pwned.
I think I’d probably approach it with the following thoughts:
I’d design the system so that the traffic to be inspected is distributed across a cluster of computers. This is both to prevent denial-of-service attacks and to prevent targeted attacks like Nethammer. If the system is receiving traffic in excess of what the cluster is designed to handle, it’ll be dropped—while this represents a degradation in monitoring capability, it can be fixed by adding additional machines to the cluster as needed. It’s often useful to gracefully degrade capabilities when a system is under load.
The computers in the cluster that receive traffic have a simple role: send an alert message to a central supervisor if suspicious traffic is detected.
The alert message uses authenticated encryption.
The central supervisor is only communicated with via authenticated encryption messages on a local network. It is not connected to the wider network and it is not possible for the attacker to send traffic to the supervisor machine, unless they first compromised one of the cluster machines.
The supervisor machine is pretty safe unless a cluster machine is compromised. If a cluster machine is compromised, maybe we think that the attacker can control the alert message, and in doing so, control traffic sent to the central supervisor? Let’s fix that. What do we need for the alert message? Maybe the IP of the packet, geolocation information, reasons why it was suspicious? Etc. We’ll define a binary schema for that information and we’ll stick a serial device or FPGA onto each cluster machine.
The cluster machines send the alert message by setting the relevant fields of the message via the serial device or FPGA registers, and it’s that device which sends the actual alert. Now if the cluster machine is compromised, all the attacker can do is change fields in the alert message, but they can’t send different messages entirely.[5]
We test the central supervisor to confirm that any permutation of the alert message is still safe to receive. For testing permutations, we normally use a method based on equivalence classes.
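For a sense of what that looks like, something like the sketch below, where the field names and the equivalence classes themselves are hypothetical and the real lists would be derived from the alert schema:

```python
# Sketch of equivalence-class testing for the supervisor's alert handler:
# one representative value per class of each field, then every combination.
from itertools import product

FIELD_CLASSES = {
    "src_ip": [b"\x00\x00\x00\x00", b"\x0a\x00\x00\x01", b"\xff\xff\xff\xff"],  # zero / private / broadcast
    "lat":    [-90.0, 0.0, 90.0, float("nan")],                                  # boundaries plus NaN
    "lon":    [-180.0, 0.0, 180.0, float("inf")],
    "reason": [0, 1, 0x7FFF, 0xFFFF],                                            # min / typical / max
}

def test_alert_permutations(handle_alert):
    # `handle_alert` is a stand-in name for the supervisor's entry point.
    for src_ip, lat, lon, reason in product(*FIELD_CLASSES.values()):
        # Must return a normal outcome for every combination; any exception
        # or hang here fails the test.
        outcome = handle_alert(src_ip=src_ip, lat=lat, lon=lon, reason=reason)
        assert outcome in ("accepted", "rejected")
```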
So now attackers can maybe compromise cluster machines because they control the traffic that we’re trying to monitor, but they can’t use a compromised cluster machine to control traffic to the central supervisor, because compromising a cluster machine doesn’t let you change the traffic sent to a supervisor in a way that would be dangerous.
At this point we can ask what the attacker can do:
1. They can maybe send garbage alert messages.
2. They can maybe send no message at all—i.e. they crash the cluster machine so it fails silent.
We can mitigate the 2nd problem by requiring a heartbeat from each cluster machine to the central supervisor. A heartbeat is another message sent via authenticated encryption, but it’s sent at a known frequency, like 100 Hz. If an attacker or any other event crashes the machine, the central supervisor will know.
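A sketch of the supervisor’s side of that, with the period and the miss threshold as made-up numbers:

```python
# Sketch: supervisor-side heartbeat tracking. A cluster machine that goes
# quiet gets flagged, whether it crashed on its own or was taken down.
import time

HEARTBEAT_PERIOD_S = 0.01   # 100 Hz, as above
MISSED_LIMIT = 3            # consecutive missed beats before we flag the machine

last_seen: dict[str, float] = {}   # machine id -> time of last authenticated heartbeat

def on_heartbeat(machine_id: str) -> None:
    # Called only after the heartbeat message passes authenticated decryption.
    last_seen[machine_id] = time.monotonic()

def silent_machines() -> list[str]:
    cutoff = MISSED_LIMIT * HEARTBEAT_PERIOD_S
    now = time.monotonic()
    return [m for m, t in last_seen.items() if now - t > cutoff]
```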
If the attacker sends a garbage alert message, what is the downside? It could be bad if this system launches missiles, so maybe we’ll have human review on any alert that the central supervisor receives, prior to any action being taken by the system. The human review can include a log of the network traffic that generated the alert, captured by the cluster machine, and a diagnosis of the cluster machine’s telemetry for how it concluded the traffic was suspicious. This is intended to mitigate the possibility of a false positive where the attacker can control where we think a rogue AI is located.
Now, where we stand is that an attacker can maybe compromise a cluster machine, and in doing so, they can maybe generate an erroneous alert message for a rogue AI, which a human review maybe won’t be able to detect as erroneous, so whatever action gets taken on the belief that a rogue AI exists would be directed at a location controlled by the attacker. The magnitude of how bad that is depends on what action is taken. If it is “fire an ICBM”, it could be pretty bad.
The other possibility is that an attacker compromises a machine in such a way that no alert is generated, but no heartbeat is missed either. This is an interesting failure mode because it’s equivalent to another failure mode that we have not discussed: the system could simply fail to detect the behavior of a rogue AI. In other words, this system doesn’t need to be hacked in order to fail. If the algorithm we’re using for detecting rogue AI has any margin for error at all, then it should be no surprise to anyone that this system is susceptible to a false negative, i.e.: it did not send an alert when it should have.
The relevant question becomes, what’s the margin of error on the hypothetical algorithm for detecting a rogue AI? If that algorithm is more likely to produce a false positive on its own than our hypothetical attacker is to induce one after all of our mitigations, then maybe this system isn’t even worth building, depending on how high that likelihood is. Likewise, if the algorithm has a high likelihood of a false negative, such that a rogue AI can reasonably evade detection due to not being suspicious enough, then this system might still not be worth building, because now the most likely failure mode is that the system just doesn’t work reliably enough, hacked or not.
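To put made-up numbers on the false-positive side of that, just to show the shape of the problem:

```python
# Back-of-the-envelope arithmetic with hypothetical numbers.
flows_inspected_per_day = 1e9   # assumed traffic volume for one monitoring site
false_positive_rate = 1e-6      # assumed per-flow false-positive rate of the detector

false_alarms_per_day = flows_inspected_per_day * false_positive_rate
print(false_alarms_per_day)     # 1000.0 -- a thousand bogus alerts per day for review
```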
[1] This is a safe assumption because if the attacker has physical access to a machine, they can do dumb things like pull out the hard drive and put it into another computer that they control. Enable encryption on your drives.

[2] When I say “external debugger”, this is implying physical access to the system.

[3] In order to get the encryption keys, the attacker needs physical access to the system. Side-channel attacks won’t work here, because we don’t accept arbitrary traffic, and in order to generate that traffic, you need the keys. It’s a catch-22, unless you have physical access. But if you have physical access, we assume the system is compromised.

[4] To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.

[5] This is assuming that we’ve verified the device that sends the alert message, which we’ll have high confidence in because of how simple the implementation will be.
Thank you! I appreciate your thoughtful reply, and learned stuff. (This could IMO be a top-level post, maybe, IDK. Like “So you want to build an AI off switch...”.)
> I hope it is clear that the problem isn’t about complexity of a system,
It’s clear that the complexity of a system provides an upper bound on hackability, but doesn’t provide a lower bound: a really simple system isn’t hackable, while a complex (big, Turing complete, etc.) system might or might not be hackable depending on what code is running on it and what inputs it gets. We get lower bounds on hackability by showing ways that the attacker can affect the machine.
> or if the system supports software updates.
Hm. It still seems like actually getting software updates is a big security hole (as opposed to merely hypothetically supporting updates). If you don’t update the software, and you already trust it to not be hackable, then it should stay unhackable. If you do update the software, you could introduce new bugs, or whoever sends the updates could have been compromised and the “update” could be malware. You’re increasing your surface area; the rogue AI could hack the software devs who push the code that’s supposed to use the latest new AI transparency tech to detect rogue AIs. As you point out, there’s a good chance that detecting rogue AIs is at best a very difficult task; this puts pressure to use adaptive methods, methods that you keep updating (e.g. you keep updating with a detector AI that’s been further improved / trained). So it seems like there’s a quadrilemma: either
1. your rogue AI detection methods are imprecise and are quick to pull the trigger, in which case you just immediately shut down all computers; or
2. your methods are imprecise and reluctant to pull the trigger, in which case a rogue AI can probably go unnoticed; or
3. your methods are precise, but you don’t update the watchdogs, meaning you somehow got this difficult task right with no further corrections after a point, without using the newest AI tech or getting to test your system on detecting the newest AI tech; or
4. your methods are precise, which you do by updating the watchdogs, in which case you’re greatly increasing the surface area of your system, and have to be hardened against ongoing attacks on the entire detection software pipeline.
> To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
Me neither.