Thank you! I appreciate your thoughtful reply, and learned stuff. (This could IMO be a top-level post, maybe, IDK. Like “So you want to build an AI off switch...”.)
> I hope it is clear that the problem isn’t about complexity of a system,
It’s clear that the complexity of a system provides an upper bound on hackability, but doesn’t provide a lower bound: a really simple system isn’t hackable, while a complex (big, Turing complete, etc.) system might or might not be hackable depending on what code is running on it and what inputs it gets. We get lower bounds on hackability by showing ways that the attacker can affect the machine.
> or if the system supports software updates.
Hm. It still seems like actually getting software updates is a big security hole (as opposed to merely hypothetically supporting updates). If you don’t update the software, and you already trust it to not be hackable, then it should stay unhackable. If you do update the software, you could introduce new bugs, or whoever sends the updates could have been compromised and the “update” could be malware. You’re increasing your attack surface; for example, the rogue AI could hack the software devs who push the code that’s supposed to use the latest AI transparency tech to detect rogue AIs. As you point out, there’s a good chance that detecting rogue AIs is at best a very difficult task; this creates pressure to use adaptive methods, i.e. methods that you keep updating (e.g. you keep swapping in a detector AI that’s been further improved / trained). So it seems like there’s a quadrilemma: either
1. your rogue AI detection methods are imprecise and are quick to pull the trigger, in which case you just immediately shut down all computers; or
2. your methods are imprecise and reluctant to pull the trigger, in which case a rogue AI can probably go unnoticed; or
3. your methods are precise, but you don’t update the watchdogs, meaning you somehow got this difficult task right with no further corrections after a point, without using the newest AI tech or getting to test your system on detecting the newest AI tech; or
4. your methods are precise, which you achieve by updating the watchdogs, in which case you’re greatly increasing the attack surface of your system, and the entire detection software pipeline has to be hardened against ongoing attacks.
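To make the shape of the quadrilemma concrete, here’s a toy sketch in Python. The function name and the three boolean flags are labels I made up for illustration; this isn’t modeling any real detection system, it just shows that every combination of choices lands in one of the four cases above.

```python
from itertools import product


def quadrilemma_case(precise: bool, trigger_happy: bool, updated: bool) -> str:
    """Map a (hypothetical) watchdog design onto one of the four horns."""
    if not precise:
        # Imprecise methods: whether you update doesn't rescue you.
        if trigger_happy:
            return "1: quick trigger, so you immediately shut down all computers"
        return "2: reluctant trigger, so a rogue AI can probably go unnoticed"
    # Precise methods: the question is how you got (and stay) precise.
    if not updated:
        return "3: no updates, so you had to get it right with no further corrections"
    return "4: ongoing updates, so the whole detection pipeline is attack surface"


# Enumerate every combination of the three flags and show where it lands.
for precise, trigger_happy, updated in product([False, True], repeat=3):
    case = quadrilemma_case(precise, trigger_happy, updated)
    print(f"precise={precise!s:5} trigger_happy={trigger_happy!s:5} "
          f"updated={updated!s:5} -> {case}")
```

All eight flag combinations collapse into the four cases, which is the point: as far as I can tell there’s no setting of the knobs that escapes the list.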
> To be clear, I am not convinced that “evidence of a rogue AI” is a meaningful description of behavior.
Me neither.