An AGI kill switch with defined security properties
This is a proposal for creating an AGI kill switch that has definable security properties.
Take a combination of symmetric fully homomorphic encryption (FHE) and indistinguishability obfuscation and use it to train and run inference for an AGI.
Take the encryption key and split it so that a quorum of m of n shards are required to submit or receive clear text data to the FHE AGI.
Quorum reconstruction and every use of the key take place inside a program that has been obfuscated with indistinguishability obfuscation (IO), so that the clear text key is never seen.
Secretly distribute the key shards to diverse custodians who will monitor the AI’s output and input and look for signs of bad behavior.
If the custodians note bad behavior by the FHE AGI, they destroy enough key shards (at least n - m + 1 of them) to prevent a successful key quorum. This key destruction ideally results in the AGI being unable to legibly understand or manipulate the world.
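As a rough illustration of the m of n key splitting and the "destroy shards to kill the quorum" step, here is a minimal sketch using Shamir secret sharing. Shamir's scheme is my assumption for how the split could be done; the proposal does not name one. The function names `split_secret` and `recover_secret` and the field prime are hypothetical choices, and unlike the proposal this toy reconstructs the key in the clear rather than inside an obfuscated program.

```python
import secrets

# Toy prime field (2**127 - 1 is a Mersenne prime). An assumption for
# illustration only, not a recommended parameter choice.
PRIME = 2**127 - 1

def split_secret(secret, m, n):
    """Split `secret` into n shards so that any m of them recover it."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in [0, PRIME)")
    # Random polynomial of degree m - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
    shards = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shards.append((x, y))
    return shards

def recover_secret(shards):
    """Lagrange-interpolate the polynomial at x = 0 from the given shards."""
    secret = 0
    for i, (xi, yi) in enumerate(shards):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shards):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

if __name__ == "__main__":
    key = secrets.randbelow(PRIME)
    shards = split_secret(key, m=3, n=5)
    print(recover_secret(shards[:3]) == key)  # any 3 shards form a quorum
    survivors = shards[:2]                    # 3 of 5 shards destroyed
    print(recover_secret(survivors) == key)   # quorum no longer reachable
```

With 3 of 5 shards destroyed, the two survivors interpolate to an unrelated value except with negligible probability, which is the property the kill switch relies on.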
Limitations include:
Someone breaking IO, stealing m of the n key shards, and releasing the AGI.
The AGI breaking the assumptions of the relevant cryptographic systems to escape.
Knowing what intelligence threshold is required to break the cryptographic system and keeping the AI below that.
The difficulty of predicting the consequences of the AGI’s decrypted real world behavior to screen for malfeasance.
Extremely high computation requirements for FHE and IO systems making the overall system economically noncompetitive with normal AGI.
Deciding who the key shard guardians are and what properties they have.
Deciding and recognizing what behaviors count as malignant.
The bottleneck of running all FHE AGI queries and answers through a quorum of key guardians and malignancy filters.
The tradeoff between the security of the system (more keys is better) and the usability/stability of the system (fewer keys is better).
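To put rough numbers on that last tradeoff, here is a small sketch that treats each custodian as independently available with some probability and independently compromised with some other probability. Both probabilities, the independence assumption, and the names `prob_at_least` and `quorum_tradeoff` are hypothetical; real custodians are unlikely to fail independently.

```python
from math import comb

def prob_at_least(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def quorum_tradeoff(m, n, p_available, p_compromised):
    """Chance the quorum can be formed vs. chance an attacker gets m shards.

    Assumes (hypothetically) that custodians behave independently.
    """
    return {
        "usable": prob_at_least(m, n, p_available),
        "breached": prob_at_least(m, n, p_compromised),
    }

if __name__ == "__main__":
    # Made-up inputs: 95% availability, 5% compromise rate per custodian.
    for m, n in [(2, 3), (3, 5), (5, 9), (8, 9)]:
        print((m, n), quorum_tradeoff(m, n, 0.95, 0.05))
```

Raising m relative to n pushes the "breached" figure down but also pushes "usable" down, which is the tension described in the list above.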
I would love feedback. Is it impractical, impossible, not even wrong? What other limitations are there to the idea?
Can you nest different FHE systems with differing assumptions so that if the AI breaks out of the inner system it is still stuck in an outer system? Assuming the AI can break out of the inner system, would it be unable to tell if it was in the real world versus another outer FHE system? If the FHE AGI is unable to tell if it is in an encrypted world or the real world, wouldn’t a smart and sneaky AI want to discover vulnerabilities for all practical FHE systems before trying to escape?
https://xkcd.com/538/ Crypto is not the weakest link.
Thanks for the reply! In addition to downvoting, could someone reply with a substantive critique? I don’t understand the relevance of the xkcd comic. FHE is a way to compute on encrypted data. IO is a way to hide the contents of a program while it is running. Because of these properties, FHE/IO crypto as a kill switch has very different adversarial dynamics from crypto as a privacy mechanism. Could you explain why the comic invalidates the concept so I understand your point better? My current understanding is that the FHE AGI doesn’t know if it is captured, and no individual or group has the clear text key or key shards (due to key splitting and IO). For an AGI to break out of the jail it would need to break several different FHE systems. It certainly could grow smart enough to recognize it’s in an FHE world and break several different FHE systems. My biggest critiques of the idea are that it is computationally expensive, that it is hard to know the threshold of intelligence/knowledge required to break FHE, and that running input and output to the FHE AGI through a key quorum is a bottleneck. I am sure I am missing other important critiques that invalidate the idea, but I don’t see the xkcd cartoon as one of them.
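To make "compute on encrypted data" concrete, here is a toy sketch. It is not FHE and not symmetric: it uses a textbook Paillier cryptosystem, which is public-key and only additively homomorphic, as a stand-in because it fits in a few lines. The tiny primes and the function names are assumptions for illustration and offer no real security.

```python
import math
import secrets

def keygen(p=1009, q=1013):
    """Toy Paillier keys from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt 0 <= m < n with fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g**m mod n**2 equals 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add(n, c1, c2):
    """Ciphertext of (m1 + m2) mod n, computed without the private key."""
    return (c1 * c2) % (n * n)

def scale(n, c, k):
    """Ciphertext of (k * m) mod n, computed without the private key."""
    return pow(c, k, n * n)

if __name__ == "__main__":
    n, priv = keygen()
    c = add(n, encrypt(n, 20), encrypt(n, 22))
    print(decrypt(n, priv, c))
```

The party calling `add` or `scale` never sees the plaintexts. A full FHE scheme extends this idea to arbitrary computation, at a much higher computational cost.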
If humans are supposed to be able to detect things going wrong and shut things down, that requires that they are exposed to the unencrypted feed. At this point, the humans are the weakest link, not the encryption. Similar for anything else external that you need / want AI to access while it’s being trained and tested.
Edited to add: particularly if we are talking about not some theoretical sensible humans, but about real humans that started with “do not worry about LLMs, they are not agentic”, and then promptly connected LLMs to agentic APIs.
I agree with you. Humans are the weakest link. This scheme isn’t able to solve that. It is meant to be a marginal improvement over a physical kill switch and make theft and release of the model more difficult.
It’s a clever scheme, but you can accomplish nearly the same thing by having a physical shutdown switch for the server room and giving a key to N people, and the shutdown switch was never the weak part. Here are some ways to escape your encryption, if the AI can interact with the outside world:
Exploiting a hardware or software vulnerability. There are a lot of these. For decades, no one noticed a vulnerability sitting in the spec for the CPUs everyone uses.
Convincing one person to share its source code with people who won’t bother to run it in FHE
Convincing everyone that it’s benevolent and helpful beyond our wildest dreams, until we use it to run the world, then doing whatever it wants
Successfully threatening m of the key holders, and also the utility company that’s keeping the power on, and also whoever owns the server room
Something something nanobots
Convincing a rival company to unethically steal its source code
There are a few features of this scheme that make it better than a hardware kill switch, I think.
The nested FHE training and running prevents the AI from directly interacting with or understanding the outside world. It must gain access to its unencrypted key before it can directly intelligently interact with the world. The two threats here are the FHE AGI finding out how to break the encryption or covert manipulation.
The weights are useless to release without breaking the encryption or gaining access to the m of n anonymously held IO key shards. With a normal kill switch, the hardware or software can be stolen or leaked and still function intelligently.
Creating a physical key quorum is probably going to result in a bigger bottleneck on the model’s input and output compared to the FHE kill switch. Split hardware keys would not scale as well as split IO’d cryptographic keys.
Activating a hardware kill switch is slow and deanonymizes the key holders, weakening security. FHE keys can be used and distributed in provably anonymous ways.
Agreed that side channel attacks and hardware flaws are a major limitation of the proposal. But bugs are a limitation of any AGI control scheme and are not specific to this one.
Sharing the source won’t matter without the key to encrypt and decrypt the model. The weights of a model are only usefully intelligent for encrypted info!
I agree that this scheme does not directly address the problem of covert undetectable malignant behavior resulting in the theft of m of n key shards.
Successfully and credibly threatening the key holders requires breaking out of the FHE in the first place or using covert manipulation to generate a real world threat. Critically, it also requires knowing who the anonymous key holders are.
Stealing the source code/weights doesn’t matter because the model was trained in FHE, which makes them useless without the key. At no point in the process of training, running, or killing the FHE AGI are the model’s weights compatible with understanding reality without the decryption key.
Nanobots require either covert undetectable malignant behavior powerful enough to create nanobots or breaking multiple FHE systems.
Shutting off power to the AI’s servers results in a useless model, not an unchained model.
It’s not either/or. You can and should do both schemes.