It’s a clever scheme, but you can accomplish nearly the same thing by having a physical shutdown switch for the server room and giving a key to N people, and the shutdown switch was never the weak part. Here are some ways to escape your encryption, if the AI can interact with the outside world:
Exploiting a hardware or software vulnerability. There are a lot of these; a vulnerability sat unnoticed for decades in the spec of the CPUs everyone uses.
Convincing one person to share its source code with people who won't bother to run it in FHE
Convincing everyone that it’s benevolent and helpful beyond our wildest dreams, until we use it to run the world, then doing whatever it wants
Successfully threatening m of the key holders, and also the utility company that’s keeping the power on, and also whoever owns the server room
Something something nanobots
Convincing a rival company to unethically steal its source code
There are a few features of this scheme that I think make it better than a hardware kill switch.
The nested FHE training and running prevents the AI from directly interacting with or understanding the outside world. It must gain access to its decryption key before it can directly and intelligently interact with the world. The two threats here are the FHE AGI figuring out how to break the encryption, and covert manipulation.
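To make the "compute without understanding" point concrete, here is a toy sketch in Python. It uses Paillier encryption, which is only additively homomorphic, so it is far short of the fully homomorphic evaluation the proposal actually needs, but it shows the core property: a party holding only ciphertexts can do useful computation on them, and only the holder of the private key can read the result. The primes are toy-sized and purely illustrative.

```python
# Toy Paillier encryption: additively homomorphic only, nowhere near FHE,
# but it demonstrates computing on ciphertexts without the key.
import math, random

p, q = 293, 433                      # toy primes; real parameters are thousands of bits
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # L(g^lam mod n^2)^-1 mod n

def encrypt(m):
    # c = g^m * r^n mod n^2, for random r coprime to n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 17, 25
ca, cb = encrypt(a), encrypt(b)
c_sum = (ca * cb) % n2               # multiplying ciphertexts adds the plaintexts
assert decrypt(c_sum) == a + b       # only the key holder can read the answer
```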
The weights are useless to release without breaking the encryption or gaining access to the m of n anonymously held IO key shards (sketched below). With a normal kill switch, the stolen or leaked hardware or software remains intelligent and fully usable.
Creating a physical key quorum is probably going to create a bigger bottleneck on the model's input and output than the FHE kill switch would. Split hardware keys would not scale as well as split IO'd cryptographic keys.
Activating a hardware kill switch is slow and deanonymizes the key holders, weakening security. FHE keys can be used and distributed in provably anonymous ways.
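For the m of n key shards, something like Shamir secret sharing is the standard construction: the key is split into n shares, any m of which reconstruct it, while fewer than m reveal essentially nothing. A minimal sketch, with toy parameters chosen purely for illustration:

```python
# Toy m-of-n secret sharing over a prime field (Shamir's scheme).
# Illustration only; a real deployment would use a vetted library.
import random

PRIME = 2**127 - 1  # field modulus, large enough for this toy secret

def split_secret(secret, m, n):
    """Split `secret` into n shares, any m of which reconstruct it."""
    # Random polynomial of degree m-1 with the secret as the constant term.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
    shares = []
    for x in range(1, n + 1):
        y = sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = 0xDEADBEEF  # stand-in for the decryption key
shares = split_secret(key, m=3, n=5)
assert recover_secret(shares[:3]) == key   # any 3 shares suffice
assert recover_secret(shares[:2]) != key   # 2 shares (almost surely) give nothing useful
```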
Agreed that side channel attacks and hardware flaws are a major limitation of the proposal. But bugs are a limitation of any AGI control scheme and are not specific to this one.
Sharing the source won't matter without the key used to encrypt and decrypt the model. The weights are only usefully intelligent when operating on correspondingly encrypted inputs!
I agree that this scheme does not directly address the problem of covert undetectable malignant behavior resulting in the theft of m of n key shards.
To successfully and credibly threaten the key holders requires breaking out of the FHE in the first place, or using covert manipulation to generate a real-world threat. Critically, it also requires knowing who the anonymous key holders are.
Stealing the source code/weights doesn't matter, because the model was trained in FHE, leaving them useless without the key. At no point in the process of training, running, or killing the FHE AGI are the model's weights compatible with understanding reality without the decryption key.
Nanobots require either covert, undetectable malignant behavior powerful enough to create them, or breaking multiple FHE systems.
Shutting off power to the AI's servers results in a useless model, not an unchained model.
It's not either/or. You can and should do both schemes.