It’s totally feasible to make a (narrow, well-defined) computer interface which has zero security vulnerabilities. It’s much easier to secure the channel that an interpretability tool passes through than it is to secure the AI itself or its training environment, since the interface has a lot less total complexity. You can’t use memory-unsafe programming languages like C, and you can’t incorporate certain large pieces of software that you might have wanted to use, but it’s not nearly as bad as you seem to think. Tricking the humans into doing dumb things that break the security is still an issue, of course. But no-human-in-the-loop setups, like “shut off the power if a particular plan is detected”, don’t create an escape channel unless they’re badly designed in ways there’s no good reason for them to be.
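To make the “narrow, well-defined interface” point concrete, here is a minimal sketch (in a memory-safe language, and not anyone’s actual setup) of such a no-human-in-the-loop tripwire: the interpretability tool writes a single fixed-size flag to a socket, and the only possible action on the other side is to cut power. The socket path and cut_power() are hypothetical stand-ins.

```rust
use std::io::Read;
use std::os::unix::net::UnixListener;

// Hypothetical path where the interpretability tool reports detections.
const DETECTOR_SOCKET: &str = "/run/plan_detector.sock";

// Stand-in for whatever out-of-band kill mechanism is wired up (e.g. a
// relay controller). Deliberately takes no data from the AI side.
fn cut_power() -> ! {
    eprintln!("tripwire fired: cutting power");
    std::process::exit(0);
}

fn main() -> std::io::Result<()> {
    // The entire attack surface is this loop: one fixed-width field, no
    // parsing of free-form data, and nothing ever written back toward the AI.
    let listener = UnixListener::bind(DETECTOR_SOCKET)?;
    for stream in listener.incoming() {
        let mut flag = [0u8; 1];
        if let Ok(mut s) = stream {
            if s.read_exact(&mut flag).is_ok() && flag[0] == 1 {
                cut_power();
            }
        }
    }
    Ok(())
}
```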
(Homomorphic encryption is not a real thing outside of thought experiments, and will not become a real thing on any relevant time scale.)
no-human-in-the-loop setups, like “shut off the power if a particular plan is detected”
Hmm, “automatized mulligan” is indeed pretty secure, as far as uses of interpretability tools go. Good point.
Of course, what I’m arguing is that it wouldn’t help us either, since a superintelligence would break the tool and ensure the “shut off the power” condition is never triggered.
Homomorphic encryption is not a real thing outside of thought experiments, and will not become a real thing on any relevant time scale
I’m agnostic on that, leaning towards agreement. Just noting that it wouldn’t save us even if it were real, and probably wouldn’t even help us: the scenarios where the AI is advanced enough that nothing short of homomorphic encryption could stop it are probably also scenarios where homomorphic encryption can’t stop it either.