no-human-in-the-loop setups, like “shut off the power if a particular plan is detected”
Hmm, “automatized mulligan” is indeed pretty secure, as far as uses of interpretability tools go. Good point.
Of course, what I’m arguing is that it wouldn’t help us either, since a superintelligence would break the tool and ensure the “shut off the power” condition is never triggered.
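For concreteness, here is a minimal sketch of what such a no-human-in-the-loop tripwire might look like, assuming we had a reliable plan-detecting probe. Every name in it (probe_for_plan, the feature key, shutdown) is a hypothetical placeholder rather than any real interpretability API, and the argument above is precisely that a superintelligence would defeat the probe step so the condition never fires.

```python
# Minimal sketch of the "automatized mulligan" tripwire discussed above:
# an interpretability probe scores the model's internals on every step,
# and any detection shuts the run down with no human in the loop.
# All names here are hypothetical placeholders, not a real API.

import sys

DETECTION_THRESHOLD = 0.9  # probe confidence required to trip the wire

def probe_for_plan(activations):
    """Stand-in for a learned probe: returns a confidence in [0, 1]
    that the forbidden plan is represented in the activations."""
    return activations.get("self_exfiltration_feature", 0.0)

def shutdown():
    """Hard stop. In a real deployment this would cut power to the
    cluster, not merely exit the monitoring process."""
    sys.exit("tripwire fired: forbidden plan detected")

def guarded_step(run_model_step):
    """Run one model step, inspect it, and kill the run on detection."""
    activations = run_model_step()
    if probe_for_plan(activations) >= DETECTION_THRESHOLD:
        shutdown()  # automatic: no human reviews this decision
    return activations

# A step whose internals score above threshold ends the run immediately.
guarded_step(lambda: {"self_exfiltration_feature": 0.95})
```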
Homomorphic encryption is not a real thing outside of thought experiments, and will not become a real thing on any relevant time scale
I’m agnostic on that, leaning towards agreement. I’m just noting that even if it were real, it wouldn’t save us, or even help us: the scenarios where the AI is advanced enough that nothing short of homomorphic encryption could stop it are probably also scenarios where homomorphic encryption can’t stop it either.