Steve, please clarify, because I’ve long wondered: Are you saying you could possibly formally prove that a system like the human brain would always do something safe?
There are two enormous problems here.
a) Defining what arrangement of atoms you’d call “safe”.
b) Predicting what a complex set of neural networks is going to do.
Intuitively, this problem is so difficult as to be a non-starter for effective AGI safety methods (if the AGI is a set of neural networks, as seems highly likely at this point; and probably even if it were the nicest set of complex algorithms you could imagine, because of problem a).
I thought your previous work was far more limited in ambition, making it a potentially useful supplement to alignment.
I’m focused on making sure our infrastructure is safe against AI attacks. This will require that our software not have security holes, that our cryptography not be vulnerable to mathematical attacks, that our hardware not leak signals to adversarial sensors, and that our social mechanisms not be vulnerable to manipulative interactions. I believe the only practical way to get these assurances is to model our systems formally using tools like Lean and to design them so that adversaries in a specified class provably cannot break specified safety criteria (e.g., cannot extract cryptographic keys). Humans can do these tasks today, but they are laborious and usually only attempted when the stakes are high. We are likely to soon have AI systems that can synthesize systems together with verified proofs of safety properties rapidly and cheaply. I believe this will be a game changer for safety, and I’m arguing that we need to prepare for that opportunity. One important piece of infrastructure is the hardware that AI runs on. When we have provable protections for that hardware, we can put hard controls on the compute available to AIs, their ability to replicate, etc.
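To give a feel for what that looks like, here is a minimal Lean 4 sketch of a “don’t leak the key”-flavored property: a toy redaction function together with a machine-checked proof that it can never output the secret it guards. All names are invented for illustration; real specifications and adversary models are of course far richer.

```lean
-- Toy example of a provable safety criterion: a redaction function that can
-- never emit the secret it is guarding. All names here are illustrative.

def redact (secret m : Nat) : Option Nat :=
  if m = secret then none else some m

-- Machine-checked guarantee: no input ever makes `redact` output the secret.
theorem redact_never_leaks (secret m : Nat) :
    redact secret m ≠ some secret := by
  by_cases h : m = secret <;> simp [redact, h]
```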
Okay! Thanks for the clarification. That’s what I got from your paper with Tegmark, but in the more recent writing it sounded like you might be extending the goal to actually verifying safe behavior from the AGI. This is what I was referring to as a potentially useful supplement to alignment. I agree that it’s possible with improved verification methods, given the caveats from this post.
An unaligned AGI could take action outside of all of our infrastructure, so protecting that infrastructure would be a partial solution at best.
If I were an AGI, or controlled one, and wanted to take over, I’d set up my own infrastructure in a hidden location, underground or off-planet, and let the magic of self-replicating manufacturing develop whatever I needed to take over. You’d need some decent robotics to jumpstart this process, but it looks like progress in robotics is speeding up alongside progress in AI.
I think we definitely would like a potentially unsafe AI to be able to generate control actions, code, hardware, or system designs together with proofs that those designs meet specified goals. Our trusted systems can then cheaply and reliably check that proof and, if it passes, safely use the designs or actions from an untrusted AI. I think that’s a hugely important pattern, and it can be extended in all sorts of ways. For example, markets of untrusted agents can still solve problems and take actions that obey desired constraints, etc.
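As a toy illustration of that “untrusted proposer, trusted checker” pattern in Lean (with a deliberately trivial spec I made up for the example): the untrusted side must hand over a value together with a proof term, and the trusted side accepts it only because the kernel checks that proof.

```lean
-- The subtype bundles a "design" (here just a number) with a proof that it
-- meets the spec; Lean's kernel plays the cheap, trusted proof checker.
-- Evenness is a deliberately trivial stand-in for a real safety constraint.

def MeetsSpec (n : Nat) : Prop := n % 2 = 0

-- An untrusted agent must supply both the value and the proof to build this:
def proposal : { n : Nat // MeetsSpec n } := ⟨42, rfl⟩

-- Trusted code can then use the value, relying on `proposal.property`.
#eval proposal.val  -- 42
```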
The issue of unaligned AGI hiding itself is potentially huge! I have end-state designs that would guarantee peace and abundance for humanity, but they require that all AIs operate under a single proven infrastructure. The intermediate period between now and then is when the risk is highest, I think.
And, of course, an adversarial AI will do everything it can to hide and garner resources! One of the great uses of provable hardware is the ability to create controlled privacy. You can have extensive networks of sensors where all parties are convinced by proofs that the sensors won’t transmit information about what they are sensing unless a specified situation is detected. It looks like that kind of technology might allow mutual treaties which meet all parties’ needs but prevent “hidden rogue AIs” buried in the desert. I don’t understand the dynamics very well yet, though.
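Here is a cartoon of that “controlled privacy” idea in Lean, with the names and the trigger condition invented purely for illustration: the sensor’s report function is proven to transmit nothing unless the agreed trigger condition holds.

```lean
-- Cartoon "controlled privacy" sensor: it transmits a reading only when the
-- agreed trigger condition holds, and we prove it stays silent otherwise.
-- The threshold condition is a stand-in for whatever a treaty would specify.

def report (reading threshold : Nat) : Option Nat :=
  if reading > threshold then some reading else none

-- Guarantee all parties can check: below the trigger, nothing is transmitted.
theorem silent_unless_triggered (reading threshold : Nat)
    (h : ¬ reading > threshold) : report reading threshold = none := by
  simp [report, h]
```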