I’m a bit confused about how boxing an AI to such an extreme extent would be useful. If we don’t allow any output bits to come out of the ASI, then how do we know if it worked? Why would we want to run it if we can’t see what it does? Or do we only want to limit the output to n bits and prevent any side-channel attacks? I guess the theory then would be that n bits are not enough to destroy the world. Like maybe for n=2^16, it would not be enough to persuade a person to do something that would unbox the AI (but it might).
This seems to be one of the only solutions in which such a proof can exist. If we were to use another solution, like changing the superintelligence’s objectives, then finding such a proof would be extremely hard, or even impossible. However, if we think that we could all die by making a superintelligence, then we should have an unconditional proof of safety.
I don’t think having a formal proof should be an objective in and of itself. Especially if the proof is along the lines of “the superintelligence stays boxed because it can only run encrypted code and can’t communicate with the outside world.”
I’m sorry if this comment sounds overly negative, and please let me know if I am interpreting this post wrong. This work seems quite interesting, even just for computer science/cryptography’s sake (although saving the world would also be nice :)).
Thanks for the comment! I don’t think you are interpreting this post wrong.
Why would we want to run it if we can’t see what it does?
I don’t think that running a superintelligence is ever useful. What I argue is that running it inside a cryptographic box is better than running it outside of the box. The goal of this project is that, if a superintelligence is run, then it doesn’t automatically destroy the world. The goal isn’t to run a superintelligence to do useful tasks.
Like maybe for n=2^16, it would not be enough to persuade a person to do something that would unbox the AI (but it might).
Personally, I think that letting it output 2^16 bits of information is too much. If the superintelligence were to output ASCII text (which would be a very inefficient way for it to unbox itself), then 2^16 bits would correspond to 8192 characters, which seems way more than enough to convince someone to unbox it. Even if the decrypted bits were thoroughly examined by people outside of the box, I expect that n=256 would still be too much. It would be interesting to know how much information can be communicated with that many bits (it makes me think of the Arecibo message, which used 1679 bits).
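For concreteness, here is the arithmetic behind those numbers as a small Python sketch (the plain 8-bit ASCII encoding is just an illustrative assumption):

```python
# Back-of-the-envelope arithmetic for the output budgets mentioned above.
# Assumes a plain 8-bit ASCII encoding, which is only an illustrative choice.

BITS_PER_ASCII_CHAR = 8

for n_bits in (2**16, 256, 1679):  # 2^16 and 256 from the discussion, 1679 = Arecibo message
    chars = n_bits // BITS_PER_ASCII_CHAR
    print(f"n = {n_bits:>5} bits -> about {chars} ASCII characters")

# Prints:
# n = 65536 bits -> about 8192 ASCII characters
# n =   256 bits -> about 32 ASCII characters
# n =  1679 bits -> about 209 ASCII characters
```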
Since I think that running a superintelligence is already too dangerous even without intending to decrypt anything, I wouldn’t approve of a team that tries to build a superintelligence even with n=0. However, I think that planning for n=0 is better than not trying to use a cryptographic box at all.
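To make the role of n a bit more concrete, here is a purely illustrative sketch of an “at most n bits ever leave the box” rule. The names are hypothetical, and a software check like this is only a stand-in: in the actual proposal the guarantee would have to come from the cryptography itself (everything the superintelligence computes stays encrypted), not from a runtime counter.

```python
# Illustrative stand-in for the n-bit output budget, NOT the cryptographic
# construction itself: a gate that refuses to ever release more than n bits.

class OutputBudgetExceeded(Exception):
    """Raised when a release would exceed the lifetime output budget."""

class NBitOutputGate:
    def __init__(self, n_bits: int):
        self.remaining_bits = n_bits  # lifetime budget, never refilled

    def release(self, payload: bytes) -> bytes:
        bits = 8 * len(payload)
        if bits > self.remaining_bits:
            raise OutputBudgetExceeded(
                f"requested {bits} bits, only {self.remaining_bits} remaining"
            )
        self.remaining_bits -= bits
        return payload

# With n = 256, at most 32 bytes can ever be released in total.
gate = NBitOutputGate(n_bits=256)
print(gate.release(b"32 bytes is really not a lot!!!!"))  # consumes the whole budget
# gate.release(b"x")  # any further release would raise OutputBudgetExceeded
```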
I don’t think having a formal proof should be an objective in and of itself.
I think the objective should be to mitigate the risk of extinction from AI. Developing an unconditional proof of safety seems to be one way to reduce this risk, and this is why I decided to work on cryptographic boxes. However, we shouldn’t rely only on unconditional proofs, since they seem possible only for control (preventing a superintelligence from achieving its bad objectives) and nearly impossible for alignment (giving good objectives to a superintelligence). Moreover, working on alignment seems more promising than working on control in the worlds where we have a lot of time until the first superintelligence.
If we don’t allow any output bits to come out of the ASI, then how do we know if it worked?
I think that, if the box had holes, we would realize it pretty soon, because the superintelligence would dramatically change the world around us and probably make Earth uninhabitable in the process.