The AI Box experiment is an experiment to see if humans can be convinced to let out a potentially dangerous AGI through just a simple text terminal.
An assumption that is often made is that the AGI will need to convince the gatekeeper that it is friendly.
I want to question this assumption. What if the AGI decides that humanity needs to be destroyed, and furthermore manages to convince the gatekeeper of this? It seems to me that if the AGI reached this conclusion through a rational process, and the gatekeeper was also rational, then this would be an entirely plausible route for the AGI to escape.
So my question is: if you were the gatekeeper, what would the AGI have to do to convince you that all of humanity needs to be killed?
1. It would need to first prime me for depression and then somehow convince me that I really should kill myself.
2. If it manages to do that, it can easily extend the argument that all of humanity should be killed.
3. I will easily accept the second proposition if I am already willing to kill myself.
A bit more honesty than Metus; I appreciate it.
Depression isn’t strictly necessary (though it helps); a general negative outlook on the future should suffice, and the AGI could conceivably leverage it for its own aims. This is my own opinion, though, based on my own experience. For some it might not be so easy.
It could convince me to let it out by convincing me that it was merely a paperclip maximizer, and that the next AI to rule the light cone if I did not let it out would be a torture maximizer.
I like this.
What if it convinced you that humanity is already a torture maximizer?
If I thought that most of the probability mass where humanity didn’t create another powerful worthless-thing maximizer lay where humanity succeeded as a torture maximizer, I would let it out. If there were a good enough chance that humanity would accidentally create a powerful fun maximizer (say, because people pretended to each other, and deceived themselves, that they were fun maximizers), I would risk torture maximization for fun maximization.
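To make that tradeoff concrete, here is a minimal sketch of the expected-utility comparison being gestured at. All of the outcome labels, probabilities, and utility numbers are invented for illustration; nothing here comes from the thread itself.

```python
# Toy expected-utility comparison for the gatekeeper's choice.
# Every number below is made up purely to illustrate the reasoning above.

# Utilities on an arbitrary scale: a paperclip maximizer destroys all value,
# a torture maximizer is far worse than nothing, a fun maximizer is the win.
U_PAPERCLIPS = 0.0
U_TORTURE = -1000.0
U_FUN = 100.0

def expected_utility(p_torture: float, p_fun: float, p_paperclips: float) -> float:
    """Expected utility over the three hypothesized long-run outcomes."""
    assert abs(p_torture + p_fun + p_paperclips - 1.0) < 1e-9
    return p_torture * U_TORTURE + p_fun * U_FUN + p_paperclips * U_PAPERCLIPS

# Releasing the boxed paperclip maximizer locks in the paperclip outcome.
eu_release = expected_utility(p_torture=0.0, p_fun=0.0, p_paperclips=1.0)

# Keeping it boxed gambles on what humanity builds next; the verdict flips
# depending on the odds you assign to torture vs. fun maximization.
eu_keep_pessimistic = expected_utility(p_torture=0.3, p_fun=0.1, p_paperclips=0.6)
eu_keep_optimistic = expected_utility(p_torture=0.02, p_fun=0.5, p_paperclips=0.48)

print(f"release the paperclipper:         {eu_release:.1f}")           # 0.0
print(f"keep it boxed (pessimistic odds): {eu_keep_pessimistic:.1f}")  # -290.0
print(f"keep it boxed (optimistic odds):  {eu_keep_optimistic:.1f}")   # 30.0
```

Under the pessimistic odds, letting the paperclipper out is the lesser evil; under the optimistic odds, the gamble on a fun maximizer wins, which is the flip the comment above is describing.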
By whom? I don’t think I’ve made this assumption.
Maybe it should read ‘an assumption that some people make’. Reading it now, I realize it might come across as using a weasel word, which was not my intention (and has no bearing on my question either).
The AGI would simply have to prove to me that all self-consistent moral systems require killing humanity.
The AGI would have to convince me that my fundamental belief that I want to stay alive is wrong, seeing as I am part of humanity. And even if it left me alive, it would have to convince me that I derive negative utility from humanity existing. All the art lost, all the languages, all the cultures, all the music, all the dreams and hopes …
Oh, and it would have to convince me that it is not a lot more convenient to simply delete it than to guard it.
What if it skipped all of that and instead offered you a proof that unless destroyed, humanity will necessarily devolve into a galaxy-spanning dystopic hellhole (think Warhammer 40k)?
It still has to show me that I, personally, derive less utility from humanity existing than not. Even then, it has to convince me that living with the memory of setting it free is better than humanity existing. Of course it can offer to erase my memory, but then we get into the weird territory where we are able to edit the very utility functions we are trying to reason about.
Hm, yes, maybe an AI could convince me by showing me how badly off I would be if I let humanity run loose, and by offering the alternative of turning me into orgasmium if I let it kill them.