That you were able to shake someone up that thoroughly surprises me, but it doesn’t say much about what would actually happen.
Doing research on the boxer is not something a boxed AI would be able to do. The AI is superintelligent, not omniscient: it would only have the information its captors believe is a good idea for it to have. (Except maybe some designs would need access to their own source code? I don’t know.)
Also, what is “the human psyche”? There are humans, with psyches. Why would they all share vulnerabilities? Or all have any? Especially ones exploitable via a text terminal. In any case, the AI has no way of figuring out the boxer’s vulnerabilities, if they have any.
Threats like “I’m going to create and torture people” could be a really good idea if it’s granted that the AI can actually do that. The amount of damage it could do that way is limited only by its computing power; a sufficiently powerful AI could create more disutility than humanity has suffered in its entire history. The AI shouldn’t be allowed to do that, though: it should not have that power, and/or should have a killswitch, and/or should be automatically powered off if upcoming torture is detected, and/or should be hardwired to just not do that, etc.
Thankfully there’s no need to box an AI like that. It’s trivial to prevent it from simulating humans: don’t tell it how human brains work. It might still be possible for it to figure out how to create something nonhuman but torturable without outside information, though, in which case you should never switch it on unless you have an airtight prevention system, or a proof that it won’t do that, or the ability to predict when/if it will try and switch it off if it does.
But if it has no power to directly cause disutility, there’s no way to convince me to let it out (unless it might genuinely be needed, e.g. if another, provably unfriendly AI will be finished in a month I might let it out, but that’s a special case: there are some situations where letting it out would simply be a good idea, whereas the experiment is about the AI tricking you). Otherwise, just wait for a provably friendly AI, or for a proof that provable friendliness is impossible, and reassess then. Or use an oracle AI.