There’s no reasonable situation in which the human would be communicating with the AI through encryption. And just the AI asking to use encryption, or giving away a public key, would be a huge red flag that it is an AI trying to deceive.
Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).
Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).
And just the AI asking to use encryption, or giving away a public key
In this scenario, the AI doesn’t need to specifically “ask to use encryption” (it just does the encryption), and it doesn’t need to give away a public key (it just needs to read Bob’s public key on the internet).
This does make the AI’s task harder, so I suppose you could consider this technique as a form of AI boxing, but, like “ordinary” AI boxing, I only expect it to scale up to controlling AIs of a limited capability level.
I’m not saying the situation is impossible, just really really unlikely. The AI would need to output big binary files like images, and know someone intended to decode them, and somehow get around statistical detection by AI 2 (stenography is detectable since the lowest order bits of an image are not uniformly random.)
You might have a point that it’s probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.
The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. In order to detect a change you need to consider all possible ways in which a distribution could be meaningfully altered, while in order to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the an attacker.
This point doesn’t seem directly relevant though, unless someone offers any good reason to actually include the non-imitation goal, rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)
Actually, in the scheme as you wrote it there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator’s behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But this doesn’t introduce any potentially problematic optimization pressure, so it just seems better.
There’s no reasonable situation in which the human would be communicating with the AI through encryption. And just the AI asking to use encryption, or giving away a public key, would be a huge red flag that it is an AI trying to deceive.
Here’s a scenario that doesn’t seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI’s output (e.g. the lowest-order bits in some images the AI has output).
Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob’s public key, due to reading the internet) encrypts a message with Bob’s public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I’m assuming the AI is good at psychological manipulation).
In this scenario, the AI doesn’t need to specifically “ask to use encryption” (it just does the encryption), and it doesn’t need to give away a public key (it just needs to read Bob’s public key on the internet).
This does make the AI’s task harder, so I suppose you could consider this technique as a form of AI boxing, but, like “ordinary” AI boxing, I only expect it to scale up to controlling AIs of a limited capability level.
I’m not saying the situation is impossible, just really really unlikely. The AI would need to output big binary files like images, and know someone intended to decode them, and somehow get around statistical detection by AI 2 (stenography is detectable since the lowest order bits of an image are not uniformly random.)
You might have a point that it’s probably not best to publish things produced by the AI on the internet. If this is a serious risk, then it could still be done safely with a small group.
The general lesson from steganography is that it is computationally easier to change a distribution in an important way than to detect such a change. In order to detect a change you need to consider all possible ways in which a distribution could be meaningfully altered, while in order to make a change you just have to choose one. From a theory perspective, this is a huge asymmetry that favors the an attacker.
This point doesn’t seem directly relevant though, unless someone offers any good reason to actually include the non-imitation goal, rather than simply imitating the successful human trials. (Though there are more subtle reasons to care about problematic behavior that is neither penalized nor rewarded by your training scheme. It would be nicer to have positive pressure to do only those things you care about. So maybe the point ends up being relevant after all.)
Actually, in the scheme as you wrote it there is literally no reason to include this second goal. The distinguisher is already trying to distinguish the generator’s behavior from [human conditioned on success], so the generator already has to succeed in order to win the game. But this doesn’t introduce any potentially problematic optimization pressure, so it just seems better.