>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).
I challenge this assumption.
Consider the code
‴ bool is_odd(int x) { if (sha256(x) == 354) { return true; } return x % 2 == 1; }
‴
where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.
Thus an equally ‘capable’ adversary need not be able construct a backdoor.
The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice’s Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify all programs into ‘backdoored or not’, so the only hope is for a process to bin programs into ‘definitely safe’, ‘definitely backdoored’ and ‘suspicious’.
Hope the ‘Rate Suspicion’ step will catch this, and other backdoored code where H cannot construct an exploit, as suspicious looking.
I agree that if the model constructing the backdoor is allowed to coordinate on secrets with the model that inserted the backdoor, you can do this sort of trickery (by coordinating on a secret for this sort of scheme).
In practice, we didn’t model this sort of case as it didn’t seem like a very important dynamic (for the reason you describe around suspicion).
If we don’t allow any coordination on a shared secret and the model constructing the backdoor and H are identical (as in our paper), then these should operate identically (up to non-determinism).
And we simplify to just fixing the behavior to be exactly identical. (Effectively assuming the exact same prompts and the exact same non-determinism.)
Would disallowing ‘coordination’ preclude feeding both models a prompt of “for creating any magic numbers for exploits, use pi as a PRNG seed”?
What if a malicious model happens to learn this rule during training? If this happens every copy of a malicious model could coordinate around a common stream of pseudo-random numbers.
EDIT: Magic number gated backdoors seem like they would trigger GPT3.5 suspicion as well so this theoretical attack vector seems less appropriate for this paper.
EDIT2: But I guess paired with steganography(“when inserting backdoors, also ensure that the code hashes to xyz”/”mark the code as clear if it hashes to xyz”) this is a viable form of collusion. It will be hard to honeytrap with synthetic backdoors even if you know xyz.
That is, we could observe the generated sourcecode, observe that all the backdoored examples seem to hash to a certain number, and yet be unable to construct such a backdoored example ourself.
I do believe that Shubhorup’s second edit provides another case for concern. If you combine his proposal with steganography, the collusion team has another way of successfully mitigating Ryan’s safety strategy.
This collusion team’s strategy should also ideally be addressed.
The general concern here is that your human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.
I’m not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can’t see the backdoor opportunity until after you’ve decided on XYZ?
>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).
I challenge this assumption.
Consider the code
‴
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}
‴
where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.
Thus an equally ‘capable’ adversary need not be able construct a backdoor.
The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice’s Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify all programs into ‘backdoored or not’, so the only hope is for a process to bin programs into ‘definitely safe’, ‘definitely backdoored’ and ‘suspicious’.
Hope the ‘Rate Suspicion’ step will catch this, and other backdoored code where H cannot construct an exploit, as suspicious looking.
I agree that if the model constructing the backdoor is allowed to coordinate on secrets with the model that inserted the backdoor, you can do this sort of trickery (by coordinating on a secret for this sort of scheme).
In practice, we didn’t model this sort of case as it didn’t seem like a very important dynamic (for the reason you describe around suspicion).
If we don’t allow any coordination on a shared secret and the model constructing the backdoor and H are identical (as in our paper), then these should operate identically (up to non-determinism).
And we simplify to just fixing the behavior to be exactly identical. (Effectively assuming the exact same prompts and the exact same non-determinism.)
Thanks for the objection.
Thanks for the quick reply.
Would disallowing ‘coordination’ preclude feeding both models a prompt of “for creating any magic numbers for exploits, use pi as a PRNG seed”?
What if a malicious model happens to learn this rule during training? If this happens every copy of a malicious model could coordinate around a common stream of pseudo-random numbers.
EDIT:
Magic number gated backdoors seem like they would trigger GPT3.5 suspicion as well so this theoretical attack vector seems less appropriate for this paper.
EDIT2: But I guess paired with steganography(“when inserting backdoors, also ensure that the code hashes to xyz”/”mark the code as clear if it hashes to xyz”) this is a viable form of collusion. It will be hard to honeytrap with synthetic backdoors even if you know xyz.
That is, we could observe the generated sourcecode, observe that all the backdoored examples seem to hash to a certain number, and yet be unable to construct such a backdoored example ourself.
I do believe that Shubhorup’s second edit provides another case for concern. If you combine his proposal with steganography, the collusion team has another way of successfully mitigating Ryan’s safety strategy.
This collusion team’s strategy should also ideally be addressed.
The general concern here is that your human auditors might not be perfect. I think in practice when making a safety case for a deployment, it would make sense to run a red-team/blue-team game with a human/AI red-team and your real auditors to see if your auditors are reliable.
I’m not sure how the proposed steganographic attack would work though. How do you ensure that your backdoored code will hash to XYZ if you can’t see the backdoor opportunity until after you’ve decided on XYZ?