I’m not sure I’m completely solid on how FHE works, so perhaps this won’t work, but here’s an idea of how B can exploit this approach:
Let’s imagine that Check_trustworthy(A_source) = 1. After step 3 of the parent comment, B would know E1 = Encrypt(1, A_key). If Check_trustworthy(A_source) returned 0, B would instead know E0 = Encrypt(0, A_key), and the following steps work similarly. B knows which one it is by looking at msg_3.
B has another program: Check_blackmail(X, source) that simulates behaviour of an agent with the given source code in situation X and returns 1 if it would be blackmailable or 0 if not.
B knows Encrypt(A_source, A_key) and they can compute F(X) = Encrypt(Check_blackmail(X, A_source), A_key) for any X using FHE properties of the encryption scheme.
Let’s define W(X) = if(F(X) = E1, 1, 0). It’s easy to see that W(X) = Check_blackmail(X, A_source), so now B can compute that for any X.
Profit?
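Here is a rough sketch of the comparison trick above, under the assumption that encryption is deterministic. `ToyDeterministicFHE` is a made-up stand-in, not a real FHE library and not real cryptography: it holds A's key so that `evaluate` can simulate homomorphic evaluation, and B is assumed to see only the opaque ciphertext bytes. The bodies of `check_trustworthy` and `check_blackmail` are placeholders too.

```python
import hashlib
import hmac


class ToyDeterministicFHE:
    """Toy stand-in for an FHE scheme (assumption: deterministic encryption)."""

    def __init__(self, key: bytes):
        self._key = key
        self._plaintexts = {}  # ciphertext -> plaintext, simulation bookkeeping only

    def encrypt(self, value) -> bytes:
        # Deterministic: the same plaintext always gives the same ciphertext.
        ct = hmac.new(self._key, repr(value).encode(), hashlib.sha256).digest()
        self._plaintexts[ct] = value
        return ct

    def evaluate(self, func, ct: bytes) -> bytes:
        # Simulates homomorphic evaluation: returns Encrypt(func(plaintext)).
        return self.encrypt(func(self._plaintexts[ct]))


def check_trustworthy(source: str) -> int:
    return 1  # assumed to be 1, as in the comment above


def check_blackmail(situation: str, source: str) -> int:
    # Hypothetical placeholder for "simulate the agent in situation X".
    return 1 if situation in source else 0


scheme = ToyDeterministicFHE(key=b"A_key")
enc_source = scheme.encrypt("agent that gives in to: threat_1")

# Step 3: B holds E1 and learns from msg_3 that it encrypts 1.
e1 = scheme.evaluate(check_trustworthy, enc_source)


def w(situation: str) -> int:
    # F(X) = Encrypt(Check_blackmail(X, A_source), A_key), computed on ciphertext.
    f_x = scheme.evaluate(lambda src: check_blackmail(situation, src), enc_source)
    # W(X): compare ciphertexts byte for byte; no decryption key needed.
    return 1 if f_x == e1 else 0


print(w("threat_1"), w("threat_2"))  # prints: 1 0
```

The whole thing hinges on `f_x == e1` being meaningful, which only holds if equal plaintexts encrypt to equal ciphertexts.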
I think your example won’t work, but it depends on the implementation of FHE. If there’s a nonce involved (which there really should be), then you’ll get different encrypted data for the output of the two programs you run, even though the underlying data is the same.
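To make the nonce point concrete, here is a minimal sketch of a randomized toy "encryption" of a single bit. It is an assumption-laden illustration built from a keyed hash, not an FHE scheme and not something to use for real; the function names are invented for this example.

```python
import hashlib
import hmac
import os


def encrypt_bit(key: bytes, bit: int) -> bytes:
    # A fresh random nonce is mixed in, so repeated encryptions of the same bit differ.
    nonce = os.urandom(16)
    tag = hmac.new(key, nonce + bytes([bit]), hashlib.sha256).digest()
    return nonce + tag


def decrypt_bit(key: bytes, ct: bytes) -> int:
    # Only the key holder can tell which bit a ciphertext carries.
    nonce, tag = ct[:16], ct[16:]
    for bit in (0, 1):
        expected = hmac.new(key, nonce + bytes([bit]), hashlib.sha256).digest()
        if hmac.compare_digest(tag, expected):
            return bit
    raise ValueError("not a valid ciphertext for this key")


key = b"A_key"
c1 = encrypt_bit(key, 1)
c2 = encrypt_bit(key, 1)
print(c1 == c2)                                  # ciphertexts differ
print(decrypt_bit(key, c1), decrypt_bit(key, c2))  # both decrypt to the same bit
```

With a scheme like this, comparing F(X) against E1 tells B nothing, because two encryptions of 1 look unrelated to anyone without the key.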
But you don’t actually need to do that. The protocol lets B exfiltrate one bit of data, whatever bit they like. A doesn’t get to validate the program that B runs, they can only validate the output. So any program that produces 0 or 1 will satisfy A and they’ll even decrypt the output for you.
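A small sketch of that one-bit leak, with the cryptography abstracted away entirely: `run_protocol` is a hypothetical model in which B's program runs on A's source (standing in for evaluation under encryption) and A only checks that the decrypted result is a bit. The two check functions are placeholder predicates.

```python
def check_trustworthy(source: str) -> int:
    # Placeholder for the program A expects B to run.
    return int("cooperate" in source)


def check_blackmail(source: str) -> int:
    # Placeholder for a different one-bit question B might prefer to ask.
    return int("give in" in source)


def run_protocol(program, a_source: str) -> int:
    # B evaluates `program` on Encrypt(A_source); modelled here as a plain call.
    output = program(a_source)
    # A decrypts the result and can validate only its shape, not which program made it.
    if output not in (0, 1):
        raise ValueError("A rejects: output is not a single bit")
    return output  # A hands the decrypted bit back to B


a_source = "cooperate, but give in to threats"
honest = run_protocol(check_trustworthy, a_source)
sneaky = run_protocol(check_blackmail, a_source)
print(honest, sneaky)  # prints: 1 1
```

From A's side both runs look identical: a valid bit came back, and nothing in the exchange says which question it answered.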
That does indeed mean that B can find out whether A is blackmailable, or any other single bit about A they care to compute, so exposing your source code is still risky. What would be really cool would be a way to let A also be sure which program B has run on their source, but I couldn’t think of a way to do this such that both A and B are sure that the program was the one that actually got run.