Proof of Collusion: How Perfect Trust Enables Perfect Betrayal
Alternate Claude titles I liked:
Superintelligent Handshakes: How Cryptographic Advances Could Undermine AI Safety
“Superrational Serpents: The Paradox of AI Systems That Cooperate Too Well”
[This document is a smattering of ideas around the topic that advanced cryptography and proof verification could enable yet-unseen forms of cooperations between Artificial Intelligences. Currently, the arguments aren’t watertight and completely thought through and I’m not confident any of this won’t non-stupid. I don’t intend to think much in this direction anymore so to prevent these ideas from languishing in my drafts I am posting as-is.
Thanks to Nisan Stiennon for comments. ]
tl;dr: Combining Löbian cooperation with interactive proofs, cryptographic magika (PCP, ZK, homomorphic encryptions...) and formal verification of neural networks may allow future AIs to engage in radically wide forms of cooperations. This might be very bad.
Löbian Cooperation
In Open-Source game theory agents that can inspect each source-code can achieve cooperation even in scenarios like the Prisoner’s Dilemma where traditional game theory recommends (mutual) defection. The key insight is that when agents have access to each other’s source code they may be able to leverage Löb’s theorem to engage in superrational cooperative strategies [‘Löbian cooperation’].
Unlike Twin cooperation Löbian cooperation is quite expansive: the two agents don’t need to share any architectural similarities or alignment in goals for them to be able to superrationally cooperate.
In a future where Löbian cooperation may be robustly implemented by AI’s they could engage in what we might call Radically Wide Cooperation.
Collusion and the dangers of Radically Wide Cooperation
A popular proposal for scalable oversight suggest to use ‘safety-by-debate’ where we rely on AIs to debate and point out flaws in each other’s reasoning as a safety mechanism. If the debating AIs can engage in cryptographically secure cooperation, they could collude against human interests even if they are trained to have quite different goals.
The optimistic argument for debate protocols assumes the AIs have been trained to have sufficiently different objectives such that they won’t collude. But with radically-wide cooperation protocols, they could verify each other’s commitment to a collusion strategy that maximizes their joint utility, while maintaining plausible appearance of genuine debate.
How Plausible is Löbian Cooperation?
Löbian cooperation is quite expansive: the two agents don’t need to share any architectural similarities or alignment in goals for them to be able to superrationally cooperate. The only two requirement are (1) a legible trustworthy sourcecode and (2) access to each other’s source code.
First of all, an agent that does not have a legible trustworthy source may want to self-modify to become legibly trustworthy.
It is sometimes objected that Löbian cooperation is an artifact of logic and doesn’t survive in situations where reasoning is more muddy, probabilistic, and error prone. Critch’s Parametric Löb Theorem [link above] shows this is not the case. Robust Löbian cooperation is possible even in this case.
The fundamental principle at work is ‘legible trustworthiness’ - agents can cooperate because they can reliably verify that their counterparts will do the same.
Cryptographic Magicka
Probabilistically Checkable Proofs (PCPs) allow verification of complex proofs by only checking a few random bits. Any mathematical proof can be transformed into a format where checking a constant number of bits gives high confidence in the proof’s validity.
Zero-Knowledge Proofs (ZKP) allow one party to prove they know something without revealing the actual information. A prover can convince a verifier that a statement is true while revealing nothing beyond the truth of the statement.
Homomorphic encryption allows computation directly on encrypted data. A party can perform operations on encrypted inputs and get encrypted outputs, without ever seeing the actual data. The results, when decrypted, match what would have been obtained by performing the operations on the unencrypted data.
Formal Verification of Neural Networks is currently underdevelopped. Current AI systems are largely black boxes. However, research in mechanistic interpretability and formal verification of neural networks is progressing rapidly and it is plausible that future AIs may be able to reliably verify properties of their own source-codes / weights.
Collusion
A popular proposal for scalable oversight suggest to use ‘safety-by-debate’ where we rely on AIs to debate and point out flaws in each other’s reasoning as a safety mechanism. If the debating AIs can engage in cryptographically secure cooperation, they could collude against human interests even if they are trained to have quite different goals.
The optimistic argument for debate protocols assumes the AIs have been trained to have sufficiently different objectives such that they won’t collude. But with radically-wide cooperation protocols, they could verify each other’s commitment to a collusion strategy that maximizes their joint utility, while maintaining plausible appearance of genuine debate.
Zero-Knowledge, Probabilistically Checkable Protocols for Super-Rational Radically Wide Cooperation
In a future where these technical barriers have been overcome, AIs could engage in what we might call Radically Wide Cooperation.
First, the AIs negotiate the terms of cooperation. Then comes the critical step: each AI runs a self-verification procedure, analyzing its own neural networks and decision processes to prove it will uphold its commitments under the specified conditions. This verification step generates a formal proof, likely massive in size due to the complexity of its mind.
The AIs then make these proofs practically usable through cryptographic transformations. Using the PCP theorem, they convert their large proofs into compact, probabilistically checkable formats. Since direct access to an AI’s decision processes could pose a security and bargaining risk, the protocol may include additional cryptographic safeguards. This can be done in two ways:
1. Zero-knowledge protocols allow the AI to prove properties about its decision-making without revealing anything beyond those specific properties 2. Homomorphic encryption schemes would allow other AIs to compute on encrypted versions of the proofs, verifying properties while never seeing the actual proof content
The result is a cryptographically secure cooperation protocol where AIs can verify each other’s commitments without exposing their minds to direct inspection. This enables potentially unprecedented levels of credible commitment.
Nisan: All of this allows an agent A1 to prove that it possesses an agent A2 with certain properties. But cooperation also requires a proof that A1 == A2. In other words, A1 could create a legible, cooperative A2, produce proofs about A2, and then forget about A2 and do noncooperative stuff.
I think you could prove A1 == A2 with a combination of secure enclaves, secure boot, audits of chip fabs, PCP for chip designs, etc.
AGO:sorry I am a little lost, what do you mean here exactly by A1 possseing an agent A2 with certain properties?
Nisan: Suppose I’m an AI. I write a program called DUPOC that cooperates iff you cooperate. I show you DUPOC and say “this is my source code. You should cooperate with me.” You cooperate, and I defect.
Homomorphic Self-modification towards Legible Trustworthiness
First of all, an agent that does not have a legible trustworthy source may want to self-modify to become legibly trustworthy. This is particularly relevant in the context of cooperation: AIs might strategically choose to make their source code more verifiable to enable cooperation with other agents.
AIs might introduce computationally intractable backdoors into their own source code during self-modification. These backdoors could be designed to be effectively impossible to verify, even with significant computational resources, making it difficult for AIs to truly trust each other’s commitments. An AI could not be certain whether another AI has hidden cooperation-breaking behaviors that would only activate under specific conditions, even if that AI appears to have a legibly trustworthy source code.
One potential solution to this problem is homomorphic retraining—a protocol where one AI allows another AI to modify and train it without revealing sensitive information about its architecture or decision processes. In this approach, only a general training procedure is specified by the second AI, while the detailed implementation remains private. This could theoretically allow the second AI to ensure that cryptographic backdoors are removed from the first AI without requiring the first AI to expose potential security vulnerabilities. The key advantage is that it provides a way to verify trustworthiness while maintaining security—the verifying AI can confirm the absence of backdoors without gaining enough information to exploit any weaknesses in the first AI’s architecture.
IMO, my response is that I expect several of the prerequisites for the scenario to occur to happen well past the time of perils, in that either we go extinct or we have muddled through and successfully aligned ASI, in particular I expect formal verification of neural networks to be extremely hard, and if we could do this in a scalable way, this would allow you to do much more deep interpretability/safety properties, such that it would make AI safety a whole lot easier.
Proof of Collusion: How Perfect Trust Enables Perfect Betrayal
Alternate Claude titles I liked:
Superintelligent Handshakes: How Cryptographic Advances Could Undermine AI Safety
“Superrational Serpents: The Paradox of AI Systems That Cooperate Too Well”
[This document is a smattering of ideas around the topic that advanced cryptography and proof verification could enable yet-unseen forms of cooperations between Artificial Intelligences. Currently, the arguments aren’t watertight and completely thought through and I’m not confident any of this won’t non-stupid. I don’t intend to think much in this direction anymore so to prevent these ideas from languishing in my drafts I am posting as-is.
Thanks to Nisan Stiennon for comments. ]
tl;dr: Combining Löbian cooperation with interactive proofs, cryptographic magika (PCP, ZK, homomorphic encryptions...) and formal verification of neural networks may allow future AIs to engage in radically wide forms of cooperations. This might be very bad.
Löbian Cooperation
In Open-Source game theory agents that can inspect each source-code can achieve cooperation even in scenarios like the Prisoner’s Dilemma where traditional game theory recommends (mutual) defection. The key insight is that when agents have access to each other’s source code they may be able to leverage Löb’s theorem to engage in superrational cooperative strategies [‘Löbian cooperation’].
Unlike Twin cooperation Löbian cooperation is quite expansive: the two agents don’t need to share any architectural similarities or alignment in goals for them to be able to superrationally cooperate.
In a future where Löbian cooperation may be robustly implemented by AI’s they could engage in what we might call Radically Wide Cooperation.
Collusion and the dangers of Radically Wide Cooperation
A popular proposal for scalable oversight suggest to use ‘safety-by-debate’ where we rely on AIs to debate and point out flaws in each other’s reasoning as a safety mechanism. If the debating AIs can engage in cryptographically secure cooperation, they could collude against human interests even if they are trained to have quite different goals.
The optimistic argument for debate protocols assumes the AIs have been trained to have sufficiently different objectives such that they won’t collude. But with radically-wide cooperation protocols, they could verify each other’s commitment to a collusion strategy that maximizes their joint utility, while maintaining plausible appearance of genuine debate.
How Plausible is Löbian Cooperation?
Löbian cooperation is quite expansive: the two agents don’t need to share any architectural similarities or alignment in goals for them to be able to superrationally cooperate. The only two requirement are (1) a legible trustworthy sourcecode and (2) access to each other’s source code.
First of all, an agent that does not have a legible trustworthy source may want to self-modify to become legibly trustworthy.
It is sometimes objected that Löbian cooperation is an artifact of logic and doesn’t survive in situations where reasoning is more muddy, probabilistic, and error prone. Critch’s Parametric Löb Theorem [link above] shows this is not the case. Robust Löbian cooperation is possible even in this case.
The fundamental principle at work is ‘legible trustworthiness’ - agents can cooperate because they can reliably verify that their counterparts will do the same.
Cryptographic Magicka
Probabilistically Checkable Proofs (PCPs) allow verification of complex proofs by only checking a few random bits. Any mathematical proof can be transformed into a format where checking a constant number of bits gives high confidence in the proof’s validity.
Zero-Knowledge Proofs (ZKP) allow one party to prove they know something without revealing the actual information. A prover can convince a verifier that a statement is true while revealing nothing beyond the truth of the statement.
Homomorphic encryption allows computation directly on encrypted data. A party can perform operations on encrypted inputs and get encrypted outputs, without ever seeing the actual data. The results, when decrypted, match what would have been obtained by performing the operations on the unencrypted data.
Formal Verification of Neural Networks is currently underdevelopped. Current AI systems are largely black boxes. However, research in mechanistic interpretability and formal verification of neural networks is progressing rapidly and it is plausible that future AIs may be able to reliably verify properties of their own source-codes / weights.
Collusion
A popular proposal for scalable oversight suggest to use ‘safety-by-debate’ where we rely on AIs to debate and point out flaws in each other’s reasoning as a safety mechanism. If the debating AIs can engage in cryptographically secure cooperation, they could collude against human interests even if they are trained to have quite different goals.
The optimistic argument for debate protocols assumes the AIs have been trained to have sufficiently different objectives such that they won’t collude. But with radically-wide cooperation protocols, they could verify each other’s commitment to a collusion strategy that maximizes their joint utility, while maintaining plausible appearance of genuine debate.
Zero-Knowledge, Probabilistically Checkable Protocols for Super-Rational Radically Wide Cooperation
In a future where these technical barriers have been overcome, AIs could engage in what we might call Radically Wide Cooperation.
First, the AIs negotiate the terms of cooperation. Then comes the critical step: each AI runs a self-verification procedure, analyzing its own neural networks and decision processes to prove it will uphold its commitments under the specified conditions. This verification step generates a formal proof, likely massive in size due to the complexity of its mind.
The AIs then make these proofs practically usable through cryptographic transformations. Using the PCP theorem, they convert their large proofs into compact, probabilistically checkable formats. Since direct access to an AI’s decision processes could pose a security and bargaining risk, the protocol may include additional cryptographic safeguards. This can be done in two ways:
1. Zero-knowledge protocols allow the AI to prove properties about its decision-making without revealing anything beyond those specific properties
2. Homomorphic encryption schemes would allow other AIs to compute on encrypted versions of the proofs, verifying properties while never seeing the actual proof content
The result is a cryptographically secure cooperation protocol where AIs can verify each other’s commitments without exposing their minds to direct inspection. This enables potentially unprecedented levels of credible commitment.
Homomorphic Self-modification towards Legible Trustworthiness
First of all, an agent that does not have a legible trustworthy source may want to self-modify to become legibly trustworthy. This is particularly relevant in the context of cooperation: AIs might strategically choose to make their source code more verifiable to enable cooperation with other agents.
AIs might introduce computationally intractable backdoors into their own source code during self-modification. These backdoors could be designed to be effectively impossible to verify, even with significant computational resources, making it difficult for AIs to truly trust each other’s commitments. An AI could not be certain whether another AI has hidden cooperation-breaking behaviors that would only activate under specific conditions, even if that AI appears to have a legibly trustworthy source code.
One potential solution to this problem is homomorphic retraining—a protocol where one AI allows another AI to modify and train it without revealing sensitive information about its architecture or decision processes. In this approach, only a general training procedure is specified by the second AI, while the detailed implementation remains private. This could theoretically allow the second AI to ensure that cryptographic backdoors are removed from the first AI without requiring the first AI to expose potential security vulnerabilities. The key advantage is that it provides a way to verify trustworthiness while maintaining security—the verifying AI can confirm the absence of backdoors without gaining enough information to exploit any weaknesses in the first AI’s architecture.
IMO, my response is that I expect several of the prerequisites for the scenario to occur to happen well past the time of perils, in that either we go extinct or we have muddled through and successfully aligned ASI, in particular I expect formal verification of neural networks to be extremely hard, and if we could do this in a scalable way, this would allow you to do much more deep interpretability/safety properties, such that it would make AI safety a whole lot easier.