The Pragmatic Side of Cryptographically Boxing AI

All of the previous discussions and literature I was able to find on the topic of cryptographic containment of powerful AI, or AI boxing, primarily discuss or argue the following points:

  • Certain AI boxing schemes could, in principle, solve alignment.

  • Even if AI boxing is implemented, social engineering would likely be the primary failure mode[1].

  • AI boxing might be necessary as progress on alignment lags behind AI capabilities[2].

  • Cryptographic AI boxing is generally computationally expensive.

However, little attempt has been made to measure the magnitude of the alignment tax associated with implementing cryptographic containment for AI, and to use that estimate to judge the feasibility of such an implementation. In this post, I argue that we are highly unlikely to see cryptographic containment of powerful AI in the coming years, if ever. This argument can be summarised as follows:

The computational overhead inherent to the cryptographic techniques required to contain a powerful AI is so enormous as to make the implementation infeasible, even when accounting for breakthroughs in hardware acceleration.

Additionally, I go over the general economic disincentives that make cryptographic containment of powerful AI even less likely to happen. Finally, assuming for the sake of argument that computational overhead is not an issue, I discuss the complex and troublesome centralisation of power that successful containment would lead to.

I tried to keep this post as brief as possible, and broke down the above argument into 4 parts:

  1. I define cryptographic AI boxing and give examples of how it could be useful, and potentially necessary in certain scenarios.

  2. I estimate the alignment tax of cryptographic boxing and argue that, relative to the benefits it brings, it is huge.

  3. I go over the (dis)incentives associated with constraining the accessibility/​deployment of SOTA AI.

  4. I talk about the caveat of successful implementation of AI containment—centralisation of power. I also propose a scheme that could address this problem using Secure Multi-Party Computation (MPC), though it exacerbates the implementation challenges already present in classic (centralised) containment schemes.

What do I mean by cryptographic AI boxing?

AI boxing is typically defined as the effort to ensure safe communication with an AI by allowing only the bare minimum amount of information to pass through a specific channel to a human liaison. As I show below, AI boxing can also be used to enclose an AI until certain (safety) requirements are met. This definition generally implies the removal of all potential information leaks that could be exploited by a malicious AI[3].

This level of precaution is likely unnecessary for any AI with intellectual capability on par with or below that of an average human, but it becomes progressively more reasonable as AI capabilities reach and surpass the human level and the risks from potential undesired behaviour increase. For an arbitrarily intelligent AI, or simply Superintelligence (SI), it becomes necessary to limit all possible communication channels to the world outside of the training and testing space until proof of alignment, or some form of safety guarantee, is obtained.

For example, one way in which this could be done is to encrypt the entire computation performed during the AI development process using Fully Homomorphic Encryption (FHE). FHE allows an arbitrary computation to be performed on encrypted data, which in this case would be the process of training, aligning and evaluating the AI. We could set up the process in such a way that we only receive decrypted information if given safety requirements are met. For example, if the evaluations are passed, the model gets decrypted; otherwise, the model is deleted and we receive a message informing us of the failure to align the AI (the output 0 in the example diagram below).

Example implementation of FHE as an additional layer of security

Of course, this is merely an example and the exact details of implementation are subject to arbitrary configuration.
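To make the flow above more concrete, here is a minimal sketch of the conditional-decryption logic. It is purely illustrative: the "FHE" functions are mocked with plain Python stand-ins so the control flow actually runs, and none of the names refer to a real library API.

```python
# Illustrative sketch only. The cryptography is mocked with identity functions
# so the control flow runs; in reality each call would be a (very expensive)
# operation from a real FHE library.

def fhe_keygen():                    # mock key generation
    return "public-key", "secret-key"

def encrypt(public_key, value):      # mock: a real scheme would return a ciphertext
    return value

def decrypt(secret_key, value):      # mock: a real scheme would require the secret key
    return value

def run_encrypted(public_key, fn, *enc_args):
    return fn(*enc_args)             # mock homomorphic evaluation of fn

def destroy(ciphertext):             # mock deletion of the encrypted model
    del ciphertext

def boxed_training_run(pipeline, training_data, eval_suite):
    public_key, secret_key = fhe_keygen()
    enc_data = encrypt(public_key, training_data)
    enc_evals = encrypt(public_key, eval_suite)

    # The whole train->align->eval pipeline is evaluated "under encryption";
    # its output is an encrypted pass/fail flag plus the encrypted model.
    enc_passed, enc_model = run_encrypted(public_key, pipeline, enc_data, enc_evals)

    if decrypt(secret_key, enc_passed):        # only the flag is decrypted first
        return decrypt(secret_key, enc_model)  # evals passed: release the model
    destroy(enc_model)                         # evals failed: ciphertext is deleted
    return 0                                   # the failure signal from the diagram

# Toy usage: a "pipeline" that always fails its evals returns the 0 signal.
print(boxed_training_run(lambda d, e: (False, "model-weights"), "data", "evals"))
```

In a real deployment, the conditional release would itself have to be enforced by whoever holds the secret key; the threshold-decryption idea discussed towards the end of the post is one way to avoid placing that trust in a single party.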

In the setup above, we have no guarantee that the AI didn’t pass the tests/​evals by being deceptively aligned. This issue could be tackled by imposing a more demanding or ambitious scheme: for example, we could require the AI to generate an aligned copy of itself and then delete itself[4]. This scheme could be implemented in combination with the process outlined in the GS AI agenda, where we define a safety specification and a world model for a given AI, and use a verifier to confirm that the safety specification is met by the AI under that world model. A sequence diagram of this scheme would look as follows:

Assuming that we can define the safety specification and the world model, and have built a verifier, this would, in principle, solve the alignment problem. The nuanced argument about the feasibility of the GS AI agenda is outside the scope of this post, so I will focus only on the shortcomings of performing encrypted computation at this scale.
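Purely as a sketch again, the verifier-based variant changes what runs inside the encrypted computation: instead of "train and evaluate", it is "ask the boxed AI to produce a successor, then check that successor against the safety specification and world model". Every name below is a hypothetical placeholder, with the cryptography and the verifier mocked out so the control flow runs; nothing here refers to real GS AI tooling.

```python
# Hypothetical sketch of the "aligned successor + verifier" variant.

def run_encrypted(fn, *args):
    """Mock homomorphic evaluation: a real scheme would run this under FHE."""
    return fn(*args)

def verifier_check(successor, safety_spec, world_model):
    """Mock verifier: checks the successor against the spec under the world model."""
    return False  # placeholder verdict; a real verifier would return a proof/verdict

def propose_and_verify(boxed_ai, safety_spec, world_model):
    successor = boxed_ai()  # the boxed AI produces an aligned copy of itself
    ok = verifier_check(successor, safety_spec, world_model)
    return successor, ok    # the original AI is deleted either way

def boxed_successor_run(boxed_ai, safety_spec, world_model):
    successor, ok = run_encrypted(propose_and_verify, boxed_ai, safety_spec, world_model)
    if ok:
        return successor    # in the real scheme: the successor is decrypted and released
    return None             # verification failed: nothing leaves the box

# Toy usage: with the mock verifier always returning False, nothing is released.
print(boxed_successor_run(lambda: "successor-weights", "safety-spec", "world-model"))
```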

  2. The alignment tax of cryptographic boxing is huge

Cryptographic containment of AI would necessarily use FHE, which is ridiculously expensive to use, so much so that it hasn’t even reached commercial viability yet[5], let alone become applicable to encrypting the training and evaluation process of SOTA AI.

The computational overhead imposed by FHE ranges from thousands to millions of times that of plaintext operations. Yes, that is between 1,000x and 1,000,000x computational overhead compared to standard, non-encrypted operations. For example, multiplying encrypted 32-bit integers using state-of-the-art FHE can take about 9 seconds, compared to the near-instantaneous equivalent unencrypted operation. This overhead stems from the high time complexity of FHE, which can be estimated to be close to O(λ^4 * 2^L)[6], where λ is the security parameter and L is the circuit depth. The exact (in)efficiency varies widely between specific implementations, hence the wide range. In the context of training SOTA AI, L (the circuit depth) would be the number of sequential operations or layers that need to be processed during training. In my view, expecting FHE to be applied to one of the most computationally intensive endeavours of humanity is unrealistic, to say the least.
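To put that range in perspective, here is a back-of-envelope calculation. The plaintext figures are my own assumptions (rough orders of magnitude for a frontier-scale training run), not numbers from any specific lab; the overhead factors are the 1,000x and 1,000,000x bounds quoted above.

```python
# Back-of-envelope only. The training cost and hardware figures below are
# assumed orders of magnitude, not measurements.

plaintext_flop = 1e25        # assumed total FLOP for a frontier-scale training run
accelerator_flops = 1e15     # assumed effective FLOP/s per modern accelerator
fleet_size = 20_000          # assumed number of accelerators

plaintext_days = plaintext_flop / (accelerator_flops * fleet_size) / 86_400
print(f"plaintext run: ~{plaintext_days:.0f} days")        # roughly a week

for overhead in (1e3, 1e6):
    days = plaintext_days * overhead
    print(f"{overhead:,.0f}x FHE overhead: ~{days:,.0f} days (~{days / 365:,.0f} years)")
```

Under these assumptions, a run that takes about a week in plaintext takes on the order of 16 years at 1,000x overhead and about 16,000 years at 1,000,000x, before even accounting for the fact that FHE would not be running on AI-optimised accelerators in the first place.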

Hardware acceleration

A question that immediately comes to mind when we hear “1,000,000x computational overhead” is: What about hardware acceleration? Can’t we use something like ASICs?

Unfortunately, hardware acceleration, at least in the case of FHE, is not a silver bullet. The overhead is caused not by the inefficiency of standard processors at performing operations on encrypted data, but by the substantial increase in the number of operations performed and the amount of data to be processed. Encryption inherently leads to at least two to three orders of magnitude more bits to process, so even with many breakthroughs in hardware design, a meaningful reduction of this overhead is highly improbable. While extremely clever solutions might emerge, I would assign a low probability to a significant reduction of the computational overhead through hardware acceleration in the next few years.

If some clever breakthrough occurs and we can successfully manufacture specialised hardware for FHE, it could potentially reduce the overhead to as little as the 1x-to-10x range. Unfortunately, even in this highly optimistic scenario, it would take a considerable amount of time for that hardware to reach the production scale required to apply it to the problem of boxing SOTA AI.

Additionally, at best, hardware-accelerated FHE computation might match the baseline of plaintext computation on general-purpose hardware. However, since SOTA AI is trained on TPUs and specialised GPUs, there would still be a wide gap in hardware efficiency. Given that we can anticipate continual improvements in hardware optimised for AI training, we can expect this efficiency gap to only widen with time.

In summary, while hardware acceleration might reduce some of the computational overhead associated with FHE, it is unlikely to bridge the efficiency gap significantly enough to make cryptographic boxing practically applicable.

  3. The incentives against boxing AI

Firstly, I would argue that most of the investment into the development of capabilities is primarily motivated by the eventual deployment or usage of these capabilities. Containment or limitation of deployment directly opposes the very motivation for developing these capabilities in the first place.

Additionally, the costs associated with the development of new SOTA AI models are enormous, causing heavy pressure to optimise this process. The dollar-to-FLOP ratio becomes one of the key metrics AI labs are concerned with, and as discussed in the previous section, the alignment tax associated with cryptographic boxing is large enough to make it practically infeasible to implement, let alone something a lab would consider paying for out of its safety budget.

If, through some magical breakthrough, we could bring the computational overhead of FHE down to ~10x, which would be a massive improvement on the current 1,000x-to-1,000,000x range, it would mean that for AI labs to pay the alignment tax of boxing, their safety compute budgets would need to be roughly 10 times their capabilities compute budgets. This, once again, is highly improbable.
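As a toy illustration of what that would mean in practice (the baseline figure is an assumed round number, not any lab’s actual budget):

```python
# Toy illustration; the capabilities compute budget is an assumed round number.
capabilities_compute = 1e9               # assumed: $1B of compute for one frontier run
fhe_overhead = 10                        # the optimistic 10x figure from above

boxed_run_cost = capabilities_compute * fhe_overhead
extra_safety_spend = boxed_run_cost - capabilities_compute
print(f"boxed run: ${boxed_run_cost:,.0f}")               # $10,000,000,000
print(f"extra safety spend: ${extra_safety_spend:,.0f}")  # $9,000,000,000
```

In other words, the boxed version of a $1B training run costs $10B, and essentially all of the extra $9B would have to be justified as safety spending.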

  4. Centralisation of power: What does success even look like?

Let’s assume that we somehow solve the computational overhead problem, whether through hardware acceleration, bigger budgets or some magical breakthrough. What would that imply?

The more model accessibility is constrained, the more centralised the remaining access becomes. So a successful implementation of cryptographic containment would imply a substantial centralisation of power[7]. From the safety perspective, in the ideal case, we would want a system that is facilitated in an almost Manhattan-project-like fashion, with top-level security and multiple levels of failsafe mechanisms. Whoever is in charge of this operation is unlikely to invest this many resources into the development of new technology for charitable reasons (though I hope to be wrong here). This centralisation of power could be as detrimental as the absence of cryptographic containment in the first place, as the risks posed by potential misuse of Superintelligence (SI) may be on par with the risks posed by misaligned AI.

Potential solution?

Modern cryptography, namely Secure Multiparty Computation (MPC) and threshold cryptography, could allow for a secure decentralisation of cryptographic containment. MPC permits the computation of an arbitrary function between multiple parties in such a way that no party sees the other parties’ inputs or the full computation being performed. Threshold cryptography permits decryption once a specified threshold of m out of n key-shares is combined, where m ≤ n. Together, these two techniques allow for the implementation of a decentralised version of the cryptographic containment schemes outlined in part 1 of this post.

To illustrate how this can be done, I am going to propose the following example setup:

  1. The keys to be used for FHE are split among n trusted parties.

  2. The codebase (i.e. train->align->eval protocol[8]) is encrypted using FHE.

  3. All necessary data (model architecture, training data, alignment and evals code, etc.) are split among the trusted parties.

    1. These parties can, but do not have to, be the same ones as the FHE key-share holders.

  4. The MPC is performed (no single party can see the entire computation process).

  5. Once a desired state of the system is reached, the output is conditionally decrypted (the decryption conditions depend on the system design).

Example of a decentralised cryptographic boxing
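To make step 1 of the list above (splitting the FHE key among n trusted parties) concrete, here is a small, self-contained sketch of Shamir secret sharing over a prime field. Real threshold-FHE key management is considerably more involved; this only illustrates the m-of-n property, i.e. that any m shares reconstruct the key while fewer reveal essentially nothing.

```python
# Minimal Shamir secret sharing sketch, illustrating the m-of-n threshold
# property from step 1. Real threshold-FHE key management is far more involved.
import secrets

PRIME = 2**127 - 1  # a Mersenne prime, large enough for a toy ~128-bit secret

def split_secret(secret, n, m):
    """Split `secret` into n shares; any m of them can reconstruct it."""
    # Random polynomial of degree m-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):   # Horner evaluation of the polynomial mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = secrets.randbelow(PRIME)            # stand-in for an FHE secret key
shares = split_secret(key, n=5, m=3)      # 5 parties, any 3 can decrypt
assert reconstruct(shares[:3]) == key     # threshold met: key recovered
assert reconstruct(shares[:2]) != key     # below threshold: fails (with overwhelming probability)
```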

While this approach addresses the centralisation of power inherent in classic containment schemes, it exacerbates the problem of the already tremendous alignment tax. The exact computational overhead of MPC is extremely hard to estimate, as it heavily depends on the number of parties involved and the network bandwidth between them. These two variables determine the efficiency of the scheme, since every step of the computation needs to be shared with all the other parties. Without getting into details, we can expect the addition of MPC to a given scheme to add a further computational overhead of 100x or more[9].
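As a rough feel for how the party count alone drives that cost, here is a toy scaling exercise based on the secret-sharing complexity range quoted in footnote 9. It says nothing about constants, bandwidth or latency; it only shows the multiplier relative to a single-party evaluation of the same circuit.

```python
# Toy scaling exercise using the O(n^2·|C|) to O(n^3·|C|) range from footnote 9.
# The printed factors are multipliers on the per-gate cost relative to a
# single-party, O(|C|) evaluation of the same circuit; constants are ignored.
for n in (3, 10, 50):
    print(f"{n:>3} parties: ~{n**2:,}x to ~{n**3:,}x per-gate cost")
```

Note that already at around 10 parties the quadratic term alone gives roughly the 100x figure used above.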

Assuming the implementation of FHE schemes is feasible, which is a considerable assumption to make, the computational overhead added by including MPC in the scheme would be roughly 100x. However, if we are willing to assume that computational overhead is not an issue (in general, not just for FHE), the scheme outlined above could facilitate a global cooperative effort on what may be the most critical technological endeavour humanity has ever undertaken. If AGI is inevitable, decentralising its creation to the representatives of humanity, rather than having a single body in charge, might be a more prudent approach. Who those “humanity representatives” are is a different problem—governments may not be the ideal proxies for all of humanity’s preferences, but this question is too far outside my expertise for me to make meaningful judgments on.

Once again, the exact details of the implementation of decentralised cryptographic containment are subject to near-infinite possible configurations. However, they face the same issue as the centralised cryptographic containment schemes: they are simply computationally infeasible.

Conclusion

I found the idea of potentially being able to mathematically guarantee that an arbitrarily intelligent AI will only be decrypted upon passing certain safety specifications highly exciting to explore. It seems to me that these schemes will be implementable at some point in the future, but considering the rate of advancement in AI capabilities, it likely won’t be soon enough.

I might be, and hope I am, wrong in my scepticism about the likelihood of cryptographic boxing ever being implemented. I am far from a cryptography expert and am generally optimistic about clever workarounds to the overheads posed by FHE and MPC, whether they come from the algorithmic or the hardware side.

Additionally, a lot can be done with modern cryptography, and I would not be surprised if I missed some relatively simple way of achieving the same outcome without needing to encrypt the entire computation process. This might be an interesting direction for future research.

TL;DR /​ Summary

  • I defined cryptographic boxing as the use of encryption techniques like FHE to contain an AI and restrict its informational output, which can be used to help ensure alignment and prevent malicious behaviour.

  • Argued that the alignment tax of cryptographic boxing is huge: the computational overhead of using FHE for cryptographic boxing is at least 1,000x, making it impractical, if not impossible, to implement.

  • Argued that even with hardware acceleration, the inherent complexity and data-size increase of Fully Homomorphic Encryption (FHE) mean that it would still lag significantly behind the efficiency of the hardware currently used to train AI (which we can expect to improve with time, widening the efficiency gap).

  • Outlined some of the incentives against implementing boxing schemes: The high development costs and the drive for deployment of advanced AI capabilities make cryptographic boxing directly opposed to the motivation of AI labs, as it imposes a substantial computational and financial burden.

  • Outlined the caveat of successfully boxing AI—the centralisation of power it would entail, which arguably carries a risk on par with that of an uncontained, misaligned AI.

  • Proposed a decentralised cryptographic containment scheme using Secure Multiparty Computation (MPC) and threshold cryptography that could distribute this power, but which faces even greater inefficiency challenges, making it equally impractical.

As I am far from an expert on any of the topics covered in this post, all comments will be greatly appreciated!

Additionally, a longer, more nuanced (and less polished) version of this post can be found here.

  1. ^

    Famously argued by Yudkowsky through the AI-box experiment. Also argued in this post.

  2. ^

    Argued generally in this post.

  3. ^

    One can get very paranoid here, as it is possible for the AI to attempt communication even through unexpected channels such as the electromagnetic waves generated by its transistors or the heat patterns of its cooling system.

  4. ^

    This idea was originally proposed by Paul Christiano in 2010.

  5. ^

    A great breakdown of the shortcomings of the applications of FHE can be found in this paper.

  6. ^

    This includes SOTA noise management (bootstrapping). The full breakdown of my estimate of the computational complexity of FHE can be found in the table on page 3 of this document.

  7. ^

    It is not impossible that some social schemes, such as mandatory government oversight or other “anti-centralisation of access to SI” practices, could be implemented; however, how effective and practical these would be is highly speculative.

  8. ^

    Once again, the exact implementation is subject to near-infinite configuration.

  9. ^

    For garbled circuits, the complexity is typically O(|C|·n), while for secret-sharing-based protocols it can range from O(n²·|C|) to O(n³·|C|), depending on the security model. I recommend this paper, and this paper, if you want to read more.
