We are headed into an extreme compute overhang

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running ~1,000,000 concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: “enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed”. A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom’s Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

For practical reasons, the compute requirements for training LLMs are several orders of magnitude larger than what is required to run a single inference instance. In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2,000 tokens/s, while Meta trained Llama 3 70B on a GPU cluster[1] of about 24,000 GPUs. Assuming we require a performance of 40 tokens/s, the training cluster could run ~1,200,000 concurrent instances of the resulting 70B model.
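As a sanity check, here is the arithmetic as a minimal Python sketch. The per-GPU throughput and the 40 tokens/s requirement are the rough figures assumed above, not measured constants.

```python
# Back-of-the-envelope: concurrent inference instances supported by the
# Llama 3 70B training cluster, using the rough figures cited above.
training_gpus = 24_000            # approximate size of one training cluster
tokens_per_sec_per_gpu = 2_000    # approximate H100 inference throughput
tokens_per_sec_per_instance = 40  # assumed per-instance speed requirement

cluster_throughput = training_gpus * tokens_per_sec_per_gpu  # 4.8e7 tokens/s
concurrent_instances = cluster_throughput // tokens_per_sec_per_instance

print(f"~{concurrent_instances:,} concurrent instances")  # ~1,200,000
```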

I will assume that the above ratios hold for an AGI-level model. Considering the amount of data children absorb via the vision pathway, the amount of training data for LLMs may be roughly comparable to the data humans are trained on, so the current ratios are a useful anchor. This is explored further in the appendix.

Given the above ratios, we will have the capacity for ~1e6 AGI instances at the moment that training is complete. This would likely lead to superintelligence via the “collective superintelligence” approach. Additional speed may then be available via accelerators such as GroqChip, which produces 300 tokens/s for a single instance of a 70B model. This would result in a “speed superintelligence” or a combined “speed+collective superintelligence”.

From AGI to ASI

With 1e6 AGIs, we may be able to construct an ASI, with the AGIs collaborating in a “collective superintelligence”. Similar to groups of collaborating humans, a collective superintelligence divides tasks among its members for concurrent execution.

AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.

Tasks that are inherently serial would benefit more from a speedup than from a division of tasks. An accelerator such as GroqChip will be able to accelerate serial thought speed by a factor of 10x or more.
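For a rough sense of scale, a sketch comparing the GroqChip figure against the 40 tokens/s baseline assumed earlier, and against an assumed human-equivalent rate of ~5 tokens/s (an illustrative guess, not a measured figure):

```python
# Rough serial thought-speed multiples for a 300 tokens/s accelerator.
groq_tps = 300     # GroqChip throughput for a single 70B instance
baseline_tps = 40  # per-instance deployment speed assumed in this post
human_tps = 5      # assumed human-equivalent rate; illustrative guess only

print(f"vs. 40 tokens/s baseline: ~{groq_tps / baseline_tps:.1f}x")  # ~7.5x
print(f"vs. assumed human rate:   ~{groq_tps / human_tps:.0f}x")     # ~60x
```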

Counterpoints

  • It may be the case that a collective of sub-AGI models can reach AGI capability. It would be advantageous if we could achieve AGI earlier, with sub-AGI components, at a higher hardware cost per instance. This would reduce the compute overhang at the critical point in time.

  • There may be a paradigm shift on the path to AGI that results in smaller training clusters, reducing the overhang at the critical point.

Conclusion

A single AGI may be able to replace one human worker, presenting minimal risk. A fleet of 1,000,000 AGIs may give rise to a collective superintelligence. This capability is likely to be available immediately upon training the AGI model.

We may be able to mitigate the overhang by achieving AGI with a cluster of sub-AGI components.

Appendix—Training Data Volume

A calculation of training data processed by humans during development:

  • time: ~20 years, or ~6e8 seconds

  • raw data input: ~10 Mbit/s = 1e7 bits/s

  • total human training data: 6e8 s * 1e7 bits/s = 6e15 bits

  • Llama 3 training size: 1.5e13 tokens * 16 bits/token ≈ 2.4e14 bits

The amount of data used for training current-generation LLMs is within roughly 1.5 orders of magnitude of (about 25x less than) the amount processed by humans during childhood, so the two are broadly comparable.
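A minimal script reproducing the arithmetic above; the 10 Mbit/s sensory-bandwidth figure is an order-of-magnitude assumption, not a measured constant.

```python
# Comparing estimated human "training data" with Llama 3's training set.
seconds = 20 * 365 * 24 * 3600   # ~6.3e8 s of development time
human_bits_per_sec = 1e7         # assumed ~10 Mbit/s raw sensory input
human_bits = seconds * human_bits_per_sec      # ~6e15 bits

llama3_tokens = 1.5e13
bits_per_token = 16
llama3_bits = llama3_tokens * bits_per_token   # ~2.4e14 bits

print(f"human/LLM data ratio: ~{human_bits / llama3_bits:.0f}x")  # ~26x
```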

References

  1. Two clusters are actually in production, and a 400B model is still being trained.