Currently, the best AI we have comes at enormous computing/energy costs, but we know by example that this isn’t a physical requirement.
AI does everything faster, including consumption of power. If we compare tokens per joule, counterintuitively LLMs turn out to be cheaper (for now), not more costly.
Any given collection of GPUs working on inference is processing on the order of 100 requests at the same time. So for inference, 16 GPUs (2 nodes of H100s or MI300Xs) at 1500 watts each (counting each GPU's share of whole-datacenter consumption) draw 24 kilowatts, but they are generating tokens for 100 LLM instances, each running about 300 times faster than a human generates relevant reasoning tokens (8 hours a day, one token per second). That is 100 × 300 = 30,000 human-speed equivalents, and dividing the 24 kilowatts by 30,000 gives about 1 watt. Training cost is roughly comparable to inference cost (across all inference done with a model), so it doesn’t completely change this estimate.
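For anyone who wants to poke at the inputs, here is a minimal sketch of the arithmetic above. The numbers are the ones from the comment; the variable and function names are just illustrative.

```python
# Hardware-based estimate: power per LLM instance slowed to human speed.
# All figures come from the comment above; the names are illustrative.
GPUS = 16                    # 2 nodes of 8 GPUs
WATTS_PER_GPU = 1500         # includes each GPU's share of datacenter overhead
CONCURRENT_INSTANCES = 100   # requests served at the same time
SPEEDUP_VS_HUMAN = 300       # vs. 1 token/s for 8 hours a day


def watts_per_human_speed_instance(gpus, watts_per_gpu, instances, speedup):
    """Total power divided by the number of human-speed equivalents."""
    total_watts = gpus * watts_per_gpu
    human_equivalents = instances * speedup
    return total_watts / human_equivalents


total_kw = GPUS * WATTS_PER_GPU / 1000
per_instance = watts_per_human_speed_instance(
    GPUS, WATTS_PER_GPU, CONCURRENT_INSTANCES, SPEEDUP_VS_HUMAN
)
print(f"{total_kw:.0f} kW total, {per_instance:.1f} W per human-speed instance")
# prints: 24 kW total, 0.8 W per human-speed instance
```

Halving the batch size or the speedup doubles the result, so the "about 1 watt" figure is only as good as those two guesses.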
An estimate from cost gives similar results. An H100 consumes 1500 watts (as a fraction of the whole datacenter) and costs $4/hour. A million tokens of Llama-3-405B cost $5, which buys 1.25 GPU-hours, or roughly 1.9 kilowatt-hours. A human takes a month to generate a million tokens, which is 750 hours. So the equivalent power consumed by an LLM to generate tokens at human speed is about 2 watts. A human brain consumes 10-30 watts (though for a fair comparison, reducing relevant use to 8 hours a day, this becomes more like 3-10 watts on average).
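The cost-based version can be sketched the same way. This assumes, as the comment implicitly does, that the whole token price pays for GPU rental; the names are illustrative.

```python
# Cost-based estimate: convert a token price into equivalent power.
# Assumption (implicit in the comment): the full token price is GPU rental.
GPU_WATTS = 1500                      # H100, including datacenter overhead
GPU_DOLLARS_PER_HOUR = 4.0
DOLLARS_PER_MILLION_TOKENS = 5.0      # Llama-3-405B
HUMAN_HOURS_PER_MILLION_TOKENS = 750  # about a month


def equivalent_watts(gpu_watts, gpu_price, token_price, human_hours):
    """Energy bought by a million tokens, spread over the human's time."""
    gpu_hours = token_price / gpu_price   # GPU time per million tokens
    watt_hours = gpu_hours * gpu_watts    # energy per million tokens
    return watt_hours / human_hours


watts = equivalent_watts(
    GPU_WATTS,
    GPU_DOLLARS_PER_HOUR,
    DOLLARS_PER_MILLION_TOKENS,
    HUMAN_HOURS_PER_MILLION_TOKENS,
)
print(f"{watts:.1f} W at human speed")
# prints: 2.5 W at human speed
```

Any provider margin in the $5 price means the real energy figure is lower, so this is closer to an upper bound than a point estimate.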
It matters what model is used to make the tokens: unlimited tokens from GPT-3 are of only limited use to me. If it requires ~GPT-6 to make useful tokens, then the energy cost is presumably a lot greater. I don’t know that it’s counterintuitive: a small, much less capable brain is faster and requires less energy, but is useless for many tasks.
It’s counterintuitive in the sense that a 24 kilowatt machine trained using a 24 megawatt machine turns out to be producing cognition cheaper per joule than a 20 watt brain. I think it’s plausible that a GPT-4 scale model can be an AGI if trained on an appropriate dataset (necessarily synthetic). Current models know a wildly unreasonable amount of trivia. Replacing that with general reasoning skills should be very effective.
There is funding for scaling from 5e25 FLOPs to 7e27 FLOPs and technical feasibility for scaling up to 3e29 FLOPs. This gives models with 5 trillion parameters (trained on 1 gigawatt clusters) and then 30 trillion parameters (using $1 trillion training systems). This is about 6 and then 30 times more expensive in joules per token than Llama-3-405B (assuming B200s for the 1 gigawatt clusters, and a further 30% FLOP/joule improvement for the $1 trillion system). So we only get to 6-12 watts and then 30-60 watts per LLM when divided among LLM instances that share the same hardware and slowed down to human-equivalent speed. (This is an oversimplification, since output token generation is not FLOPs-bound, unlike input tokens and training.)
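As a rough sketch, the scaled figures are just the earlier 1 to 2 watt range multiplied by the two cost factors. The baseline range and the labels below are my reading of the comment, not anything more precise.

```python
# Scale the ~1-2 W human-speed estimate for Llama-3-405B by the
# joules-per-token multipliers quoted above. Names are illustrative.
BASELINE_WATTS = (1.0, 2.0)  # low and high estimates from the thread
COST_MULTIPLIERS = {
    "5T params (1 GW cluster)": 6,
    "30T params ($1T system)": 30,
}


def scaled_range(baseline, multiplier):
    """Multiply both ends of a (low, high) watt range."""
    low, high = baseline
    return (low * multiplier, high * multiplier)


for label, multiplier in COST_MULTIPLIERS.items():
    low, high = scaled_range(BASELINE_WATTS, multiplier)
    print(f"{label}: {low:.0f}-{high:.0f} W per human-speed instance")
# prints:
# 5T params (1 GW cluster): 6-12 W per human-speed instance
# 30T params ($1T system): 30-60 W per human-speed instance
```

This treats decoding as FLOPs-bound, which the comment itself flags as an oversimplification.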
Kudos for referencing actual numbers. I don’t think it makes sense to measure humans in terms of tokens, but I don’t have a better metric handy. Tokens obviously aren’t all equivalent either. For some purposes, a small fast LLM is way more efficient than a human. For something like answering SIMPLEBENCH, I’d guess o1-preview is less efficient while still significantly below human performance.