It costs well under $1/hour to rent hardware that performs 100 trillion operations per second. If a model using that much compute (something like 3 orders of magnitude more than GPT-3) were competitive with trained humans, it seems like it would be transformative. Even if you needed 3 more orders of magnitude to be human-level at typical tasks, it still looks like it would be transformative in a short period of time owing to its other advantages (quickly reaching and then surpassing the top end of the human range, and running at much higher serial speed; more likely you’d be paying 1000x as much to run your model 1000x faster than a human). If this were literally dropped in our laps right now it would fortunately be slowed down for a while because there just isn’t enough hardware, but that probably won’t be the case for long.
I’m trying to reconcile:
consider that writing 750 words with GPT-3 costs 6 cents.
vs
It costs well under $1/hour to rent hardware that performs 100 trillion operations per second. If a model using that much compute (something like 3 orders of magnitude more than GPT-3)...
That’s easy to reconcile! OpenAI is selling access to GPT-3 wayyyy above its own marginal hardware rental cost. Right? That would hardly be surprising; pricing decisions usually involve other things besides marginal costs, like price elasticity of demand, capacity to scale up, and so on. (And/or OpenAI’s marginal costs include things that are not hardware rental, e.g. human monitoring and approval processes.) But as soon as there’s some competition (especially competition from open-source projects) I expect the price to rapidly approach the hardware rental cost (including electricity).
Someone can correct me if I’m misunderstanding.
That estimate puts GPT-3 at about 500 billion floating point operations per word, 200x less than 100 trillion. If you think a human reads at 250 words per minute, then 6 cents for 750 words is $1.20/hour. So the two estimates differ by about 250x.
As a citation for the hardware cost:
P4d instances on EC2 currently cost $11.57/h if reserved for 3 years. They contain 8 A100s.
An A100 does about 624 trillion half-precision ops/second.
So that’s 430 trillion (operations per second) per ($/hour).
You shouldn’t expect to be able to get full utilization out of that for a variety of reasons, but in the very long run you should be getting reasonably close, certainly more than 100 trillion operations per second.
(ETA: But note that a service like the OpenAI API using EC2 would need to use on-demand prices, which are about 10x higher per FLOP, if you want reasonable availability.)
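Putting the numbers in this thread together, here is a minimal sketch (Python) of the arithmetic, using peak throughput and no utilization discount; the ~250x gap falls out of comparing the API’s price per FLOP against the ~430 trillion ops/s per $/hour hardware figure:

```python
# Back-of-envelope check of the numbers above, using only figures quoted
# in this thread and peak throughput (no utilization discount).

# Hardware side: P4d instance on EC2, 3-year reserved, 8x A100.
p4d_dollars_per_hour = 11.57
a100s_per_instance = 8
a100_half_precision_ops_per_sec = 624e12

ops_per_sec_per_dollar_hour = (a100s_per_instance * a100_half_precision_ops_per_sec
                               / p4d_dollars_per_hour)
print(f"{ops_per_sec_per_dollar_hour:.3g} ops/s per $/hour")  # ~4.3e14, i.e. ~430 trillion

# API side: 6 cents per 750 words, ~500 billion FLOP per word,
# and a human-reading-speed baseline of 250 words per minute.
api_dollars_per_word = 0.06 / 750
flop_per_word = 500e9
words_per_hour = 250 * 60

print(f"API cost at reading speed: ${api_dollars_per_word * words_per_hour:.2f}/hour")  # ~$1.20

# Price per FLOP: API vs. raw hardware rental.
api_dollars_per_flop = api_dollars_per_word / flop_per_word
hardware_dollars_per_flop = 1.0 / (ops_per_sec_per_dollar_hour * 3600)
print(f"API price / hardware price: {api_dollars_per_flop / hardware_dollars_per_flop:.0f}x")  # ~250x
```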
Limitation:
The price is the cost of compute plus additions for:
a) profit
b) recouping the cost of training or acquiring the model
Having an additional feature like human monitoring/approval also pushes the price higher. (In principle it might also increase quality.)
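To put rough shape on that, here is a toy decomposition of the per-word API price into those components. Only the 6-cents-per-750-words price and the ~250x hardware-markup estimate come from this thread; the split between the other components is a made-up placeholder, not OpenAI’s actual cost structure:

```python
# Toy decomposition of the per-word API price into the components above.
# The API price and the ~250x hardware markup are from this thread;
# the training-amortization and monitoring numbers are invented placeholders.

api_price_per_word = 0.06 / 750                              # $8e-5, from the quoted pricing
marginal_hardware_cost_per_word = api_price_per_word / 250   # if the ~250x estimate is right
training_amortization_per_word = 1e-5                        # recouping training/acquisition cost (assumed)
monitoring_cost_per_word = 5e-6                              # human monitoring and approval (assumed)

margin_per_word = (api_price_per_word
                   - marginal_hardware_cost_per_word
                   - training_amortization_per_word
                   - monitoring_cost_per_word)

print(f"implied profit margin: ${margin_per_word:.2e}/word "
      f"({margin_per_word / api_price_per_word:.0%} of the price)")
```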
You may have better info, but I’m not sure I expect 1000x better serial speed than humans (at least not with innovations in the next decade). Latency is already a bottleneck in practice, despite efforts to reduce it. Width-wise parallelism has its limits, and depth- or data-wise parallelism doesn’t improve latency. For example, GPT-3 already has high latency compared to smaller models, and making it 10^3x or 10^6x bigger won’t help.
I’m trying to figure out a principled way to calculate/estimate how long it would take to cross the human range in a situation like this. How do you think about it? Taking the history of Go as a precedent, it would seem that we’d get AGI capable of competing with the average human first, and then several years (decades?) later we’d get an AGI architecture+project that blows through the entire human range in a few months. That feels like it can’t be right.
Depends on what you mean by “human range.” Go took decades only if you count crossing the range from people who don’t play Go at all, to those who play as a hobby, to those who have trained very extensively. If you restrict to the range of “how good would this human be if they trained extensively at Go?” then I’d guess the range is much smaller: I’d guess that the median person could reach a few amateur dan with practice, so maybe you are looking at something like 10 stones of range between “unusually bad human” and “best human.”
My rough guess when I looked into it before was that doubling model size is worth about 1 stone around AlphaZero’s size/strength, so crossing that ~10-stone range is about a factor of 1000 in model size.
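Spelling out the arithmetic behind that factor of 1000 (both inputs are the rough guesses above):

```python
# Rough check of the "factor of 1000": ~10 stones of range between an
# unusually bad human and the best human, with each doubling of model size
# worth about 1 stone around AlphaZero's size/strength.
stones_of_range = 10
stones_per_doubling = 1
doublings_needed = stones_of_range / stones_per_doubling
print(2 ** doublings_needed)  # 1024.0, i.e. roughly a factor of 1000 in model size
```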
then several years (decades?) later we’d get an AGI architecture+project that blows through the entire human range in a few months. That feels like it can’t be right.
I think this is mostly an artifact of scaling up R&D effort really quickly. If you have a 50th percentile human and then radically scale up R&D, it wouldn’t be that surprising if you got to “best human” within a year. The reason it would seem surprising to me for AGI is that investment will already be high enough that it won’t be possible to scale up R&D that much / that fast as you approach the average human.
As Steven noted, your $1/hour number is cheaper than my numbers and probably more realistic. That makes a significant difference.
I agree that transformative impact is possible once we’ve built enough GPUs and connected them up into many, many new supercomputers bigger than the ones we have today. In a <=10 year timeline scenario, this seems like a bottleneck. But maybe not with longer timelines.