It would explain a lot. If 5-level models require a lot more compute, and Nvidia is strategically ensuring that no one yet has enough compute to train one while many have enough for 4-level models, then you’d see a lot of similarly strong models until someone competent to train a 5-level model is the first to accumulate enough compute. If you also think that essentially only OpenAI and perhaps Anthropic have the chops to pull it off, then that goes double.
I do still think, even if this theory were borne out, that the clustering at the 4-level remains suspicious and worth pondering.
If we assume that OpenAI and Anthropic would be happy to buy more NVIDIA chips at significantly higher prices, then we should also ask: how difficult would it be for them to achieve a similar boost in training capability with non-NVIDIA providers?
Is it simply impossible to do some of this development with non-NVIDIA chips, or is it just more expensive (at current NVIDIA prices)?
And, of course, Google is surely relying on its own chips, and Google’s models are at the same 4-level as everyone else’s.
Another question we should ask: what are the chances that some of the companies have large, slow, expensive-to-run 5-level models for internal use, but are in no hurry to disclose that?
GPT-4 existed under the radar between August 2022 and February 2023; could something similar be happening now?
Another question we should ask: what are the chances that some of the companies have large, slow, expensive-to-run 5-level models for internal use, but are in no hurry to disclose that?
Yeah, I think there’s a reasonable chance that it won’t make sense for the companies to release the ‘true’ level-5 models because of inference expense and speed. So what we’ll actually get is a smaller distilled version trained with the help of the strongest models. I don’t think that’s necessarily even a bad thing for consumers, but the idea certainly does make my curiosity itch.
it won’t make sense for the companies to release the ‘true’ level-5 models because of inference expense and speed.
Yes, not only that, but one does not want to show one’s true level to the competitors, and one does not want to let the competitors study the model by poking at it via the API.
And if a level-5 model is already a big help in AI R&D, one does not want to share it either; instead, one wants to use it to get ahead in the AI R&D race.
I can imagine a strategy of waiting until one has level-6 models for internal use before sharing full level-5 models.
And then there are safety and liability considerations. It’s not that internal use is completely 100% safe, but it’s way safer than exposing an API to the world.
Also, it looks like we are getting AIs that are easy to make corrigible, and thus to align iteratively toward DWIM goals, but such models can’t be released to the public without restrictions, because they could still be highly misused.
But how would this make sense from a financing perspective? If the company reveals that they are in possession of a 5-level model, they’d be able to raise money at a much higher valuation. Just imagine what would happen to Alphabet stock if they proved possession of something significantly smarter than GPT-4.
Also, the fact that Nvidia is selling its GPUs rather than keeping them all for itself does seem like some kind of evidence against this. If it were really all just a matter of scaling, why not cut everyone off and rush forward? They have more than enough resources by now to pay the foremost experts millions of dollars a year, and they’d have the best equipment too. Seems like a no-brainer if AGI were around the corner.
I don’t think the primary decision makers at Nvidia do believe AGI is likely to be developed soon. I think they are hyping AI because it makes them money, but not really believing that progress will continue all the way to AGI in the near future. Also, it’s not always as easy as throwing money at the problem (‘acquihiring’ being the neologism these days). Experts on a team they already believe to be the winning one would be really hard to convince to switch teams.
As for the company using the model to fundraise… Yeah, I think Google Deepmind is not likely to keep an extra powerful model secret for very long. Anthropic might. But also, you can give private demos to key investors under NDA if you want to impress them.
I do wonder if, in the future, AI companies will try to deliberately impair the AI research capabilities of their public models. I don’t expect it is happening yet. It would be a hard call to make, looking less competent in order to not share the advantage with competitors.
It feels hard to predict the details of how this all might play out!
I don’t think the primary decision makers at Nvidia do believe AGI is likely to be developed soon. I think they are hyping AI because it makes them money, but not really believing that progress will continue all the way to AGI in the near future.
I agree—and if they are at all rational they have expended significant resources to find out whether this belief is justified or not, and I’d take that seriously. If Nvidia do not believe that AGI is likely to be developed soon, I think they are probably right—and this makes more sense if there in fact aren’t any 5-level models around and scaling really has slowed down.
If I were in charge of Nvidia, I’d supply everybody until some design shows up that I believe will scale to AGI, and then I’d make sure to be the one who’s got the biggest training cluster. But since that’s not what’s happening yet, that’s evidence that Nvidia do not believe that the current paradigms are sufficiently capable.
there’s a reasonable chance that it won’t make sense for the companies to release the ‘true’ level-5 models because of inference expense and speed
Not really: Llama-3-405B goes for $3-5 per million output tokens with good speed, and it’s Chinchilla-optimal for 4e25 FLOPs (at 40 tokens/parameter, higher than Chinchilla’s 20, and consistent with findings in Imbue’s CARBS). At 1e27 FLOPs (feasible compute with 100K H100s training in FP8 for 6 months), we are only 25 times up from this in compute, which is 5 times up in model size (square root of the compute increase) and maybe 2 times up in model depth (square root of the model-size increase).
So a dense model at this scale should cost about $15-50 per million tokens (Claude 3 Opus goes for $75 per million output tokens) and get maybe 2-3 times slower, so there is still some room for margin even at reasonable prices. With the more effective choice of training a MoE model (which is smarter at the same training compute cost, but harder to set up and requires more users to become efficient to serve), the inference cost might get somewhat higher, but it can still stay within last year’s precedent. So it doesn’t even need to be game-changingly better to be worth the price, just notably better. Also, next year’s Blackwell is 2x faster and can do inference in FP4 an additional 2x faster on top of that (which Hopper can’t), though that’s more relevant for input tokens.
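To sanity-check the arithmetic, here is a minimal sketch in Python. It assumes the usual C ≈ 6·N·D approximation for training compute, the 40 tokens/parameter ratio mentioned above, and (crudely) that serving price scales roughly linearly with parameter count; none of these numbers come from the labs themselves.

```python
# Rough sketch of the scaling arithmetic above, assuming:
#   - training compute C ~ 6 * N * D (N = parameters, D = training tokens)
#   - a fixed ratio D = 40 * N (tokens per parameter), as stated above
#   - serving price scaling roughly linearly with parameter count (a crude heuristic)

import math

TOKENS_PER_PARAM = 40  # assumed compute-optimal ratio from the comment above

def optimal_params(compute_flops: float) -> float:
    """Parameters N such that 6 * N * (40 * N) = compute_flops."""
    return math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))

llama3_compute = 4e25    # FLOPs attributed to Llama-3-405B above
next_gen_compute = 1e27  # ~100K H100s, FP8, ~6 months (figure from the comment)

n_llama = optimal_params(llama3_compute)    # ~4.1e11, i.e. roughly 405B
n_next = optimal_params(next_gen_compute)   # ~2.0e12, i.e. roughly 2T

compute_ratio = next_gen_compute / llama3_compute  # 25x
size_ratio = n_next / n_llama                      # sqrt(25) = 5x
depth_ratio = math.sqrt(size_ratio)                # ~2.2x, if depth ~ sqrt(size)

# Naive price extrapolation: if $/token scales ~linearly with parameter count,
# 5x the parameters of a $3-5 model lands around $15-25 per million output
# tokens, the low end of the $15-50 ballpark above (which leaves headroom
# for slower serving and margin).
price_low, price_high = 3 * size_ratio, 5 * size_ratio

print(f"compute ratio: {compute_ratio:.0f}x, size ratio: {size_ratio:.1f}x, "
      f"depth ratio: {depth_ratio:.1f}x")
print(f"naive price range: ${price_low:.0f}-{price_high:.0f} per million output tokens")
```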