GPT-4 was trained on OpenAI’s new supercomputer which is composed of [edit] NVIDIA DGX A100 nodes.
I’m assuming each individual instance of GPT-4 runs on one DGX A100 node.
Each DGX node has 8x A100 GPUs. Each A100 has either 40 or 80 GB of VRAM, so a single DGX node running GPT-4 has either 320 or 640 GB. That lets us calculate an upper limit on the number of parameters in a single GPT-4 instance.
Assuming GPT-4 uses float16 to represent parameters (same as GPT-3), and assuming they’re using the 80GB A100s, that gives us an upper limit of 343 billion parameters in one GPT-4 instance.
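For concreteness, here is the arithmetic behind that 343 billion figure; it comes out if the 640 GB is treated as 640 GiB and every parameter takes 2 bytes, with no allowance for activations, KV cache, or other overhead (those are my simplifying assumptions, not anything OpenAI has confirmed):

```python
# Back-of-envelope upper bound on parameters that fit in one DGX A100 node's VRAM.
# Assumptions: memory counted in GiB, 2 bytes per float16 parameter, zero overhead.

BYTES_PER_GIB = 2**30
BYTES_PER_PARAM_FP16 = 2

node_vram_bytes = 8 * 80 * BYTES_PER_GIB      # 8x A100 80GB per DGX node
max_params = node_vram_bytes / BYTES_PER_PARAM_FP16

print(f"~{max_params / 1e9:.1f} billion parameters")   # ~343.6 billion
```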
GPT-3 had 175 billion parameters. I’ve seen a few references online to an interview where Sam Altman said GPT-4 actually has fewer parameters than GPT-3, but with a different architecture and more training. I can’t find the original source, so I can’t verify that quote, but if it’s true it gives us a lower bound slightly below GPT-3’s 175 billion parameters.
Looking at the compute architecture that it’s running on, an upper bound of 343 billion parameters seems reasonable.
[edited: removed incorrect estimate of the number of DGX nodes as 800. The figure wasn’t used in the parameter estimate anyway.]
GPT-4 was trained on OpenAI’s new supercomputer which is composed of 800 NVIDIA DGX A100 nodes...Each DGX node has 8x A100 GPUs
Where are you getting that? 8x800=6400 A100s sounds off by a factor of three.
(Also, it does not follow that they are storing parameters in the same precision rather than mixed precision, or solely in RAM, and most of the serious scaling frameworks support various model-offload approaches.)
I don’t remember; I probably misremembered the number of DGX nodes. Will edit the comment to remove that figure.
Assuming GPT-4 can run on a single DGX node with 640 GB of VRAM, what would be a reasonable upper bound on the parameter count if they’re using mixed precision and model-offload approaches?
[Edit] I’ve been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, as the process of loading layers to and from VRAM would be far too time-consuming for them to build a usable API on top of it. If the model is too large to fit on a single DGX node, it’s more likely that they’re splitting the layers over multiple nodes rather than offloading weights.
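To make the "too time-consuming" point concrete, here is a rough sketch with illustrative numbers (the ~32 GB/s PCIe 4.0 x16 bandwidth, the 640 GB model size, and the even-split sharding are all assumptions on my part, not measurements of OpenAI's setup):

```python
# Rough illustration of why streaming weights from host memory on every forward
# pass is too slow for an interactive API. All numbers are illustrative assumptions.

PCIE4_X16_GBPS = 32          # approximate peak PCIe 4.0 x16 bandwidth, GB/s
MODEL_SIZE_GB = 640          # hypothetical model that exactly fills a DGX node
NUM_GPUS = 8                 # GPUs in one DGX A100 node

# Suppose each GPU has to re-stream its shard of the weights from host RAM
# for every forward pass (i.e., for every generated token).
shard_gb = MODEL_SIZE_GB / NUM_GPUS
seconds_per_pass = shard_gb / PCIE4_X16_GBPS

print(f"~{seconds_per_pass:.1f} s of weight transfer per generated token")  # ~2.5 s
```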
I already assumed they’re using float16, like GPT-3, when calculating the total number of parameters that could be stored in one DGX node’s VRAM. Unless they’re using something even smaller like float8, mixed precision with float32 or float64 would only increase the VRAM requirements.
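To spell out how the single-node bound moves with precision (still assuming 640 GiB of usable VRAM and ignoring all overhead, which are my simplifications):

```python
# How the single-node parameter bound scales with bytes per parameter,
# assuming 640 GiB of usable VRAM and zero overhead (simplifying assumptions).

NODE_VRAM_BYTES = 8 * 80 * 2**30

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8/fp8", 1)]:
    max_params_billion = NODE_VRAM_BYTES / bytes_per_param / 1e9
    print(f"{name}: ~{max_params_billion:.1f} billion parameters max")

# float32: ~171.8 billion, float16: ~343.6 billion, int8/fp8: ~687.2 billion
```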
You’re missing the possibility that the model used during training was larger than the models used for inference. It is now common practice to train a large model and then distill it into a series of smaller models that can be used depending on the task.
This source (a Chinese news website), "Not 175 billion! OpenAI CEO’s announcement: GPT-4 parameters do not increase but decrease" (iMedia, min.news), cites the Sam Altman quote about GPT-4 having fewer parameters as being from the AC10 online meetup; however, I can’t find any transcript or video of that meetup to verify it.