GPT-4 was trained on OpenAI’s new supercomputer which is composed of 800 NVIDIA DGX A100 nodes...Each DGX node has 8x A100 GPUs
Where are you getting that? 8×800 = 6,400 A100s sounds off by a factor of three.
(Also, it does not follow that they are storing the parameters in a single precision rather than in mixed precision, nor solely in RAM; most of the serious scaling frameworks support various model-offload approaches.)
I don’t remember; I probably misremembered the number of DGX nodes. Will edit the comment to remove that figure.
Assuming that GPT-4 is able to run on a single DGX node with 640 GB of VRAM, what would be a reasonable upper bound on the parameter count, given that they’re using mixed precision and model-offload approaches?
[Edit] I’ve been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, since loading layers to and from VRAM would be far too time-consuming for them to build a usable API on top of it. If the model is too large to fit on a single DGX node, it’s more likely that they’re splitting the layers across multiple nodes rather than offloading weights.
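To put a rough number on that intuition, here is a back-of-the-envelope sketch of the per-pass latency that weight offloading would add. All of the figures (model size, link bandwidth, GPU count) are illustrative assumptions on my part, not anything known about OpenAI’s setup:

```python
# Rough estimate of how long it takes to stream offloaded weights into VRAM
# for a single forward pass. Every number below is an assumed, illustrative value.
MODEL_BYTES = 350e9 * 2      # hypothetical 350B-parameter model stored in float16
LINK_BANDWIDTH = 32e9        # ~32 GB/s, roughly PCIe 4.0 x16 per GPU
NUM_GPUS = 8                 # streaming in parallel across the node's eight GPUs

transfer_seconds = MODEL_BYTES / (LINK_BANDWIDTH * NUM_GPUS)
print(f"~{transfer_seconds:.1f} s just to page the weights in per pass")  # ~2.7 s
```

Even under these optimistic parallel-transfer assumptions, paging the full weight set in on every pass adds seconds of latency per request, which is consistent with offloading being impractical for a production API.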
I already assumed that they’re using float16, like GPT-3, when calculating the total number of parameters that could be stored in one DGX node’s VRAM. Unless they’re using something even smaller, like float8, mixed precision with float32 or float64 would only increase the VRAM requirements.
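For concreteness, the calculation under that assumption looks like this (a minimal sketch, assuming float16 weights, no offloading, and ignoring activation and KV-cache memory):

```python
# Back-of-the-envelope upper bound on how many parameters fit in one DGX A100 node,
# assuming float16 weights (2 bytes each) and no offloading.
VRAM_BYTES = 640e9        # 8 x 80 GB A100s per DGX node
BYTES_PER_PARAM = 2       # float16 / bfloat16

max_params = VRAM_BYTES / BYTES_PER_PARAM
print(f"Upper bound: ~{max_params / 1e9:.0f}B parameters")  # ~320B
```

In practice activations, the KV cache, and framework overhead eat into that budget, so the realistic ceiling for the weights alone is somewhat lower.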
You’re missing the possibility that the model used during training was larger than the models used for inference. It is now common practice to train a large model and then distill it into a series of smaller models that can be used depending on the task.
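For reference, the standard recipe here is Hinton-style knowledge distillation: the small student is trained to match the large teacher’s temperature-softened output distribution. The sketch below is a generic illustration of that loss in PyTorch, not a claim about what OpenAI actually does (the vocabulary size is just a placeholder):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so the gradient magnitude matches the usual hard-label loss.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: teacher and student both emit logits over a placeholder 50,257-token vocabulary.
teacher_logits = torch.randn(4, 50257)
student_logits = torch.randn(4, 50257, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

In a real distillation setup this term is usually combined with the ordinary next-token cross-entropy on the training data, and the smaller distilled model is what gets served for inference.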