I don’t remember; I probably misremembered the number of DGX nodes. I’ll edit the comment to remove that figure.
Assuming GPT-4 can run on a single DGX node with 640 GB of VRAM, what would be a reasonable upper bound on the parameter count if they’re using mixed precision and model-offload approaches?
[Edit] I’ve been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, as loading layers to and from VRAM would be far too time-consuming for them to build a usable API on top of it. If the model is too large to fit on a single DGX node, it’s more likely that they’re splitting the layers across multiple nodes rather than offloading weights.
I already assumed they’re using float16, like GPT-3, when calculating the total number of parameters that could be stored in one DGX node’s VRAM. Unless they’re using something even smaller, like float8, mixed precision with float32 or float64 would only increase the VRAM requirements.
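For reference, a rough sketch of that back-of-envelope calculation (treating 640 GB as 640e9 bytes and ignoring activation memory and KV cache, so the real ceiling would be somewhat lower):

```python
# Upper bound on parameters that fit in one DGX node's 640 GB of VRAM
# at different storage widths. Weights only; runtime buffers excluded.
VRAM_BYTES = 640e9

for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("float8/int8", 1)]:
    max_params = VRAM_BYTES / bytes_per_param
    print(f"{dtype}: ~{max_params / 1e9:.0f}B parameters")

# float32:     ~160B parameters
# float16:     ~320B parameters
# float8/int8: ~640B parameters
```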
You’re missing the possibility that the model used during training was larger than the models used for inference. It is now common practice to train a large model and then distill it into a series of smaller models that can be chosen depending on the task.
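A minimal sketch of what that distillation step typically looks like (hypothetical teacher/student models and a standard soft-label KL loss; this is just the general technique, not anything confirmed about GPT-4):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label KL loss: push the student's output distribution toward the teacher's."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable to training at T=1.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

# Hypothetical usage: teacher is the big trained model (frozen),
# student is the smaller model that actually gets served.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
```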