I’m confused about the parallelization part and what it implies. It says the model was trained on 2K GPUs, but GPT-4 was probably trained on roughly an order of magnitude more than that, right?
The parallelization part (data parallelism, tensor parallelism, pipeline parallelism, ZeRO) is completely standard; see Hugging Face’s Efficient Training on Multiple GPUs guide for a standard description. The failure-recovery part is relatively unusual.
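For a rough sense of what “standard” means here, below is a minimal sketch of plain data parallelism with PyTorch DDP, the baseline strategy that guide starts from (tensor/pipeline parallelism and ZeRO add sharding on top of this). The model and batch are just hypothetical stand-ins, not anything from the paper.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])            # full replica on each GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)       # stand-in batch
        loss = model(x).pow(2).mean()
        loss.backward()                                     # gradients all-reduced across ranks
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

ZeRO differs from this mainly in that optimizer state, gradients, and (at stage 3) parameters are sharded across ranks instead of fully replicated, which is what makes large models fit; the training loop itself looks much the same.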
I don’t see what the parallelization strategy has to do with the chip ban, though. It sounds like just a basic parallelism approach.
You’re right. I was pretty tired when I wrote this and am not sure where that thought came from.