The parallelization part (data parallelism, tensor parallelism, pipeline parallelism, ZeRO) is completely standard. See "Efficient Training on Multiple GPUs" by Hugging Face for a standard description. The failure recovery part is relatively unusual.
I don’t get what the parallelization strategy has to do with the chip ban. It sounds like just a basic parallelism approach.
You’re right. I was pretty tired when I wrote this and am not sure where that thought came from.