In the long term, Moore’s law seems to be on your side.
How parallelizable is GPT-3? I honestly have no idea.
Also, how difficult would it be to obtain the data GPT-3 was trained on? (You could also use different data, but then I assume the users of the distributed AI would have to agree on the same dataset, so that the AI does not have to be trained separately for each of them. Or maybe you could have multiple shared datasets, with each user choosing which one to use.)
The data for GPT-2 has been replicated by the open-source OpenWebText project. To my knowledge the same dataset was used for GPT-3, so accessing it is not a problem.
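For what it’s worth, here is a minimal sketch of pulling an OpenWebText replication for local experiments; the `openwebtext` dataset name on the Hugging Face hub is an assumption about which mirror you’d actually use, and streaming avoids downloading the full corpus up front:

```python
# Sketch: stream an OpenWebText replication (assumed Hugging Face mirror).
from datasets import load_dataset

# streaming=True iterates over records without downloading the whole corpus first
openwebtext = load_dataset("openwebtext", split="train", streaming=True)

for i, example in enumerate(openwebtext):
    print(example["text"][:200])  # each record is a single "text" field
    if i >= 2:
        break
```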
The parallelizability of GPT-3 is something I’ve been looking into. The current implementation of ZeRO-2 seems like the best way to train a 175B-parameter transformer model in a memory-efficient way.
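For concreteness, here is a minimal sketch of what ZeRO stage-2 training looks like with DeepSpeed; the toy model, batch sizes, and optimizer settings are placeholders, not anything tuned for a GPT-3-scale run:

```python
# Minimal ZeRO stage-2 sketch with DeepSpeed (toy model, placeholder hyperparameters).
# Launch with the deepspeed launcher, e.g.:  deepspeed train.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                # partition optimizer state + gradients across workers
        "overlap_comm": True,      # overlap gradient reduction with the backward pass
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Stand-in for a real transformer; a GPT-3-scale model would go here instead.
model = torch.nn.Sequential(
    torch.nn.Embedding(50257, 1024),
    torch.nn.Linear(1024, 50257),
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One dummy training step to show the shape of the DeepSpeed loop.
tokens = torch.randint(0, 50257, (4, 128), device=engine.device)
logits = engine(tokens)
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), tokens.view(-1)
)
engine.backward(loss)  # ZeRO-2 reduce-scatters gradients to their owning ranks
engine.step()
```

The point of stage 2 is that each worker only keeps its own shard of the optimizer state and gradients, which is where most of the memory for a model this size goes.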