It could be sparse... a 175B-parameter GPT-4 with 90 percent sparsity could be essentially equivalent to a 1.75T-parameter GPT-3. Also, I'm not exactly sure, but my guess is that if it's multimodal the scaling laws change (essentially you get more varied data instead of always training on text prediction, which is repetitive and where likely only a small percentage contains new, useful information to learn).
My impression (could be totally wrong) was that GPT-4 won't be much larger than GPT-3, but its effective parameter size will be much larger thanks to techniques like this.
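A minimal sketch of the idea, assuming "sparsity" just means most weights are zeroed out (the numbers and setup below are made up for illustration, not anything known about GPT-4):

```python
# Hypothetical illustration: counting "effective" parameters of a sparse
# weight tensor, where sparsity = fraction of weights forced to zero.
import numpy as np

rng = np.random.default_rng(0)

total_params = 1_000_000          # stand-in for a much larger model
sparsity = 0.90                   # 90% of weights zeroed out

weights = rng.standard_normal(total_params)
mask = rng.random(total_params) >= sparsity   # keep ~10% of weights
sparse_weights = weights * mask

active = np.count_nonzero(sparse_weights)
print(f"total parameters:  {total_params:,}")
print(f"active parameters: {active:,} (~{active / total_params:.0%})")

# Storing and computing only the ~10% active weights is what would let a
# nominally huge sparse model cost roughly as much as a 10x smaller dense one.
```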