Rumor has it that GPT-4 will have fewer than 1T parameters (and possibly be no larger than GPT-3). Unless Chinchilla turns out to be wrong or obsolete, this is apparently to be expected.
It could be sparse... a 175B-parameter GPT-4 with 90 percent sparsity could be essentially equivalent to a 1.75T-parameter GPT-3. Also, I am not exactly sure, but my guess is that if it is multimodal, the scaling laws change: essentially you get more varied data instead of always training on text prediction, where the data is repetitive and likely only a small percentage contains new useful information to learn.
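As a back-of-the-envelope check of that 10x figure (reading "90 percent sparsity" as 90 percent of weights being inactive, which is my interpretation, not something the rumor specifies):

```python
# Toy arithmetic for the sparsity claim above. Assumption (mine): "90
# percent sparsity" means 90% of weights are zero/inactive, so
# active parameters = total parameters * (1 - sparsity).

active_params = 175e9        # GPT-3-scale count of *active* parameters
sparsity = 0.90              # fraction of weights that are inactive

# Total parameters a sparse model can carry while keeping only 175B active:
total_params = active_params / (1 - sparsity)
print(f"{total_params:.3g}")  # 1.75e+12, i.e. 1.75T -- the 10x in the comment
```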
My impression (could be totally wrong) was that GPT-4 won't be much larger than GPT-3, but its effective parameter count will be much larger through techniques like this.
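For what it's worth, here is a minimal mixture-of-experts sketch (my illustration; nothing confirmed about GPT-4's architecture) of how "effective parameter size" can exceed the active size: the model stores many experts' worth of weights but routes each input through only one of them.

```python
import numpy as np

# Minimal top-1 mixture-of-experts layer (toy sizes). Total weights are
# n_experts times a dense layer's, but each input only touches one
# expert, so active parameters per input stay at the dense level.
rng = np.random.default_rng(0)
d, n_experts = 64, 10

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router              # router scores each expert for this input
    chosen = int(np.argmax(scores))  # top-1 routing: 9/10 experts stay idle
    return x @ experts[chosen]

x = rng.standard_normal(d)
y = moe_forward(x)
total = n_experts * d * d            # parameters stored
active = d * d                       # parameters used for this input
print(y.shape, total // active)      # (64,) 10 -> 10x total-vs-active ratio
```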