The only other explicit replication attempt I am aware of, the GPT-NeoX project by the independent research collective EleutherAI, has not succeeded.
FWIW, if the desire had been there, I think EleutherAI definitely could have replicated a full 175B-parameter GPT-3 equivalent. They had (and have) sufficient compute, knowledge, etc. From what I recall from being part of that community, some reasons for stopping at 20B were:
Many folks wanted to avoid racing forward and avoid advancing timelines. Not training a big model “just because” was one way to do that.
20B-parameter models are stretching what you can fit on any single GPU setup, and 175B-parameter models require a whole bunch of GPUs just to run inference, so they’re ~useless for most researchers (rough numbers sketched after this list).
Nobody wanted to lead that project when push came to shove. GPT-Neo, GPT-J, and GPT-NeoX-20B all took lots of individual initiative to get done, and most of the individuals who led those projects had other priorities by the time the compute for a 175B-parameter training run became available.
Given DeepMind’s Chinchilla findings, putting out the best possible 175B-parameter model would’ve required gathering a lot more data, which would have been a PITA.
Eventually, Meta put out a replication (OPT-175B), which made working on another replication seem silly.
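To make the GPU and data points above concrete, here is a minimal back-of-envelope sketch. The specific numbers are my own assumptions rather than anything from the comment: fp16/bf16 weights at 2 bytes per parameter, an 80 GB A100 as the reference “single GPU”, and Chinchilla’s rough ~20 tokens-per-parameter rule of thumb for compute-optimal training.

```python
import math

# Rough numbers behind the "single GPU" and "more data" points above.
# Assumptions (mine, not the comment's): fp16/bf16 weights at 2 bytes/param,
# an 80 GB A100 as the single-GPU budget, and ~20 tokens/param (Chinchilla).

GPU_MEMORY_GB = 80        # assumed single-GPU memory budget
BYTES_PER_PARAM = 2       # fp16/bf16 weights only; ignores activations, KV cache, optimizer state
TOKENS_PER_PARAM = 20     # approximate Chinchilla compute-optimal ratio

for params_billion in (20, 175):
    weights_gb = params_billion * BYTES_PER_PARAM             # 1e9 params * 2 bytes = 2 GB per billion params
    gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)  # GPUs needed just to hold the weights
    optimal_tokens_trillion = params_billion * TOKENS_PER_PARAM / 1000
    print(f"{params_billion}B params: ~{weights_gb} GB of fp16 weights, "
          f">= {gpus_for_weights} GPU(s) just for the weights, "
          f"Chinchilla-optimal data ~ {optimal_tokens_trillion:.1f}T tokens")
```

Run as-is, this prints roughly 40 GB / 1 GPU / 0.4T tokens for 20B and 350 GB / 5 GPUs / 3.5T tokens for 175B: 20B barely fits on one 80 GB card once activations and KV cache are included, 175B needs a multi-GPU node just for the weights, and a compute-optimal 175B run wants on the order of 3.5T tokens, roughly 10x the ~300B tokens GPT-3 was actually trained on.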