Here’s an equation for the MMLU vs Loss plot:
An MMLU of 100% corresponds to a loss of 1.8304. Using the scaling laws listed here, this loss can be reached with any of the following (see the sketch after the list):
- The GPT-4 dataset (4T tokens) and a model 11x the size of Megatron-Turing NLG (6 trillion parameters). Compute time: 111 days on Eos.
- GPT-4's 175B params with 18.5 trillion training tokens (4.6x the size of GPT-4's dataset). Compute time: 16 days on Eos, but getting that many tokens may be a problem.
- Megatron-Turing NLG's 530B parameters and 8.5 trillion tokens (2.1x the size of GPT-4's dataset). Compute time: 23 days on Eos. This is a much more reachable dataset.
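The loss target and the three options above are consistent with the Chinchilla parametric fit from Hoffmann et al. (2022). Here's a minimal sketch, assuming that is the scaling law meant by "listed here" (the constants E, A, B, alpha, beta are the paper's fitted values; that this is the fit being used is my assumption):

```python
# Minimal sketch: Chinchilla-style parametric loss
#   L(N, D) = E + A / N**alpha + B / D**beta
# with the fitted constants from Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

options = {
    "6T params, 4T tokens (GPT-4-sized dataset)": (6e12, 4e12),
    "175B params, 18.5T tokens": (175e9, 18.5e12),
    "530B params (MT-NLG), 8.5T tokens": (530e9, 8.5e12),
}

for name, (n, d) in options.items():
    print(f"{name}: predicted loss ≈ {loss(n, d):.4f}")
# Each option comes out at roughly 1.830, i.e. the target loss of 1.8304.
```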
The compute speed of Eos used here, as for the GPT-4 estimate, was 18.4 ExaFLOP/s.
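A rough sketch of the compute-time arithmetic, assuming the standard C ≈ 6·N·D training-FLOP approximation (not stated in the post) and Eos at 18.4 ExaFLOP/s. At peak throughput this gives somewhat lower day counts than the ones quoted above, which presumably bake in a below-peak utilization factor:

```python
# Rough compute-time estimate, assuming training FLOPs ≈ 6 * N * D
# and Eos throughput of 18.4 ExaFLOP/s.
EOS_FLOPS = 18.4e18      # FLOP/s
SECONDS_PER_DAY = 86_400

def training_days(n_params: float, n_tokens: float, utilization: float = 1.0) -> float:
    """Days to train, given parameter count, token count, and assumed utilization."""
    flops = 6 * n_params * n_tokens
    return flops / (EOS_FLOPS * utilization) / SECONDS_PER_DAY

for n, d in [(6e12, 4e12), (175e9, 18.5e12), (530e9, 8.5e12)]:
    print(f"N={n:.3g}, D={d:.3g}: ~{training_days(n, d):.0f} days at peak throughput")
```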
Data seems to be a bottleneck, so we should expect the number of model parameters to run high to compensate.
Note that an MMLU of 100% should be achievable using a model the same size as Megatron-Turing NLG and a dataset only 2.1x larger than GPT-4's, which seems reachable in the near term.