Here’s an equation for the MMLA vs. Loss plot: MMLA = −2.468 × Loss + 5.5174
An MMLA of 100% corresponds to a loss of 1.8304 (a quick check of this number follows the list below). Using the scaling laws listed here, this loss can be reached in any of the following ways:
The GPT-4 dataset (~4 trillion tokens) and a model 11x the size of Megatron-Turing NLG (about 6 trillion parameters). Compute time: 111 days on Eos.
GPT-4's 175B params with 18.5 trillion training tokens (4.6x the size of GPT-4's dataset). Compute time: 16 days on Eos, but getting that many tokens may be a problem.
Megatron-Turing NLG's 530B parameters and 8.5 trillion tokens (2.1x the size of GPT-4's dataset). Compute time: 23 days on Eos. This dataset size is much more attainable.
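As a quick check on the loss target quoted above, here is a minimal Python sketch that inverts the linear fit (the fit only reproduces the 1.8304 figure if MMLA is treated as a fraction, so 100% = 1.0); the function name is just for illustration.

```python
# Invert the linear fit MMLA = -2.468 * loss + 5.5174 to find the loss
# corresponding to a given MMLA score (expressed as a fraction, so 100% = 1.0).

SLOPE = -2.468
INTERCEPT = 5.5174

def loss_for_mmla(mmla: float) -> float:
    """Solve mmla = SLOPE * loss + INTERCEPT for loss."""
    return (mmla - INTERCEPT) / SLOPE

print(loss_for_mmla(1.0))  # ~1.8304, the loss quoted above for MMLA = 100%
```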
The compute speed of Eos assumed for these GPT-4 training estimates was 18.4 ExaFLOP/s.
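For the day counts in the list, here is a rough sketch of how such estimates can be produced, assuming the common C ≈ 6·N·D approximation for training FLOPs and the 18.4 ExaFLOP/s figure for Eos. This is only an assumption about the method; at full utilization it comes out somewhat below the quoted figures, which look consistent with roughly 75–80% effective utilization or a larger FLOPs-per-parameter-token constant.

```python
# Rough training-time sketch: compute C ~= 6 * N * D FLOPs (a common
# approximation) and divide by Eos's quoted 18.4 ExaFLOP/s.
# `utilization` is a hypothetical knob for real-world efficiency; the
# post's day counts imply something like 75-80% effective utilization
# (or a slightly larger FLOPs-per-token constant).

EOS_FLOPS = 18.4e18      # Eos compute speed quoted above, in FLOP/s
SECONDS_PER_DAY = 86_400

def training_days(params: float, tokens: float, utilization: float = 1.0) -> float:
    flops = 6 * params * tokens                      # approximate training compute
    return flops / (EOS_FLOPS * utilization) / SECONDS_PER_DAY

options = {
    "6T params on ~4T tokens":     (6e12, 4e12),
    "175B params on 18.5T tokens": (175e9, 18.5e12),
    "530B params on 8.5T tokens":  (530e9, 8.5e12),
}
for name, (n, d) in options.items():
    print(f"{name}: ~{training_days(n, d):.0f} days at full utilization")
```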