It’s very interesting to see how well transformers continue to scale, and what that implies.
Here are some stats:
Task                 Megatron-Turing NLG   GPT-3
Lambada (few shot)   0.872                 0.864
PiQA (zero shot)     0.820                 0.805
PiQA (one shot)      0.810                 0.805
PiQA (few shot)      0.832                 0.828
Megatron-Turing NLG performs better, and even if the difference is small, I’ve seen comparisons with smaller models where even small differences of 1% mean there is a noticeable difference in intelligence when using the models for text generation.
“…even small differences of 1% mean there is a noticeable difference in intelligence when using the models for text generation.”
I wish we had better automated metrics for that sort of subjective quality measure. A user study of subjective quality/usefulness would have been good too. That’s not too much to ask of Microsoft, and since they’re presumably aiming to sell access to this and similar models, it would be good for them to provide some indication of how capable the model feels to humans.