Note I’m mainly using this as an opportunity to talk about ideas and compute in NLP.
I don’t know how big an improvement DeBERTaV2 is over SOTA.
DeBERTaV2 is pretty solid and mainly got its performance from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTaV2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa’s high performance isn’t an artifact of SuperGLUE; it also does better on downstream tasks such as some legal NLP tasks.
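For concreteness, the architectural change here is DeBERTa’s disentangled attention: each token gets separate content and relative-position representations, and the attention logits sum content-to-content, content-to-position, and position-to-content terms. Here is a minimal single-head NumPy sketch with toy dimensions; the variable names and the simple clamping are my simplifications, and the real model adds more machinery:

```python
import numpy as np

def disentangled_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """H: (n, d) content states; P: (2k, d) relative-position embeddings;
    rel_idx[i, j]: row of P encoding the clamped distance between i and j."""
    Qc, Kc = H @ Wq_c, H @ Wk_c          # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r          # relative-position projections
    c2c = Qc @ Kc.T                      # content-to-content term
    c2p = np.take_along_axis(Qc @ Kr.T, rel_idx, axis=1)    # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, rel_idx, axis=1).T  # position-to-content
    d = H.shape[1]
    return (c2c + c2p + p2c) / np.sqrt(3 * d)  # paper scales by sqrt(3d), one per term

# Toy usage: n tokens, hidden size d, max relative distance k.
n, d, k = 6, 16, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)]
rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k - 1) + k
scores = disentangled_scores(H, P, *Ws, rel)   # (n, n) attention logits
```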
On NLU tasks, bidirectional models do far better than unidirectional ones. On CommonsenseQA, a good task that’s been around for a few years, the bidirectional models do far better than fine-tuned GPT-3. DeBERTaV3 differs from GPT-3 in three main ideas (roughly the encoding, ELECTRA-style training, and bidirectionality, if I recall correctly), and it’s >400x smaller.
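To be concrete about the uni/bidirectional distinction: at the attention level it comes down to a masking choice. A toy single-head sketch (my own illustration, not either model’s actual code):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Q, K, V: (n, d). causal=True gives GPT-style unidirectional attention;
    causal=False gives BERT/DeBERTa-style bidirectional attention."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    if causal:
        # Unidirectional: token i may only attend to positions j <= i.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    # Bidirectional: no mask, so every token attends to every other token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```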
I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.
For tasks where there is less traction, ideas are even more useful. To use a recent example, “the use of verifiers results in approximately the same performance boost as a 30x model size increase.” I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn’t help much. On new hard tasks such as some math tasks, scaling laws are worse and ideas will be a practical necessity. Not all the first ideas are obvious “low-hanging fruit”; it might take a while for the community to get oriented and find good angles of attack.
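For reference, the quoted verifier result comes from sampling many candidate solutions and keeping the one a trained verifier scores highest. A minimal sketch of that selection loop, where `generate` and `verifier_score` are hypothetical stand-ins for the fine-tuned generator and the trained verifier:

```python
from typing import Callable

def best_of_n(problem: str,
              generate: Callable[[str], str],
              verifier_score: Callable[[str, str], float],
              n: int = 100) -> str:
    """Sample n candidate solutions, return the one the verifier ranks highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))
```

The point is that a fairly simple idea bolted onto a fixed generator bought roughly what a 30x parameter increase would have.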