RE: “like I’m surprised if a clever innovation does more good than spending 4x more compute”
Earlier this year, DeBERTaV2 set the state of the art on SuperGLUE, doing better than models 10x its size.
Models such as DeBERTaV3 can do better on commonsense question answering tasks than models that are tens or hundreds of times larger.
Model               Accuracy   Parameters
DeBERTaV3-large     84.6       0.4B
T5-11B              83.5       11B
Fine-tuned GPT-3    73.0       175B
https://arxiv.org/pdf/2112.03254.pdf#page=5
Bidirectional models + training ideas + better positional encoding helped more than a 4x compute increase would.
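As a back-of-the-envelope check on the scale gap in that table (a rough sketch only, using the parameter counts listed above):

```python
# Parameter ratios for the models in the table above (counts as listed there).
params_billion = {"DeBERTaV3-large": 0.4, "T5-11B": 11, "Fine-tuned GPT-3": 175}

base = params_billion["DeBERTaV3-large"]
for name, n in params_billion.items():
    print(f"{name}: {n / base:.0f}x the parameters of DeBERTaV3-large")
# T5-11B has ~28x and GPT-3 ~438x the parameters of DeBERTaV3-large,
# yet both score lower on this benchmark.
```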
ETA: I’m talking about the comparison to SOTA from a new clever trick. I’m not saying that “the cumulative impact of all clever ideas is <4x”; that would be obviously insane. (I don’t know how big an improvement DeBERTaV2 is over SOTA. But isn’t RoBERTa from August 2019, basically contemporary with SuperGLUE, and doesn’t it get 84.6% accuracy with many fewer parameters than T5? So I don’t think I care at all about the comparison to T5.)
I said I would be surprised in a couple years, not that I would be surprised now.
I’m less surprised by gains on SuperGLUE than I would be by gains in downstream applications.
Much of the reason for the gap seems to be that none of the models you are comparing DeBERTaV2 against was particularly optimized for SuperGLUE performance (in part because it’s a new-ish benchmark that doesn’t track downstream usefulness that well, so it’s not going to be stable until people try on it). (ETA: actually, isn’t this just because you aren’t comparing to SOTA? I think this was probably just a misunderstanding.)
Similarly, I expect people to get gains equivalent to giant increases in model size on many more recent datasets for a while at the beginning (if people try on them), but I think the gains from small projects or single ideas will be small by the time a larger effort has been made.
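For concreteness, here is one way to cash out “gains equivalent to an Nx model-size increase,” assuming a power-law scaling fit of the L(N) = (N_c/N)^α form from Kaplan et al. (2020); this is a minimal sketch and the exponent is only illustrative:

```python
# Convert a loss improvement into an "equivalent model-size increase" under an
# assumed power-law scaling fit L(N) = (N_c / N) ** alpha for loss vs. parameter
# count (alpha ~ 0.076 is roughly the exponent Kaplan et al. 2020 report for
# language modeling; treat it as an illustrative assumption, not a fitted value).

def equivalent_size_multiplier(loss_before, loss_after, alpha=0.076):
    """Model-size multiplier that would, under the assumed power law,
    deliver the same loss improvement as the idea did."""
    # Solve for (N2 / N1) from (N1 / N2) ** alpha = loss_after / loss_before.
    return (loss_before / loss_after) ** (1.0 / alpha)

# Example: an idea that cuts loss by 2% looks like roughly a 1.3x size increase;
# one that cuts loss by 10% looks like roughly a 4x size increase.
print(equivalent_size_multiplier(1.00, 0.98))  # ~1.3
print(equivalent_size_multiplier(1.00, 0.90))  # ~4.0
```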
Note I’m mainly using this as an opportunity to talk about ideas and compute in NLP.
RE: “I don’t know how big an improvement DeBERTaV2 is over SOTA.”

DeBERTaV2 is pretty solid, and it got its performance mainly from an architectural change. Note the DeBERTa paper was initially uploaded in 2020, but it was updated early this year to include DeBERTaV2. The previous main popular SOTA on SuperGLUE was T5 (which beat RoBERTa). DeBERTaV2 uses 8x fewer parameters and 4x less compute than T5. DeBERTa’s high performance isn’t an artifact of SuperGLUE; it also does better on downstream tasks such as some legal NLP tasks.
On NLU tasks, bidirectional models do far better than unidirectional ones. On CommonsenseQA, a good task that’s been around for a few years, they do far better than fine-tuned GPT-3. DeBERTaV3 differs from GPT-3 in roughly three ideas (positional encoding, ELECTRA-style training, and bidirectionality, if I recall correctly), and it’s >400x smaller.
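To make the bidirectionality difference concrete, here is a minimal sketch of the two attention-mask patterns (encoder-style vs. decoder-style); it only illustrates the contrast, not either model’s actual implementation:

```python
import numpy as np

def attention_mask(seq_len, bidirectional):
    """Boolean mask: True where position i may attend to position j."""
    if bidirectional:
        # Encoder-style (BERT/DeBERTa): every token sees the full context.
        return np.ones((seq_len, seq_len), dtype=bool)
    # Decoder-style (GPT): token i only sees positions j <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(attention_mask(4, bidirectional=True).astype(int))
print(attention_mask(4, bidirectional=False).astype(int))
```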
I agree with the overall sentiment that much of the performance is from brute compute, but even in NLP, ideas can help sometimes. For vision/continuous signals, algorithmic advances continue to account for much progress; ideas move the needle substantially more frequently in vision than in NLP.
For tasks where there is less traction, ideas are even more useful. Just to use a recent example, “the use of verifiers results in approximately the same performance boost as a 30x model size increase.” I think the initially proposed heuristic depends on how much progress has already been made on a task. For nearly solved tasks, the next incremental idea shouldn’t help much. On new, hard tasks such as some maths tasks, scaling laws are worse and ideas will be a practical necessity. Not all the first ideas are obvious “low-hanging fruit,” because it might take a while for the community to get oriented and find good angles of attack.
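For reference, the verifier result quoted above is, as I understand it, a best-of-k reranking scheme: sample many candidate solutions and let a trained verifier pick the one it rates highest. A minimal sketch, where `sample_solution` and `verifier_score` are hypothetical stand-ins for a generator’s sampler and a trained verifier’s scoring function:

```python
# Best-of-k reranking with a learned verifier (conceptual sketch only).
import random
from typing import Callable, List

def solve_with_verifier(problem: str,
                        sample_solution: Callable[[str], str],
                        verifier_score: Callable[[str, str], float],
                        k: int = 100) -> str:
    """Sample k candidate solutions and return the one the verifier rates highest."""
    candidates: List[str] = [sample_solution(problem) for _ in range(k)]
    return max(candidates, key=lambda sol: verifier_score(problem, sol))

# Toy usage with stub functions, just to show the control flow.
if __name__ == "__main__":
    stub_sampler = lambda problem: f"answer={random.randint(0, 9)}"
    stub_verifier = lambda problem, sol: float(sol.endswith("=7"))  # pretend 7 is correct
    print(solve_with_verifier("2+5=?", stub_sampler, stub_verifier, k=20))
```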