I’m not sure your results really support the interpretation that davinci “transfers less well”. Notably, going from 50% accuracy to 100% is often a lot harder than going from 0%/whatever random chance is on your datasets to 50% (I haven’t looked through your code yet to examine the datasets). I’d also predict that davinci already does pretty well zero-shot (with no finetuning) on most of the tasks you consider here, which limits how much it can improve from finetuning, since you can’t get above 100% accuracy.
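To make the ceiling-effect point concrete, here’s a minimal sketch (my own illustration with made-up numbers, not taken from your experiments) comparing raw accuracy gains with the same gains measured in log-odds, where improvements near the ceiling look much larger:

```python
import math

def log_odds(p):
    """Log-odds of an accuracy p, with p strictly between 0 and 1."""
    return math.log(p / (1 - p))

# Hypothetical numbers: a small model going from chance (50%) to 75%,
# vs. a large model going from an already-strong 90% zero-shot to 97%.
small_before, small_after = 0.50, 0.75
large_before, large_after = 0.90, 0.97

print("small model: +%.2f accuracy, +%.2f log-odds"
      % (small_after - small_before, log_odds(small_after) - log_odds(small_before)))
print("large model: +%.2f accuracy, +%.2f log-odds"
      % (large_after - large_before, log_odds(large_after) - log_odds(large_before)))
# The large model's +0.07 raw-accuracy gain is the bigger jump in log-odds
# (~1.28 vs ~1.10), so raw accuracy deltas can understate transfer for a
# model that starts near the ceiling.
```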
In addition, larger LMs are often significantly more data efficient, so you’d predict that they need less total finetuning to do well on tasks (and therefore the additional finetuning on related tasks would benefit the larger models less).