One sanity check here would be to just make some 128k ctx window calls full of examples; if you cannot k-shot this capability even with that, then you shouldn’t expect “finetuning” to either; while if it works, that implies the “finetuning” is much worse than it ought to be and so the original results are uninformative.
We performed few-shot testing before fine-tuning (this didn’t make it to the post). I reran some experiments on the permutation iteration problem and got results similar to before: for one function (and n = 6), the model got ~60% accuracy for k = 2, but not great[1] accuracy for k = 3. For two functions, it already failed at the f(x) + g(y) problem.
(This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.)
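For concreteness, here’s a minimal sketch of how a few-shot prompt for the permutation iteration task could be assembled (the explicit function table, wording, and layout are illustrative assumptions on my part, not necessarily the exact format we used):

```python
import random

# Hypothetical sketch of a k-shot prompt for the permutation iteration task:
# show a random permutation f on n = 6 elements plus many solved examples of
# f^k(x), then ask a held-out query. Format details here are assumptions.

def build_prompt(n=6, k=2, num_shots=50, seed=0):
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)

    def f_iter(x, times):
        for _ in range(times):
            x = perm[x]
        return x

    lines = ["f is the permutation: " + ", ".join(f"f({i}) = {perm[i]}" for i in range(n))]
    for _ in range(num_shots):
        x = rng.randrange(n)
        lines.append(f"Q: What is f^{k}({x})? A: {f_iter(x, k)}")

    query = rng.randrange(n)
    lines.append(f"Q: What is f^{k}({query})? A:")
    return "\n".join(lines), f_iter(query, k)

prompt, expected = build_prompt()
print(prompt.splitlines()[-1], "| expected answer:", expected)
```

At n = 6 each shot is only a handful of tokens, so 50 shots fit comfortably within the 16k-token limit.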
So fine-tuning really does give considerably better capabilities than simply many-shot prompting.
Let me clarify that with fine-tuning, our intent wasn’t so much to create or teach the model new capabilities as to elicit the capabilities the model already has. (Cf. Hubinger’s When can we trust model evaluations?, section 3.) I admit that it’s not clear where to draw the line between teaching and eliciting, though.
Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I’d take the results as evidence that GPT-3.5 doesn’t do much parallel reasoning out of the box (or even with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I’m still confused, though.
I tried to run full fine-tuning experiments on open-source models, but it does, in fact, require much more memory. I don’t currently have the multi-GPU setup required to fully fine-tune even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I’m backing down; if someone else is able to do proper tests here, go ahead.
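To illustrate why full fine-tuning doesn’t fit, here’s a back-of-the-envelope sketch (assuming Adam with mixed precision at roughly 16 bytes of weight/gradient/optimizer state per parameter, ignoring activations; actual usage depends on the optimizer, precision, and checkpointing setup):

```python
# Rough rule of thumb: full fine-tuning with Adam in mixed precision keeps
# ~16 bytes per parameter (2 B bf16 weights + 2 B bf16 grads + 12 B fp32
# master weights and Adam moments), before counting activations.

def full_finetune_state_gb(num_params, bytes_per_param=16):
    return num_params * bytes_per_param / 1e9

for name, n_params in [("Pythia-1.4B", 1.4e9), ("7B model", 7e9)]:
    print(f"{name}: ~{full_finetune_state_gb(n_params):.0f} GB before activations")
# Pythia-1.4B: ~22 GB (fits on one A100); 7B: ~112 GB (doesn't, even at 80 GB).
```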
Note that while you can get 1/6 accuracy trivially, you can get 1/5 if you realize that the data is filtered so that f^k(x) ≠ x, and 1/4 if you also realize that f^k(x) ≠ f(x) (and are able to compute f(x)), …
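For the curious, here’s a quick Monte Carlo sketch of these baselines (assuming n = 6, k = 2, and that the only dataset filter is f^k(x) ≠ x; for a permutation and k = 2, f^2(x) ≠ x already implies f^2(x) ≠ f(x), so no extra filter is needed in the sketch):

```python
import random

def sample(n, k, rng):
    # Draw a random permutation f on n elements and an input x,
    # rejecting cases where f^k(x) == x (the dataset filter).
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        x = rng.randrange(n)
        y = x
        for _ in range(k):
            y = perm[y]
        if y != x:
            return x, perm[x], y

def accuracy(guess, n=6, k=2, trials=200_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, fx, y = sample(n, k, rng)
        hits += guess(x, fx, n, rng) == y
    return hits / trials

strategies = {
    "uniform over all n values": lambda x, fx, n, r: r.randrange(n),
    "exclude x":                 lambda x, fx, n, r: r.choice([v for v in range(n) if v != x]),
    "exclude x and f(x)":        lambda x, fx, n, r: r.choice([v for v in range(n) if v not in (x, fx)]),
}
for name, s in strategies.items():
    print(f"{name}: {accuracy(s):.3f}")   # roughly 0.167, 0.200, 0.250
```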
I think I’m going to message you a suggestion.