Thanks for the insightful comment!
I hadn’t made the connection to knowledge distillation, and the data multiplexing paper (which I wasn’t aware of) is definitely relevant, thanks. I agree that our results seem very odd in this light.
It is certainly big news if OA fine-tuning doesn’t work as it’s supposed to. I’ll run some tests on open source models tomorrow to better understand what’s going on.
The docs are pretty vague, but I notice that most of the use cases they describe are framed around declarative sorts of knowledge. It’s positioned as a way to reduce the number of examples in the prompt (to save tokens & reduce latency), or to include additional factual knowledge, like defining edge cases. There is one brief mention that you may be able to use it for “Performing a new skill or task that’s hard to articulate in a prompt”, but that’s about it.
And when it comes to lightweight finetuning methods such as LoRA, people tend to notice that they are good for adding new factual knowledge or increasing the prior on specific pre-existing knowledge, but don’t really add qualitatively new capabilities: you cannot simply LoRA your way to better hands in an image generator, or teach it 3D generation if it didn’t already know that. So I’ve long been suspicious that OA isn’t doing real finetuning of the entire model, but rather much cheaper, underperforming, LoRA-like lightweight finetuning (of the sort which can easily be stored on-GPU, rather than loading an entire finetuned model or its delta from cloud storage, or tying up entire sets of GPUs to keep a full finetuned model hot).
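For concreteness, a minimal sketch of what a LoRA-style update looks like (illustrative only, not a claim about what OA actually runs): the original weight matrix stays frozen and only a small low-rank correction is trained, so the per-customer state to store and hot-swap is tiny compared to a full copy of the weights.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # original weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # frozen full-rank path + learned low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable params vs ~16.8M frozen ones in the base layer
```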
One sanity check here would be to just make some 128k ctx window calls full of examples; if you cannot k-shot this capability even with that, then you shouldn’t expect “finetuning” to work either; while if it does work, that implies the “finetuning” is much worse than it ought to be, and so the original results are uninformative.
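A hedged sketch of how that sanity check might look with the openai Python client (the model name and prompt format here are placeholders, not the setup from the post):

```python
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()

def many_shot_query(examples, query, model="gpt-4o"):  # placeholder long-context model
    """Pack many solved examples into one prompt, then ask for one more answer."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    prompt = f"{shots}\nInput: {query}\nOutput:"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

The interesting comparison is then the many-shot accuracy versus the fine-tuned accuracy on the same task.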
To me, the strongest evidence that fine-tuning is based on LoRA or similar is the fact that pricing is based just on training and input/output tokens and doesn’t factor in the cost of storing your fine-tuned models. Llama-3-8b-instruct is ~16GB (I’d expect the models here to be roughly comparable, at least in the same ballpark). You’d almost surely care if you were storing that much data for each fine-tune.
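The back-of-the-envelope numbers behind that (all figures are rough assumptions, not OA’s actual setup):

```python
# Rough per-fine-tune storage, assuming 16-bit weights (illustrative numbers only)
full_params = 8e9                     # a Llama-3-8B-class model
bytes_per_param = 2                   # bf16 / fp16
full_copy_gb = full_params * bytes_per_param / 1e9       # ~16 GB per full fine-tune

adapter_params = 50e6                 # a LoRA-style adapter at modest rank
adapter_gb = adapter_params * bytes_per_param / 1e9      # ~0.1 GB per fine-tune

print(f"full copy: ~{full_copy_gb:.0f} GB   adapter: ~{adapter_gb:.2f} GB")
```

Storing and serving many ~0.1 GB adapters is a very different problem from storing and serving many ~16 GB model copies, which is what makes the pricing structure suggestive.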
Yeah, that’s part of why I’m suspicious. I remember the original OA finetuning as being quite expensive, but the current one is not that expensive. If a GPT-3 is like 100GB of weights, say, after optimization, and it’s doing true finetuning, how is OA making it so cheap and so low-latency?
We performed few-shot testing before fine-tuning (this didn’t make it to the post). I reran some experiments on the permutation iteration problem, and got similar results as before: for one function (and n = 6), the model got ~60% accuracy for k=2, but not great[1] accuracy for k=3. For two functions, it already failed at the f(x)+g(y) problem.
(This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.)
So fine-tuning really does give considerably better capabilities than simply many-shot prompting.
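For reference, a minimal sketch of how few-shot examples for a permutation-iteration task of this shape can be generated (the exact prompt format used in the experiments may have differed; this is just to make the task concrete):

```python
import random

def make_examples(n=6, k=2, num_examples=50, seed=0):
    """Sample (x, f^k(x)) pairs for a random permutation f on {0, ..., n-1},
    keeping only pairs with f^k(x) != x (matching the filtering in footnote [1])."""
    rng = random.Random(seed)
    while True:
        f = list(range(n))
        rng.shuffle(f)                                   # a random permutation f
        def iterate(x):
            y = x
            for _ in range(k):                           # y = f^k(x)
                y = f[y]
            return y
        valid = [x for x in range(n) if iterate(x) != x]
        if valid:                                        # redraw f if f^k is the identity
            break
    examples = [(x, iterate(x)) for x in rng.choices(valid, k=num_examples)]
    return f, examples

f, shots = make_examples()
prompt = "\n".join(f"f^2({x}) = {y}" for x, y in shots)
```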
Let me clarify that with fine-tuning, our intent wasn’t so much to create or teach the model new capabilities, but to elicit the capabilities the model already has. (Cf. Hubinger’s When can we trust model evaluations?, section 3.) I admit that it’s not clear where to draw the line between teaching and eliciting, though.
Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I’d take the results as evidence for GPT-3.5 not doing much parallel reasoning off-the-shelf (e.g. with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I’m still confused, though.
I tried running full fine-tuning experiments on open-source models, but it does, in fact, require much more RAM. I don’t currently have the multi-GPU setups required to do full fine-tuning on even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I’m backing down; if someone else is able to do proper tests here, go ahead.
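For a sense of the memory wall here, the usual rule-of-thumb estimate for full fine-tuning with Adam in mixed precision (these per-parameter byte counts are standard approximations, not measurements from my runs):

```python
def full_finetune_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Rule of thumb: bf16 weights + bf16 grads + fp32 master weights and two Adam
    moments (~12 B/param). Activation memory comes on top of this."""
    return n_params * (bytes_weights + bytes_grads + bytes_optimizer) / 1e9

print(full_finetune_memory_gb(7e9))    # ~112 GB: well over a single 80 GB A100
print(full_finetune_memory_gb(1.4e9))  # ~22 GB before activations
```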
[1] Note that while you can get 1⁄6 accuracy trivially, you can get 1⁄5 if you realize that the data is filtered so that f^k(x) ≠ x, and 1⁄4 if you also realize that f^k(x) ≠ f(x) (and are able to compute f(x)), …
Going to message you a suggestion, I think.