Thanks for the insightful comment!
I hadn’t made the connection to knowledge distillation, and the data multiplexing paper (which I wasn’t aware of) is definitely relevant, thanks. I agree that our results seem very odd in this light.
It is certainly big news if OA fine-tuning doesn’t work as it’s supposed to. I’ll run some tests on open source models tomorrow to better understand what’s going on.
We performed few-shot testing before fine-tuning (this didn’t make it into the post). I reran some experiments on the permutation iteration problem and got similar results to before: for one function (and n = 6), the model got ~60% accuracy for k = 2, but not great[1] accuracy for k = 3. For two functions, it already failed at the f(x) + g(y) problem.
(This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.)
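For concreteness, here is a minimal sketch of the few-shot setup. The prompt wording below is illustrative only (I’m not reproducing the exact format we sent to gpt-3.5-turbo-0125); the task is to infer the k-fold iterate f^k(x) of a hidden random permutation f purely from examples:

```python
import random

random.seed(0)
n = 6  # domain size used in the experiment

# A random permutation f of {0, ..., n-1}.
f = list(range(n))
random.shuffle(f)

def iterate(f, x, k):
    """Apply f to x, k times: returns f^k(x)."""
    for _ in range(k):
        x = f[x]
    return x

def make_prompt(f, k, num_examples, query):
    # Hypothetical prompt format -- the real experiment's exact
    # wording may differ.
    lines = []
    for _ in range(num_examples):
        x = random.randrange(len(f))
        lines.append(f"f^{k}({x}) = {iterate(f, x, k)}")
    lines.append(f"f^{k}({query}) =")
    return "\n".join(lines)

print(make_prompt(f, k=2, num_examples=5, query=3))
```

The model only ever sees input/output pairs like these; computing f^k(x) for a new x requires it to internally reconstruct f and compose it with itself.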
So fine-tuning really does give considerably better capabilities than simply many-shot prompting.
Let me clarify that with fine-tuning, our intent wasn’t so much to create or teach the model new capabilities, but to elicit the capabilities the model already has. (Cf. Hubinger’s When can we trust model evaluations?, section 3.) I admit that it’s not clear where to draw the line between teaching and eliciting, though.
Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I’d take the results as evidence for GPT-3.5 not doing much parallel reasoning off-the-shelf (e.g. with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I’m still confused, though.
I tried to run full fine-tuning experiments on open source models, but it does, in fact, require much more GPU memory. I don’t currently have the multi-GPU setup required to fully fine-tune even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I’m backing down; if someone else is able to do proper tests here, go ahead.
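For reference, a back-of-envelope estimate of why full fine-tuning blows past a single A100: with Adam in mixed precision, a common rule of thumb is roughly 16 bytes per parameter (fp16 weights and gradients, fp32 master weights, and two fp32 optimizer moments), before counting activations:

```python
def full_finetune_gib(params, bytes_per_param=16):
    """Rough memory for weights + gradients + Adam state under the
    common ~16 bytes/param mixed-precision rule of thumb
    (activations and framework overhead not included)."""
    return params * bytes_per_param / 2**30

print(f"{full_finetune_gib(7e9):.0f} GiB")    # -> 104 GiB, over one 80 GiB A100
print(f"{full_finetune_gib(1.4e9):.0f} GiB")  # -> 21 GiB, fits on one A100
```

This is only a sketch; techniques like gradient checkpointing, 8-bit optimizers, or ZeRO-style sharding change the picture, but plain full fine-tuning of a 7B model on one 80 GiB GPU is indeed tight.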
[1] Note that while you can get 1/6 accuracy trivially, you can get 1/5 if you realize that the data is filtered so that f^k(x) ≠ x, and 1/4 if you also realize that f^k(x) ≠ f(x) (and are able to compute f(x)), …
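These guessing baselines are easy to check by simulation. The sketch below uses k = 2, where (since f is a permutation) the filter f^2(x) ≠ x already guarantees f^2(x) ≠ f(x), so excluding both x and f(x) is always valid:

```python
import itertools
import random

random.seed(0)
n, k = 6, 2
perms = list(itertools.permutations(range(n)))

hits_uniform = hits_informed = trials = 0
for _ in range(20000):
    f = random.choice(perms)
    x = random.randrange(n)
    y = x
    for _ in range(k):
        y = f[y]          # y = f^k(x)
    if y == x:            # data is filtered so that f^k(x) != x
        continue
    trials += 1
    hits_uniform += random.randrange(n) == y
    # Informed guesser: exclude x and f(x). For a permutation and
    # k = 2, the filter guarantees the answer is never either one.
    candidates = [c for c in range(n) if c != x and c != f[x]]
    hits_informed += random.choice(candidates) == y

print(hits_uniform / trials)   # close to 1/6
print(hits_informed / trials)  # close to 1/4
```

For k = 3 the f^k(x) ≠ f(x) exclusion needs the extra observation mentioned above, since f^3(x) = f(x) is possible even when f^3(x) ≠ x.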