That being said, I agree with Fabien that the title is a bit overstated, insofar as it’s about your results in particular:
Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel.
It’s a general fact of ML that small changes in finetuning setup can greatly affect performance if you’re not careful. In particular, it seems likely to me that the empirical details that Fabien asks for may affect your results. But this has little to do with formatting, and much more to do with the intrinsic difficulty of finetuning LLMs properly.
As shown in Fabien’s password experiments, there are many ways to mess up finetuning (including by having a bad seed), and different finetuning techniques are likely to lead to different levels of performance. (And the problem gets worse as you start using RL and not just SFT.) So it’s worth being very careful about claiming that the results of any particular finetuning run upper bound model capabilities. But it’s still plausible that trying very hard on finetuning elicits capabilities more efficiently than trying very hard on prompting, for example, which I think is closer to what people mean when they say that finetuning is an upper bound on model capabilities.
Very cool work; I’m glad it was done.