Another reason why all this is relevant: we know that fine-tuning GPT-3.5 can produce drastic boosts in narrow domains, and some of us (myself included) expected the same from fine-tuning GPT-4, namely being able to reach the performance of a non-existent GPT-4.5 (or GPT-5) in narrow domains.
But that’s not what has happened. Instead, OpenAI has communicated that
Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning.
and, moreover, that
We’re creating an experimental access program for GPT-4 fine-tuning. Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning. As quality and safety for GPT-4 fine-tuning improves, developers actively using GPT-3.5 fine-tuning will be presented with an option to apply to the GPT-4 program within their fine-tuning console.
It is very important to understand the mysterious base-GPT-4 better, both in the context of the potential benefits and hazards of GPT-4 fine-tuning and in the context of these newly emerged difficulties in fine-tuning it as fruitfully as GPT-3.5.
I’m not sure finetuning GPT-3 is all that different or those difficulties ‘newly emerged’.
As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn’t come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with the finetune doing shenanigans behind the scenes—OA declined to talk about what the ‘finetuning’ even was, and the general assumption seems to be that they were doing some sort of cheap lightweight-finetune or hack and not a true finetune.
(These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation—why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels of performance in narrow domains.
That’s why our expectations were high.
I am sure they do something relatively lightweight, like LoRA (https://arxiv.org/abs/2106.09685), which is what most people tend to be using (I think).
And, of course, if one believes the rumors that GPT-4 is very different from a conventional GPT-3-like Transformer, difficulties could easily have emerged when trying to apply a LoRA-like method to it.
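For concreteness, here is a minimal sketch of the LoRA idea in PyTorch (the class and the numbers are purely illustrative and say nothing about what OpenAI actually runs): the pretrained weight matrix is frozen, and only a small low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # the pretrained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))        # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Typical usage: wrap the attention/MLP projections of a frozen model and train
# only the tiny A and B matrices.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters instead of ~16.8M in the base layer
```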
But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels of performance in narrow domains.
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very… meh. No one seemed terribly happy with it.
That’s my point: it seems like the first attempts did not go well for GPT-3. So, it’s not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn’t require “more work” and Just Works™. (One does hope it wouldn’t take that long the second time around.)
OA does have a new finetuning service for GPT-4o, and people seem to be happier with it, but OA has also apparently confirmed that it’s a LoRA (as I speculated: a cheap shallow hack rather than true finetuning): https://x.com/CFGeek/status/1826749739502895618 https://www.youtube.com/watch?v=X57GT1Y5URY&t=2479s
It also is doing shenanigans behind the scenes like trying to dynamically guess a size but apparently hiding that from you if you aren’t a favored customer: https://x.com/CFGeek/status/1826749748549988800
So, I continue to maintain that OA “finetuning” is unfit for research* and for any purposes that involve deep transformation of the model rather than ‘locating’ an existing capability. Especially now that Llama-3-405b has been released and you can finetune that yourself and be sure that it genuinely is finetuning rather than a pinchbeck substitute.
* ie. it can be OK if you have an extremely specific claim like ‘the OA blackbox finetuning service does or does not do X’; but it is totally illegitimate to argue ‘GPT-4 cannot do X as proven by our OA-finetuned version still not doing X’, which is the usual way it comes up in DL research. At best, it is a loose lower bound, and should be treated no more seriously than lazy garbage arguments like ‘we tried a few prompts and X didn’t work, therefore, LLMs will never do X’.
Thanks, that’s very useful to know!
It’s still not trivial to finetune Llama-3-405B: full finetuning with Adam needs roughly 16 bytes/parameter plus activation memory, so a minimum of ~100 H100s.
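To spell out that arithmetic (a back-of-the-envelope estimate, assuming the standard mixed-precision Adam accounting of 16 bytes/parameter and 80 GB of HBM per H100):

```python
# Rough memory for full finetuning of a 405B-parameter model with Adam.
# Standard mixed-precision accounting: 2 B weights + 2 B grads + 4 B fp32 master
# weights + 4 B + 4 B Adam moments = 16 bytes/parameter, before any activations.
params = 405e9
bytes_per_param = 16
state_bytes = params * bytes_per_param      # ~6.5e12 bytes, i.e. ~6.5 TB
h100_bytes = 80e9                           # 80 GB of HBM per H100
gpus_for_state = state_bytes / h100_bytes   # ~81 GPUs just for weights + optimizer state
print(f"{state_bytes / 1e12:.1f} TB of state -> at least {gpus_for_state:.0f} H100s "
      f"before activations, hence the ~100 H100 floor")
```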
There are lots of people working on that and either already offering it or about to. And even when they aren’t offering true finetuning, it’s still better: Snowflake (first hit in Google for “Llama 405B finetuning”), for example, makes no bones about its single-node lightweight finetuning being a LoRA, and is open-sourcing the code upfront, so at least you know what it is now, instead of depending on borderline gossip buried 40 minutes into a YouTube video months or years later.
What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it’s a mixture-of-experts. That is already quite different from a single dense Transformer.
We presume that these experts are mostly built from various components of a Transformer (with possible additions and modifications we don’t know about), but we don’t know how independent those experts are: whether they share a sizeable common initial computation and then branch off from it, or whether it is something else entirely, with some kind of dynamic sparse routing through a single network, and so on. I think it’s unlikely to be “just take a bunch of GPT-3’s, run an appropriate subset of them in parallel, and combine the results”.
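None of this tells us what OpenAI actually built, but to make the “dynamic sparse routing through a single network” idea concrete, here is a toy sketch (PyTorch; all sizes and names are illustrative) of a token-level top-k routed MoE feed-forward layer of the kind described in the open literature:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a learned router scores the experts for each token,
    the top-k experts are selected, and their FFN outputs are mixed using the
    renormalized router weights."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq, d_model)
        scores = self.router(x)                             # (batch, seq, n_experts)
        top_scores, top_idx = scores.topk(self.k, dim=-1)   # k chosen experts per token
        weights = F.softmax(top_scores, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        # Dense reference implementation for clarity: every expert runs on every token
        # and is masked out where unused; real systems dispatch tokens sparsely.
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)                                      # (batch, seq, d_model)
            for slot in range(self.k):
                mask = (top_idx[..., slot] == e).unsqueeze(-1)          # tokens routed to expert e
                out = out + mask * weights[..., slot].unsqueeze(-1) * expert_out
        return out

moe = TopKMoE()
y = moe(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512]); only k of the 8 expert FFNs "count" per token
```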
There is a huge diversity of techniques combining MoE motifs with the motifs associated with Transformers; see, e.g., this collection of references: https://github.com/XueFuzhao/awesome-mixture-of-experts
So, we really don’t know; these rumors are only enough to support some partial guesses.
If we survive for a while, all this will eventually become public knowledge, and we’ll probably come to understand how the magic of GPT-4 is possible.