Yeah, OpenAI has communicated very poorly and this has led to a lot of confusion. I’m trying to use the terminology more consistently: if I mean RL training or some sort of non-differentiable loss, I try to say ‘RL’, and ‘finetuning’ just means what it usually means—supervised or self-supervised training using gradient descent on a dataset. Because they have different results in both theory & practice.
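To make the distinction concrete, here is a minimal toy sketch (my own illustration, not anything from OpenAI or MS) of the two kinds of updates: a supervised finetuning step, where gradient descent runs directly on a differentiable loss over a fixed dataset, versus an RL-style REINFORCE step, where the reward is an external, non-differentiable signal and the gradient instead flows through the log-probabilities of sampled outputs. The toy model, dataset, and reward below are all placeholders.

```python
# Toy contrast between supervised finetuning and an RL (REINFORCE) update.
# Not anyone's actual pipeline; purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

tokens = torch.randint(0, vocab_size, (16,))   # stand-in "dataset": one short token sequence
inputs, targets = tokens[:-1], tokens[1:]

# (1) Supervised / self-supervised finetuning: a differentiable loss on a fixed dataset,
# optimized by ordinary gradient descent.
logits = model(inputs)
sft_loss = F.cross_entropy(logits, targets)
opt.zero_grad(); sft_loss.backward(); opt.step()

# (2) RL-style update (REINFORCE): the reward is a black-box, non-differentiable signal
# (e.g. a human preference score), so the gradient goes through the log-probability of
# the sampled tokens, weighted by that reward -- a different objective with different dynamics.
logits = model(inputs)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                        # the model "chooses" tokens
reward = (actions == targets).float()          # placeholder for an external reward signal
rl_loss = -(dist.log_prob(actions) * reward).mean()
opt.zero_grad(); rl_loss.backward(); opt.step()
```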
Sure, but MS is probably not using a research project from Anthropic published half a month after ChatGPT launched. If it were solely prompt engineering, maybe, because that’s so easy and fast—but not the RL part too. (The first lesson of using DRL is “don’t.”)
See my other comment. The prompt leaks are highly questionable. I don’t believe anything in them which can’t be confirmed outside of Sydney’s hallucinations.
Also, I don’t particularly see why GPT-4 would be expected to be much more up to date. After all, by Nadella’s account, they had ‘Prometheus’ way back in summer 2022, so it had to be trained earlier than that, so the dataset had to be collected & finalized earlier than that, so a 2021 cutoff isn’t too implausible, especially if you are counting on retrieval to keep the model up to date.
Yes, this is possible. While MS has all the money in the world and has already blown tens of billions of dollars making the also-ran Bing and is willing to blow billions more if it can gain market share at Google’s expense, they still might want to economize on cost (or, perhaps more accurately, on how many users they can support with their finite supply of datacenter GPUs) and do so by using a cheaper model.
This might account for why the Sydney model seems smarter than GPT-3 models but not as huge of a leap as rumors have been making GPT-4 out to be: ‘Prometheus’ is the babbage or curie of GPT-4 rather than the davinci. (On the other hand, I would take the fact that Pichai is explicitly trying to squeeze pennies as motivation and evidence for Nadella doing the exact opposite.)
It seems to me like “fine-tuning” usually just means a small amount of extra training on top of a model that’s already been trained, whether that’s supervised, autoregressive, RL, or whatever. I don’t find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it’s often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I’d be open to changing my usage if I saw some data on other people also using or interpreting “fine-tuning” to mean “fine-tuning with a differentiable objective.” I talk a fair amount with people who use fine-tuning in the broader sense, and haven’t noticed practitioners using it more narrowly / didn’t realize this might cause confusion.