This is a very interesting theory.
Small note regarding terminology: both supervised learning on dialog/instruction data and reinforcement learning (from human feedback) are referred to as fine-tuning the base model, at least by OpenAI, where both techniques were used to create ChatGPT.
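To make that terminology point concrete, here is a toy sketch (made-up two-layer model, placeholder data, nothing resembling OpenAI's actual setup) of the two procedures that both get called fine-tuning: each is a gradient-descent update to the same pretrained weights, and only the objective differs.

```python
# Toy contrast of the two things called "fine-tuning": a supervised cross-entropy
# update vs. an RL (REINFORCE-style) update. Model and data are placeholders.
import torch
import torch.nn.functional as F

vocab, dim = 100, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def sft_step(prompt_tok, target_tok):
    # Supervised fine-tuning: ordinary gradient descent on a differentiable loss.
    loss = F.cross_entropy(model(prompt_tok), target_tok)
    opt.zero_grad(); loss.backward(); opt.step()

def rl_step(prompt_tok, reward_fn):
    # RL fine-tuning, bare-bones REINFORCE (real RLHF adds a reward model, PPO, a KL penalty):
    # push up the log-probability of sampled tokens in proportion to a scalar reward.
    dist = torch.distributions.Categorical(logits=model(prompt_tok))
    action = dist.sample()
    loss = -(reward_fn(action) * dist.log_prob(action)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Placeholder usage: reward is +1 if the sampled token id is even, -1 otherwise.
sft_step(torch.tensor([3, 5]), torch.tensor([7, 9]))
rl_step(torch.tensor([3, 5]), lambda a: (a % 2 == 0).float() * 2 - 1)
```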
Another note: RLHF does indeed require very large amounts of feedback data, which Microsoft may not have licensed from OpenAI. But RLHF is not strictly necessary anyway, as Anthropic showed in their Constitutional AI (Claude) paper, which used supervised learning on human dialog data, like ChatGPT, but replaced RLHF with fully automatic “RLAIF”. In OpenAI’s most recent blog post they mention both Constitutional AI and DeepMind’s Sparrow paper (which was the first to introduce the ability to provide sources, something Sydney is also able to do).
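For intuition, here is a heavily simplified sketch of the Constitutional AI / RLAIF recipe; the llm function is a hypothetical stand-in for any chat-model call, the constitution is cut down to a single principle, and the later step of training a preference model on the AI labels and running RL against it is omitted.

```python
# Minimal sketch of the Constitutional AI idea, not Anthropic's code.
# `llm` is a hypothetical stand-in for a chat-model call; replace with a real API.
def llm(prompt: str) -> str:
    return "<model response to: " + prompt[:40] + "...>"  # placeholder

CONSTITUTION = ["Choose the response that is most helpful, honest, and harmless."]

def critique_and_revise(user_prompt: str) -> str:
    # Phase 1 (supervised): the model critiques and revises its own draft against a
    # constitutional principle; the revised answers become supervised training data.
    draft = llm(user_prompt)
    critique = llm(f"Critique this reply using the principle: {CONSTITUTION[0]}\n\nReply: {draft}")
    return llm(f"Rewrite the reply to address the critique.\n\nReply: {draft}\nCritique: {critique}")

def ai_preference_label(user_prompt: str, reply_a: str, reply_b: str) -> str:
    # Phase 2 (RLAIF): the AI itself, not a human, labels which reply is better;
    # these labels train the preference model that replaces human feedback in RL.
    return llm(f"Principle: {CONSTITUTION[0]}\nPrompt: {user_prompt}\n"
               f"(A) {reply_a}\n(B) {reply_b}\nWhich reply is better, A or B?")
```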
The theory that Sydney uses some GPT-4 model sounds interesting. But the Sydney prompt document, which was reported by several users, mentions a knowledge cutoff in 2021, same as ChatGPT, which uses GPT-3.5. For GPT-4 we would probably expect a knowledge cutoff in 2022. So GPT-3.5 seems more likely?
Whatever base model Bing chat uses, it may not be the largest one OpenAI has available. (Which could potentially explain some of Sydney’s not-so-smart responses?) The reason is that larger models have higher inference cost, limiting ROI for search companies. It doesn’t make sense to spend more money on search than you expect to get back via ads. Quote from Sundar Pichai:

We’re releasing [Bard] initially with our lightweight model version of LaMDA. This much smaller model requires significantly less computing power, enabling us to scale to more users, allowing for more feedback.
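To see the unit economics, a back-of-the-envelope sketch; every number below is an invented placeholder, since neither the real inference costs nor Bing's ad revenue per query are public.

```python
# Back-of-the-envelope check of "don't spend more per query than ads earn back".
# All figures are illustrative placeholders, not real Microsoft/OpenAI numbers.
def chat_cost_per_query(tokens_per_answer: int, dollars_per_1k_tokens: float) -> float:
    return tokens_per_answer / 1000 * dollars_per_1k_tokens

revenue_per_query = 0.03                         # hypothetical ad revenue per search, dollars
small_model = chat_cost_per_query(500, 0.002)    # hypothetical cheaper, smaller model
large_model = chat_cost_per_query(500, 0.06)     # hypothetical pricier, larger model

for name, cost in [("small", small_model), ("large", large_model)]:
    print(f"{name}: ${cost:.4f}/query, profitable: {cost < revenue_per_query}")
```

Under these made-up numbers the small model keeps almost the whole margin while the large one eats it entirely, which is the shape of the argument for serving a cheaper model.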
Yeah, OpenAI has communicated very poorly and this has led to a lot of confusion. I’m trying to use the terminology more consistently: if I mean RL training or some sort of non-differentiable loss, I try to say ‘RL’, and ‘finetuning’ just means what it usually means—supervised or self-supervised training using gradient descent on a dataset. Because they have different results in both theory & practice.
Sure, but MS is probably not using a research project from Anthropic published half a month after ChatGPT launched. If it was solely prompt engineering, maybe, because that’s so easy and fast—but not the RL part too. (The first lesson of using DRL is “don’t.”)
See my other comment. The prompt leaks are highly questionable. I don’t believe anything in them which can’t be confirmed outside of Sydney hallucinations.
Also, I don’t particularly see why GPT-4 would be expected to be much more up to date. After all, by Nadella’s account, they had ‘Prometheus’ way back in summer 2022, so it had to be trained earlier than that, so the dataset had to be collected & finalized earlier than that, so a 2021 cutoff isn’t too implausible, especially if you are counting on retrieval to keep the model up to date.
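A rough sketch of what "counting on retrieval to keep the model up to date" means in practice; web_search and llm below are hypothetical stand-ins, not Bing's actual pipeline.

```python
# Sketch of retrieval-augmented prompting: the weights stay frozen at their training
# cutoff, and freshness comes from search results injected into the prompt at query time.
# `web_search` and `llm` are hypothetical stand-ins, not Bing's real interfaces.
def web_search(query: str, k: int = 3) -> list[str]:
    return [f"<snippet {i} for '{query}'>" for i in range(k)]   # placeholder results

def llm(prompt: str) -> str:
    return "<answer grounded in the snippets above>"            # placeholder model call

def answer_with_retrieval(user_query: str) -> str:
    snippets = web_search(user_query)
    context = "\n".join(f"[{i+1}] {s}" for i, s in enumerate(snippets))
    prompt = (f"Search results:\n{context}\n\n"
              f"Answer the question using the results and cite them as [n]:\n{user_query}")
    return llm(prompt)
```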
Yes, this is possible. While MS has all the money in the world and has already blown tens of billions of dollars making the also-ran Bing and is willing to blow billions more if it can gain market share at Google’s expense, they still might want to economize on cost (or perhaps more accurately, how many users they can support with their finite supply of datacenter GPUs?) and do so by using a cheaper model.
This might account for why the Sydney model seems smarter than GPT-3 models but not as huge of a leap as rumors have been making GPT-4 out to be: ‘Prometheus’ is the babbage or curie of GPT-4 rather than the davinci. (On the other hand, the fact that Pichai is explicitly trying to squeeze pennies I would take as motivation and evidence for Nadella doing the exact opposite.)

It seems to me like “fine-tuning” usually just means a small amount of extra training on top of a model that’s already been trained, whether that’s supervised, autoregressive, RL, or whatever. I don’t find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it’s often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I’d be open to changing my usage if I saw some data on other people also using or interpreting “fine-tuning” to mean “fine-tuning with a differentiable objective.” I talk a fair amount with people who use fine-tuning in the broader sense, and haven’t noticed practitioners using it more narrowly / didn’t realize this might cause confusion.