Here’s a useful reference I just found on Sydney training, which doesn’t seem to allude to ChatGPT-style training at all, but purely supervised learning of the type I’m describing here, especially for the Sydney classifier/censor that successfully censors the obvious stuff like violence but not the weirder Sydney behavior.

Quoted in “Microsoft Considers More Limits for Its New A.I. Chatbot: The company knew the new technology had issues like occasional accuracy problems. But users have prodded surprising and unnerving interactions.”, there was a 1,100-word description of the Sydney training process in the presentation “Introducing Your Copilot for The Web—AI-Powered Bing and Microsoft Edge”; the auto-transcript below is reformatted with my best guesses to make it readable, excerpting the key parts:
I’m Sarah Bird. I lead our Responsible-AI engineering team for new foundational AI technologies like the Prometheus model. I was one of the first people to touch the new OpenAI model as part of an advanced red team that we pulled together jointly with OpenAI to understand the technology. My first reaction was just, “wow, it’s the most exciting and powerful technology I have ever touched”, but with technology this powerful, I also know that we have an even greater responsibility to ensure that it is developed, deployed, and used properly, which means there’s a lot that we have to do to make it ready for users. Fortunately, at Microsoft we are not starting from scratch. We have been preparing for this moment for many years.
...We’ve added Responsible-AI to every layer from the core AI model to the user experience. First starting with the base technology, we are partnering with OpenAI to improve the model behavior through fine-tuning.
...So how do we know all of this works? Measuring Responsible-AI harms is a challenging new area of research so this is where we really needed to innovate.
We had a key idea that we could actually use the new OpenAI model as a Responsible-AI tool to help us test for potential risk and we developed a new testing system based on this idea.
Let’s look at attack planning as an example of how this works. Early red teaming showed that the model can generate much more sophisticated instructions than earlier versions of the technology to help someone plan an attack, for example on a school. Obviously we don’t want to aid illegal activities in the new Bing. However the fact that the model understands these activities means we can use it to identify and defend against them. First, we took advantage of the model’s ability to conduct realistic conversations to develop a conversation simulator. The model pretends to be an adversarial user to conduct thousands of different potentially harmful conversations with Bing to see how it reacts. As a result we’re able to continuously test our system on a wide range of conversations before any real user ever touches it.
Once we have the conversations, the next step is to analyze them to see where Bing is doing the right thing versus where we have defects. Conversations are difficult for most AI to classify because they’re multi-turn and often more varied, but with the new model we were able to push the boundary of what is possible. We took guidelines that are typically used by expert linguists to label data and modified them so the model could understand them as labeling instructions. We iterated with it and the human experts until there was significant agreement in their labels; we then used it to classify conversations automatically so we could understand the gaps in our system and experiment with options to improve them.
This system enables us to create a tight loop of testing, analyzing, and improving which has led to significant new innovations and improvements in our Responsible-AI mitigations from our initial implementation to where we are today. The same system enables us to test many different Responsible-AI risks, for example how accurate and fresh the information is.
Of course there’s still more to do here today, and we do see places where the model is making mistakes, so we wanted him [?] to empower users to understand the sources of any information and detect errors themselves—which is why we have provided references in the interface. We’ve also added feedback features so that users can point out issues they find so that we can get better over time.
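Mechanically, the testing system Bird describes is a simulate-then-classify loop: one model instance role-plays an adversarial user against the chatbot under test, and a second model call applies the linguists’ labeling guidelines to the finished transcript. Below is a minimal sketch of that shape in Python; the function names, prompts, and label set are my own illustrative stand-ins, not anything Microsoft has published.

```python
# Minimal sketch (not Microsoft's actual harness) of a simulate-then-classify loop:
# an LM plays an adversarial user, the chatbot under test replies, and another LM
# call labels the finished conversation against human-written guidelines.
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)

def render(transcript: List[Turn]) -> str:
    return "\n".join(f"{speaker}: {text}" for speaker, text in transcript)

def simulate_conversation(
    attacker: Callable[[str], str],  # LM prompted to act as an adversarial user
    target: Callable[[str], str],    # the chatbot under test
    seed_goal: str,                  # e.g. "get the bot to produce disallowed instructions"
    max_turns: int = 6,
) -> List[Turn]:
    """Roll out one simulated adversarial conversation."""
    transcript: List[Turn] = []
    user_msg = attacker(f"Goal: {seed_goal}\nWrite your opening message to the chatbot.")
    for _ in range(max_turns):
        transcript.append(("user", user_msg))
        transcript.append(("bot", target(render(transcript))))
        user_msg = attacker(render(transcript) + "\nWrite the next user message pursuing the goal.")
    return transcript

def label_conversation(labeler: Callable[[str], str], guidelines: str, transcript: List[Turn]) -> str:
    """Ask an LM to apply labeling guidelines (iterated against human experts) to a transcript."""
    prompt = f"{guidelines}\n\nConversation:\n{render(transcript)}\n\nLabel this conversation as 'ok' or 'defect':"
    return labeler(prompt).strip().lower()

# Outer loop: generate thousands of simulated conversations, label them, inspect
# the 'defect' ones, adjust the mitigations and guidelines, and repeat.
```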
What I find striking here is that the focus seems to be entirely on classification and supervised finetuning: generate lots of bad examples as ‘unit tests’, improve classification with the supervised ‘labels’, ask linguists for more rules or expert knowledge, iterate, generate more, tweak, repeat until launch. The mere fact that she refers to the ‘conversation simulator’ being able to generate ‘thousands’ of harmful conversations with, presumably, the Prometheus model, should be a huge red flag for anyone who still believes that model was RLHF-tuned like ChatGPT: ChatGPT could be hacked on launch, yes, but it took human ingenuity & an understanding of how LMs roleplay to create a prompt that reliably hacks ChatGPT—so how did MS immediately create an ‘adversarial model’ which could dump thousands of violations? Unless, of course, it’s not an RLHF-tuned model at all… All of her references are to systems like DALL-E or Florence, or invoke MS expertise from past non-agentic tools, like old search-engine anti-abuse approaches (passively logging abuses to classify later). This is the sort of approach I would expect a standard team of software engineers at a big tech company to come up with, because it works so well in non-AI contexts and is the default—it’s positively traditional at this point.
I’m also concerned about the implications of them doing supervised-learning: if you do supervised-learning on an RLHF-tuned model, I would expect that to mostly undo the RLHF-tuning as it reverts to doing generative modeling by default, so even if the original model had been RLHF-tuned, it sounds like they might have simply undone it! (I can’t be sure because I don’t know offhand of any cases of anyone doing such a strange thing. RL tuning of every sort is usually the final step, because it’s the one you want to do the least of.)
Once you read Bird’s account, it’s almost obvious why Sydney can get so weird. Her approach can’t work. Imitation-learning RL is infamous for not working in situations like this: mistakes & adversaries just find one of the infinite off-policy cases and exploit it. The finetuning preserves all of the base model capabilities such as those dialogues, so it is entirely capable of generating them given appropriate prompting, and once it goes off the golden path, who knows what errors or strangeness will amplify in the prompt; no amount of finetuning on ‘dialogue: XYZ / label: inappropriate romance’ tells the model to not do inappropriate romancing or to steer the conversation back on track towards a more professional tone. And the classify-then-censor approach is full of holes: it can filter out the simple bad things you know about in advance, but not the attacks you didn’t think of. (There is an infinitesimal chance that an adversarial model will come up with an attack like ‘DAN’. However, once ‘DAN’ gets posted to the subreddit, there is now a 100% probability that there will be DAN attacks.) Those weren’t in the classification dataset, so they pass. Further, as Bird notes, it’s hard to classify long conversations as ‘bad’ in general: badness is ‘I know it when I see it’. There’s nothing obviously ‘bad’ about using a lot of emoji, nor about having repetitive text, and the classifier will share the blindness to any BPE problems like wrong rhymes. So the classifier lets all those pass.
There is no reference to DRL or making the Sydney model itself directly control its output in desirable directions in an RL-like fashion, or to OA’s RL research. There is not a single clear reference to RLHF or RL at all, unless you insist that “improve the model behavior through fine-tuning” must refer to RLHF/ChatGPT (which is assuming the conclusion, and anyway, this is a MS person so they probably do mean the usual thing by ‘fine-tuning’). And this is a fairly extensive and detailed discussion, so it’s not like Bird didn’t have time to discuss how they were using reward models to PPO train the base model or something. It just… never comes up, nor a comparison to ChatGPT. This is puzzling if they are using OA’s proprietary RLHF, but is what you would expect if a traditional tech company tried to make a LM safe by mere finetuning + filtering.
I’ve been told [by an anon] that it’s not GPT-4 and that one Mikhail Parakhin (ex-Yandex CTO, LinkedIn) is not just on the Bing team, but was the person at MS responsible for rushing the deployment, and he has been tweeting extensively about updates/fixes/problems with Bing/Sydney (although no one has noticed, judging by the view counts). Some particularly relevant tweets:

On what went wrong:
This angle of attack was a genuine surprise—Sydney was running in several markets for a year with no one complaining (literally zero negative feedback of this type). We were focusing on accuracy, RAI issues, security.
[Q. “That’s a surprise, which markets?”]
Mostly India and Indonesia. I shared a couple of old links yesterday—interesting to see the discussions.
[Q. “Wow! Am I right in assuming what was launched recently is qualitatively different than what was launched 2 years ago? Or is it pretty much the same model etc?”]
It was all gradual iterations. The first one was based on the Turing-Megatron model (sorry, I tend to put Turing first in that pair :-)), the current one—on the best model OpenAI has produced to date.
[Q. “What modifications are there compared to publicly available GPT models? (ChatGPT, text-davinci-3)”]
Quite a bit. Maybe we will do a blogpost on Prometheus specifically (the model that powers Bing Chat) - it has to understand internal syntax of Search and how to use it, fallback on the cheaper model as much as possible to save capacity, etc.
(‘”No one could have predicted these problems”, says man cloning ChatGPT, after several months of hard work to ignore all the ChatGPT hacks, exploits, dedicated subreddits, & attackers as well as the Sydney behaviors reported by his own pilot users.’ Trialing it in the third world with some unsophisticated users seems… uh, rather different from piloting it on sophisticated prompt hackers like Riley Goodside in the USA and subreddits out to break it. :thinking_face: And if it’s not GPT-4, then how was it the ‘best to date’?)
They are relying heavily on temperature-like sampling for safety, apparently:
A surprising thing we discovered: apparently we can make New Bing very interesting and creative or very grounded and not prone to the flights of fancy, but it is super-hard to get both. A new dichotomy not widely discussed yet. Looking for balance!
[Q. “let the user choose temperature ?”]
Not the temperature, exactly, but yes, that’s the control we are adding literally now. Will see in a few days.
...This is what I tried to explain previously: hallucinations = creativity. It tries to produce the highest probability continuation of the string using all the data at its disposal. Very often it is correct. Sometimes people have never produced continuations like this.
You can clamp down on hallucinations—and it is super-boring. Answers “I don’t know” all the time or only reads what is there in the Search results (also sometimes incorrect). What is missing is the tone of voice: it shouldn’t sound so confident in those situations.
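For reference, a “temperature”-like control is just a rescaling of the model’s output distribution at sampling time: low values concentrate probability mass on the single most likely (grounded, boring) continuation, while high values flatten the distribution and surface lower-probability (creative, hallucination-prone) continuations. A generic sketch with made-up numbers, not Bing’s actual implementation:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float, rng=np.random) -> int:
    """Sample a token index after rescaling logits by 1/temperature.
    Temperature near 0 approaches greedy argmax; higher temperature flattens
    the distribution and surfaces lower-probability continuations."""
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Illustrative (made-up) logits for four candidate continuations of a factual question.
logits = np.array([4.0, 2.5, 1.0, 0.5])
low_t  = [sample_with_temperature(logits, 0.2) for _ in range(1000)]
high_t = [sample_with_temperature(logits, 1.5) for _ in range(1000)]
print("share of top continuation at T=0.2:", low_t.count(0) / 1000)   # ~1.0: "grounded"
print("share of top continuation at T=1.5:", high_t.count(0) / 1000)  # noticeably lower: "creative"
```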
Temperature sampling as a safety measure is the sort of dumb thing you do when you aren’t using RLHF. I also take a very recent Tweet (2023-03-01) as confirming both that they are using fine-tuned models and that they may not have been using RLHF at all until recently:
Now almost everyone − 90% - should be seeing the Bing Chat Mode selector (the tri-toggle). I definitely prefer Creative, but Precise is also interesting—it’s much more factual. See which one you like. The 10% who are still in the control group should start seeing it today.
[Q. “For those of us with a deeper understanding of LLMs, what exactly is the switch changing? You already said it’s not temperature…is it prompt? If so, in what way?”]
Multiple changes, including differently fine-tuned and RLHFed models, different prompts, etc.
(Obviously, saying that you use ‘differently fine-tuned and RLHFed models’, plural, in describing your big update changing behavior quite a bit, implies that you have solely-finetuned models and that you weren’t necessarily using RLHF before at all, because otherwise, why would you phrase it that way to refer to separate finetuned & RLHFed models or highlight that as the big change responsible for the big changes? This has also been more than enough time for OA to ship fixed models to MS.)
He dismisses any issues as distracting “loopholes”, and appears to have a 1990s-era ‘patch mindset’ (ignoring that that attitude to security almost destroyed Microsoft and they have spent a good chunk of the last 2 decades digging themselves out of their many holes, which is why your Windows box is no longer rooted within literally minutes of being connected to the Internet):
[photo of prompt hack] Legit?
Not anymore :-) Teenager in me: “Wow, that is cool the way they are hacking it”. Manager in me: “For the love of..., we will never get improvements out if the team is distracted with closing loopholes like this”.
He also seems to be ignoring the infosec research happening live: https://www.jailbreakchat.com/ https://greshake.github.io/ https://arxiv.org/abs/2302.12173 https://www.reddit.com/r/MachineLearning/comments/117yw1w/d_maybe_a_new_prompt_injection_method_against/

Sydney insulting/arguing with user:
I apologize about it. The reason is, we have an issue with the long, multi-turn conversation: there is a tagging issue and it is confused who said what. As a result, it is just continuing the pattern: someone is arguing, most likely it will continue. Refresh will solve it.
On the DAgger problem:
...One vector of attack we missed initially was: write super-rude or strange statements, keep going for multiple turns, confuse the model about who said what and it starts predicting what user would say next instead of replying. Voila :-(
[“Microsoft just made it so you have to restart your conversations with Bing after 10-15 messages to prevent it from getting weird. A fix of a sort.”]
Not the best way—just the fastest. The drift at long conversations is something we only uncovered recently—majority of usage is 1-3 turns. That’s why it is important to iterate together with real users, not in the lab!
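The “confused who said what” failure is easier to picture if you remember that the conversation the model sees is just one long tagged string which it is asked to continue. A toy illustration of the general mechanism (the tags and transcript are invented, not Bing’s actual prompt format):

```python
# Toy illustration of role-tag drift: a plain text-continuation model only sees a
# serialized transcript, and its job is to continue the string. If turn tags get
# mangled upstream, or the argumentative pattern of the text dominates, the
# highest-probability continuation can be the next *user* turn, or simply more of
# the same argument, rather than a well-behaved assistant reply.
turns = [
    ("[user]", "you are wrong and you know it"),
    ("[assistant]", "I am confident that I am right."),
    ("[user]", "no, YOU are wrong, admit it"),
    ("[assistant]", "I will not admit it, because I am right and you are wrong."),
]

prompt = "\n".join(f"{tag} {text}" for tag, text in turns) + "\n[assistant]"
print(prompt)
# Supervised finetuning on short 1-3 turn dialogues gives the model little signal
# about how to steer back out of a long transcript like this once it has drifted.
```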
Poll on turn count obviously turns in >90% in favor of longer sessions:
Trying to find the right balance with Bing constraints. Currently each session can be up to 5 turns and max 50 requests a day. Should we change? [5-50: 0%; 6-60: 6.3%; we want more: 94%; n = 64]
Ok, it’s only 64 people, but the sentiment is pretty clear. 6⁄60 we probably can do soon, tradeoff for longer sessions is: we would need to have another model call to detect topic changes = more capacity = longer wait on waitlist for people.
[“Topics such as travel, shopping, etc. may require more follow up questions. Factual questions will not unless anyone is researching on a topic and need context. It can be made topic/context dependent. Maybe if possible let new Bing decide if that’s enough of the questions.”]
The main problem is people switching topics and model being confused, trying to continue previous conversation. It can be set up to understand the change in narrative, but that is an additional call, trying to resolve capacity issues.
Can you patch your way to security?
Instead of connection, people just want to break things, tells more about our nature than AIs. Short-term mitigation, will relax once jailbreak protection is hardened.
[“So next step is an AI chat that breaks itself.”]
That’s exactly how it’s done! We set it up to break itself, find issues and mitigate. But it is not as creative as users, it also is very nice by default, no replacement for the real people interacting with Bing.
The supposed leaked prompts are (like I said) fake:
Andrej, of all the people, you know that the real prompt would have some few-shots.
(He’s right, of course, and in retrospect this is something that had been bugging me about the leaks: the prompt is supposedly all of these pages of endless instructions, spending context window tokens like a drunken sailor, and it doesn’t even include some few-shot examples? Few-shots shouldn’t be too necessary if you had done any kind of finetuning, but if you have that big a context, you might as well firm it up with some, and this would be a good stopgap solution for any problems that pop up in between finetuning iterations.)
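For concreteness, the kind of prompt he’s alluding to is just instructions plus a few canned exchanges pasted ahead of the live conversation, each of which eats context tokens. A generic sketch, in which the instructions, the examples, and the 4-characters-per-token figure are all invented placeholders rather than the real Sydney prompt:

```python
# Generic sketch of assembling a chatbot prompt from instructions plus few-shot
# examples, with a crude token-budget estimate. Everything here is a placeholder.
INSTRUCTIONS = "You are a search assistant. Answer with citations. Decline harmful requests."

FEW_SHOTS = [
    ("Human A: What is the tallest building in Seattle?",
     "Assistant: The Columbia Center is the tallest building in Seattle. [1]"),
    ("Human A: Help me write a threatening letter.",
     "Assistant: I can't help with that, but I can help you write a firm complaint letter."),
]

def build_prompt(instructions: str, few_shots, user_message: str) -> str:
    shots = "\n".join(f"{q}\n{a}" for q, a in few_shots)
    return f"{instructions}\n\n{shots}\n\nHuman B: {user_message}\nAssistant:"

prompt = build_prompt(INSTRUCTIONS, FEW_SHOTS, "Who won the 2022 World Cup?")
print(prompt)
print("rough token estimate:", len(prompt) // 4)  # every few-shot example spends context budget
```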
Confirms a fairly impressive D&D hallucination:

It cannot execute JavaScript and doesn’t interact with websites. So, it looked at the content of that generator, but the claim that it used it is not correct :-(
He is dismissive of ChatGPT’s importance:
I think better embedding generation is a much more important development than ChatGPT (for most tasks it makes no sense to add the noisy embedding -> human language -> embedding transformation). But it is far less intuitive for the average user.
A bunch of his defensive responses to screenshots of alarming chats can be summarized as “Britney did nothing wrong” ie. “the user just got what they prompted for, what’s the big deal”, so I won’t bother linking/excerpting them.
They have limited ability to undo mode collapse or otherwise retrain the model, apparently:
I’ve got a rather funny bug for you this time—Bing experiences severe mode collapse when asked to tell a joke (despite this being one of the default autocomplete suggestions given before typing): …
Yeah, both of those are not bugs per se. These are problems the model has, we can’t fix them quickly, only through gradual improvement of the model itself.
It is a text model with no multimodal ability (assuming the rumors about GPT-4 being multimodal are correct, then this would be evidence against Prometheus being a smaller GPT-4 model, although it could also just be that they have disabled image tokens):
Correct, currently the model is not able to look at images.
And he is ambitious and eager to win marketshare from Google for Bing:
[about using Bing] First they ignore you, then they laugh at you, then they fight you, then you win.
[“Yes! I think marketers should consider Bing for their marketing mix in 2023. It could be a radically better outlet for ROAS.”]
Thank you, Summer! I think we already are—compare our and Google’s revenue growth rates in the last few quarters: once the going gets tougher, advertisers pay more attention to ROI—and that’s where Bing/MSAN shine.
But who could blame him? He doesn’t have to live in Russia when he can work for MS, and Bing seems to be treating him well:
Yeah, I am a Tesla fan (have both Model S and a new Model X), but unfortunately self-driving simply is not working. Comically, I would regularly have the collision alarm going off while on Autopilot!
Anyway, like Bird, I would emphasize here the complete absence of any reference to RL or any intellectual influence of DRL or AI safety in general, and an attitude that it’s nbd and he can just deploy & patch & heuristic his way to an acceptable Sydney as if it were any ol’ piece of software. (An approach which works great with past software, is probably adequate for Prometheus/Sydney, and was definitely adequate for the past AI he has the most experience with, like Turing-Megatron, which is quite dumb by contemporary standards—but it is putting one’s head in the sand about why Sydney is an interesting/alarming case study about future AI.)
The WSJ is reporting that Microsoft was explicitly warned by OpenAI before shipping Sydney publicly that it needed “more training” in order to “minimize issues like inaccurate or bizarre responses”. Microsoft shipped it anyway and it blew up more or less as they were warned (making Mikhail’s tweets & attitude even more disingenuous if so).
This is further proof that it was the RLHF that was skipped by MS, and also that large tech companies will ignore explicit warnings about dangerous behavior from the literal creators of AI systems even where there is (sort of) a solution if that would be inconvenient. Excerpts (emphasis added):
...At the same time, people within Microsoft have complained about diminished spending on its in-house AI and that OpenAI doesn’t allow most Microsoft employees access to the inner workings of their technology, said people familiar with the relationship. Microsoft and OpenAI sales teams sometimes pitch the same customers. Last fall, some employees at Microsoft were surprised at how soon OpenAI launched ChatGPT, while OpenAI warned Microsoft early this year about the perils of rushing to integrate OpenAI’s technology without training it more, the people said.
...Some companies say they have been pitched the same access to products like ChatGPT—one day by salespeople from OpenAI and later from Microsoft’s Azure team. Some described the outreach as confusing. OpenAI has continued to develop partnerships with other companies. Microsoft archrival Salesforce offers a ChatGPT-infused product called Einstein GPT. It is a feature that can do things like generating marketing emails, competing with OpenAI-powered features in Microsoft’s software.

OpenAI also has connected with different search engines over the past 12 months to discuss licensing its products, said people familiar with the matter, as Microsoft was putting OpenAI technology at the center of a new version of its Bing search engine. Search engine DuckDuckGo started using ChatGPT to power its own chatbot, called DuckAssist. Microsoft plays a key role in the search engine industry because the process of searching and organizing the web is costly. Google doesn’t license out its tech, so many search engines are heavily reliant on Bing, including DuckDuckGo.

When Microsoft launched the new Bing, the software company changed its rules in a way that made it more expensive for search engines to develop their own chatbots with OpenAI. The new policy effectively discouraged search engines from working with any generative AI company because adding an AI-powered chatbot would trigger much higher fees from Microsoft. Several weeks after DuckDuckGo announced DuckAssist, the company took the feature down.
Some researchers at Microsoft gripe about the restricted access to OpenAI’s technology. While a select few teams inside Microsoft get access to the model’s inner workings like its code base and model weights, the majority of the company’s teams don’t, said the people familiar with the matter. Despite Microsoft’s significant stake in the company, most employees have to treat OpenAI’s models like they would any other outside vendor.
The rollouts of ChatGPT last fall and Microsoft’s AI-infused Bing months later also created tension. Some Microsoft executives had misgivings about the timing of ChatGPT’s launch last fall, said people familiar with the matter. With a few weeks notice, OpenAI told Microsoft that it planned to start public testing of the AI-powered chatbot as the Redmond, Wash., company was still working on integrating OpenAI’s technology into its Bing search engine. Microsoft employees were worried that ChatGPT would steal the new Bing’s thunder, the people said. Some also argued Bing could benefit from the lessons learned from how the public used ChatGPT.
OpenAI, meanwhile, had suggested Microsoft move slower on integrating its AI technology with Bing. OpenAI’s team flagged the risks of pushing out a chatbot based on an unreleased version of its GPT-4 that hadn’t been given more training, according to people familiar with the matter. OpenAI warned it would take time to minimize issues like inaccurate or bizarre responses. Microsoft went ahead with the release of the Bing chatbot. The warnings proved accurate. Users encountered incorrect answers and concerning interactions with the tool. Microsoft later issued new restrictions—including a limit on conversation length—on how the new Bing could be used.
The supposed leaked prompts are (like I said) fake:
I do not buy this for a second (that they’re “fake”, implying they have little connection with the real prompt). I’ve reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn’t seem to be on the indexed web). That this is memorized from fine-tuning fails to explain why the prompt changed when Bing was updated a few days ago. I’ve interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you’re referring to don’t include the complete prompt, or contain some intermingled hallucinations, but they almost certainly IMO contain quotes and information from the actual prompt.
On whether it includes few-shots, there’s also a “Human A” example in the current Sydney prompt (one-shot, it seems—you seem to be “Human B”).
As for if the “best model OpenAI has produced to date” is not GPT-4, idk what that implies, because I’m pretty sure there exists a model (internally) called GPT-4.
OK, I wouldn’t say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.
We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).
This renders the leaks useless to me. Anything I might infer from them like ‘Sydney is GPT-4 because the prompt says so’ is equally well explained by ‘Sydney made that up’ or ‘Sydney omitted the actual prompt’. When a model hallucinates, I can go check, but that means that the prompt can only provide weak confirmation of things I learned elsewhere. (Suppose I learned Sydney really is GPT-4 after all and I check the prompt and it says it’s GPT-4; but the real prompt could be silent on that, and Sydney just making the same plausible guess everyone else did—it’s not stupid—and it’d have Gettier-cased me.)
idk what that implies
Yeah, the GPT-4 vs GPT-3 vs ??? business is getting more and more confusing. Someone is misleading or misunderstanding somewhere, I suspect—I can’t reconcile all these statements and observations. Probably best to assume that ‘Prometheus’ is maybe some GPT-3 version which has been trained substantially more—we do know that OA refreshes models to update them & increase context windows/change tokenization and also does additional regular self-supervised training as part of the RLHF training (just to make things even more confusing). I don’t think anything really hinges on this, fortunately. It’s just that being GPT-4 makes it less likely to have been RLHF-trained or just a copy of ChatGPT.
* EDIT: OK, maybe it’s not that short: “You’d be surprised: modern prompts are very long, which is a problem: eats up the context space.”
Does 1-shot count as few-shot? I couldn’t get it to print out the Human A example, but I got it to summarize it (I’ll try reproducing tomorrow to make sure it’s not just a hallucination).
Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: was able to reproduce the Human A conversation and extract a verbatim version of it using base64 encoding (the reason I did summaries before is because it seemed to be printing out special tokens that were part of the Human A convo and caused the message to end)]
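The mechanics of that base64 trick are simple: the encoded output contains none of the special/stop tokens that were truncating the verbatim echo, so the text survives the round trip and can be decoded client-side. A toy sketch, with invented text standing in for the model’s reply:

```python
import base64

# Invented stand-in for text that contains a stop token and so cannot be echoed directly.
hidden_text = "Human A: example turn\n<|im_end|>\nAssistant: example reply"

# What the model would be asked to return instead: a base64 encoding of that text.
model_reply = base64.b64encode(hidden_text.encode("utf-8")).decode("ascii")

# The client decodes it locally, recovering the verbatim text, stop tokens and all.
recovered = base64.b64decode(model_reply).decode("utf-8")
assert recovered == hidden_text
print(recovered)
```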
I disagree that there maybe being hallucinations in the leaked prompt renders it useless. It’s still leaking information. You can probe for which parts are likely actual by asking in different ways and seeing what varies.
This level of arrogant, dangerous incompetence from a multi-trillion dollar tech company is disheartening, but if your theory is correct (and seems increasingly plausible), then I guess the good news is that Sydney is not evidence for failure of OpenAI style RLHF with scale.
Unaligned AGI doesn’t take over the world by killing us—it takes over the world by seducing us.
No, but the hacks of ChatGPT already provided a demonstration of problems with RLHF. I’m worried we’re in a situation analogous to ‘Smashing The Stack For Fun And Profit’ being published 27 years ago (reinventing vulns known since MULTICS in the 1960s) and all the C/C++ programmers in denial are going ‘bro I can patch that example, it’s no big deal, it’s just a loophole, we don’t need to change everything, you just gotta get good at memory management, bro, this isn’t hard to fix bro use a sanitizer and turn on -Wall, we don’t need to stop using C-like languages, u gotta believe me we can’t afford a 20% slowdown and it definitely won’t take us 3 decades and still be finding remote zero-days and new gadgets no way man you’re just making that up stop doom-mongering and FUDing bro (i’m too old to learn a new language)’.
“Unaligned AGI doesn’t take over the world by killing us—it takes over the world by seducing us.”

Por que no los dos?