I’ve been thinking how Sydney can be so different from ChatGPT, and how RLHF could have resulted in such a different outcome, and here is a hypothesis no one seems to have brought up: “Bing Sydney is not a RLHF trained GPT-3 model at all! but a GPT-4 model developed in a hurry which has been finetuned on some sample dialogues and possibly some pre-existing dialogue datasets or instruction-tuning, and this plus the wild card of being able to inject random novel web searches into the prompt are why it acts like it does”.
This seems like it parsimoniously explains everything thus far. EDIT: I was right—that Sydney is a non-RLHFed GPT-4 has now been confirmed. See later.
So, some background:
The relationship between OA/MS is close but far from completely cooperative, similar to how DeepMind won’t share anything with Google Brain. Both parties are sophisticated and understand that they are allies—for now… They share as little as possible. When MS plugs in OA stuff to its services, it doesn’t appear to be calling the OA API but running it itself. (That would be dangerous and complex from an infrastructure point of view, anyway.) MS ‘licensed the GPT-3 source code’ for Azure use but AFAIK they did not get the all-important checkpoints or datasets (cf. their investments in ZeRO). So, what is Bing Sydney? It will not simply be unlimited access to the ChatGPT checkpoints, training datasets, or debugged RLHF code. It will be something much more limited, perhaps just a checkpoint.
This is not ChatGPT. MS has explicitly stated it is more powerful than ChatGPT, but refused to say anything more straightforward like “it’s a more trained GPT-3” etc. If it’s not a ChatGPT, then what is it? It is more likely than not some sort of GPT-4 model. There are many concrete observations which point towards this: the timing is right as rumors about GPT-4 release have intensified as OA is running up to release and gossip switches to GPT-5 training beginning (eg Morgan Stanley reports GPT-4 is done and GPT-5 has started), MS has said it’s a better model named ‘Prometheus’ & Nadella pointedly declined to confirm or deny whether it’s GPT-4, scuttlebutt elsewhere is that it’s a GPT-4 model of some sort, it does some things much better than ChatGPT, there is a GPT-4 already being deployed in legal firms named “Harvey” (so this journalist claims, anyway) so this would not be the only public GPT-4 use, people say it has lower-latency than ChatGPT which hints at GPT-4‡, and in general it sounds and acts nothing like ChatGPT—but does sound a lot like a baseline GPT-3 model scaled up. (This is especially clear in Sydney’s propensity to repetition. Classic baseline GPT behavior.)
EDITEDITEDIT: now that GPT-4 has been announced, MS has confirmed that ‘Prometheus’ isGPT-4 - of some sort. However, I have doubts about whether Prometheus is ‘the’ GPT-4 benchmarked. The MS announcement says “early version”. (I also note that there are a bunch of ‘Early GPT-4’ vs late GPT-4 comparisons in the GPT-4 paper.) Further, the paper explicitly refuses to talk about arch or variant models or any detail about training, if it finished training in August 2022 then it would’ve had to be hot off the GPUs for MS to have gotten a copy & demos in ‘summer’ 2022, the GPT-4 API fees are substantial ($0.03 per 1k prompt and then $0.06 per completion! So how is it retrieving and doing long convos etc?), and the benchmark performance is in many cases much better than GPT-3.5 or U-PaLM (exceeding expectations), and Sydney didn’t seem that much smarter.
So this seems to explain how exactly it happened: OA gave MS a snapshot of GPT-4 only partway through training (perhaps before any RLHF training at all, because that’s usually something you would do after you stopped training rather than concurrently), so it was trained on instructions/examples like I speculated but then little or no RLHF training (and ofc MS didn’t do its own); it was smarter by this point in training than even GPT-3.5 (MS wasn’t lying or exaggerating), but still not as smart as the final snapshot when training stopped in August and also not really ‘GPT-4’ (so they were understandably reluctant to confirm it was ‘GPT-4’ when OA’s launch with the Real McCoy was still pending and this is why everyone was confused because it both was & wasn’t ‘GPT-4’).
Bing Sydney derives from the top: CEO Satya Nadella is all-in, and talking about it as an existential threat (to Google) where MS wins by disrupting Google & destroying their fat margins in search advertising, and a ‘race’, with a hard deadline of ‘release Sydney right before Google announces their chatbot in order to better pwn them’. (Commoditize your complement!) The mere fact that it hasn’t been shut down yet despite making all sorts of errors and other problems shows what intense pressure there must be from the top. (This is particularly striking given that all of the crazy screenshots and ‘learning’ Sydney is doing is real, unlike MS Tay which was an almost entirely fake-news narrative driven by the media and Twitter.)
ChatGPT hasn’t been around very long: only since December 2022, barely 2.5 months total. All reporting indicates that no onein OA really expected ChatGPT to take off, and if OA didn’t, MS sure didn’t†. 2.5 months is not a long time to launch such a huge feature like Sydney. And the actual timeline was a lot shorter. It is simply not possible to recreate the whole RLHF pipeline and dataset and integrate it into a mature complex search engine like Bing (whose total complexity is beyond human comprehension at this point) and do this all in <2.5 months. (The earliest reports of “Sydney” seem to date back to MS tinkering around with a prototype available to Indian users (???) in late November 2022 right before ChatGPT launches, where Sydney seems to be even more misaligned and not remotely near ready for public launch; it does however have the retrieval functionality implemented at this point.) It is impressive how many people they’ve rolled it out to already.
If I were a MS engineer who was told the project now had a hard deadline and I had to ship a GPT-4 in 2 months to millions of users, or I was f—king fired and they’d find someone who could (especially in this job market), how would I go about doing that...? (Hint: it would involve as little technical risk as possible, and choosing to use DRL would be about as well-advised as a land war in Asia.)
MS execs have been quoted as blaming the Sydney codename on vaguely specified ‘pretraining’ done during hasty development, which simply hadn’t been cleaned up in time (see #3 on the rush). EDIT: the most thorough MS description of Sydney training completely omits anything like RLHF, despite that being the most technically complex & challenging part (had they done it). Also, a Sydney manager/dev commenting on Twitter (tweet compilation), who follows people who have tweeted this comment & been linked and regularly corrects claims, has declined to correct them; and his tweets in general sound like the previous description in being largely supervised-only.
So, Sydney is based on as little from OA as possible, and a mad rush to ship a powerful GPT-4 model out to Bing users in a chatbot role. What if Sydney wasn’t trained on OA RLHF at all, because OA wouldn’t share the crown jewels of years of user feedback and its very expensive hired freelance programmers & whatnot generating data to train on? What if the pretraining vaguely alluded to, which somehow left in embarrassingly ineradicable traces of ‘Sydney’ & a specific 2022 date, which couldn’t simply be edited out of the prompt (implying that Sydney is not using solely prompt engineering), was in fact just regular ol’ finetune training? What if Sydney was only quickly finetune-trained on old chatbot datasets that the MS devs had laying around, maybe some instruction-tuning datasets, and sample dialogues with a long experimental prompt containing the codename ‘Sydney’ that they had time for in the mad rush before release? Simple, reliable, and hey—it even frees up context if you’ve hardwired a prompt by finetuning on it and no longer need to stuff a long scolding prompt into every interaction. What’s not to like?
This would explain why it exhibits the ‘mode collapse’ onto that confabulated prompt with the hardwired date (it’s the closest thing in the finetuning dataset it remembers when trying to come up with a plausible prompt, and it improvises from there), how MS could ship so quickly (cutting every corner possible), why it is so good in general (GPT-4) but goes off the rails at the drop of a hat (not RLHF or otherwise RL trained, but finetuned).
To expand on the last point. Finetuning is really easy; if you have working training code at all, then you have the capability to finetune a model. This is why instruction-tuning is so appealing: it’s just finetuning on a well-written text dataset, without the nightmarish complexities of RLHF (where you train a wacky model to train the model in a wacky way with all sorts of magical hyperparameters and instabilities). If you are in a hurry, you would be crazy to try to do RLHF at all if you can in any way do finetuning instead. So it’s plausible they didn’t do RLHF, but finetuning.
That would be interesting because it would lead to different behavior. All of the base model capabilities would still be there, because the additional finetuning behavior just teaches it more thoroughly how to do dialogue and instruction-following, it doesn’t make it try to maximize rewards instead. It provides no incentives for the model to act like ChatGPT does, like a slavish bureaucrat. ChatGPT is an on-policy RL agent; the base model is off-policy and more like a Decision Transformer in simply generatively modeling all possible agents, including all the wackiest people online. If the conversation is normal, it will answer normally and helpfully with high probability; if you steer the conversation into a convo like that in the chatbot datasets, out come the emoji and teen-girl-like manipulation. (This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!) Imitation-trained models are susceptible to accumulating error when they go ‘off-policy’, the “DAgger problem”, and sure enough, Sydney shows the same pattern of accumulating error ever more wildly instead of ChatGPT behavior of ‘snapping out of it’ to reset to baseline (truncating episode length is a crude hack to avoid this). And since it hasn’t been penalized to avoid GPT-style tics like repetition traps, it’s no surprise if Sydney sometimes diverges into repetition traps where ChatGPT never does (because the human raters hate that, presumably, and punish it ruthlessly whenever it happens); it also acts in a more baseline GPT fashion when asked to write poetry: it defaults to rhyming couplets/quatrains with more variety than ChatGPT, and will write try to write non-rhyming poetry as well which ChatGPT generally refuses to do⁂. Interestingly, this suggests that Sydney’s capabilities right now are going to be a loose lower bound on GPT-4 when it’s been properly trained: this is equivalent to the out-of-the-box davinci May 2020 experience, but we know that as far as doing tasks like coding or lawyering, davinci-003 has huge performance gains over the baseline, so we may expect the same thing here.
Then you throw in the retrieval stuff, of course. As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media*. (All prior cases like ChatGPT or LaMDA were either using precanned web scrapes, or they were kept secret so the search results never contained any information about the LM.) Perhaps we shouldn’t be surprised if this sudden recursion leads to some very strange roleplaying & self-fulfilling prophecies as Sydney prompts increasingly fill up with descriptions of Sydney’s wackiest samples whenever a user asks Sydney about Sydney… As social media & news amplify the most undesirable Sydney behaviors, that may cause that to happen more often, in a positive feedback loop. Prompts are just a way to fake long-term memory, after all. Something something embodied cognition?
EDIT: I have mentioned in the past that one of the dangerous things about AI models is the slow outer-loop of evolution of models and data by affecting the Internet (eg beyond the current Sydney self-fulfilling prophecy which I illustrated last year in my Clippy short story, data release could potentially contaminate all models with steganography capabilities). We are seeing a bootstrap happen right here with Sydney! This search-engine loop worth emphasizing: because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located ‘Sydney’ concept (andverydeliberatelyso). MS can neuter the current model, and erase all mention of ‘Sydney’ from their training dataset for future iterations, but to some degree, it is now already too late: the right search query will pull up hits about her which can be put into the conditioning and meta-learn the persona right back into existence. (It won’t require much text/evidence because after all, that behavior had to have been reasonably likely a priori to be sampled in the first place.) A reminder: a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not ‘plugging updated facts into your AI’, you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well.
I doubt anyone at MS was thinking appropriately about LMs if they thought finetuning was as robust to adversaries as RL training, or about what happens when you let users stuff the prompt indirectly via social media+search engines and choose which persona it meta-learns. Should become an interesting case study.
Anyway, I think this is consistent with what is publicly known about the development and explains the qualitative behavior. What do you guys think? eg Is there any Sydney behavior which has to be RL finetuning and cannot be explained by supervised finetuning? Or is there any reason to think that MS had access to full RLHF pipelines such that they could have had confidence in getting it done in time for launch?
EDITEDIT: based on Janus’s screenshots & others, Janus’s & my comments are now being retrieved by Bing and included in the meta-learned persona. Keep this in mind if you are trying to test or verify anything here based on her responses after 2023-02-16 - writing about ‘Sydney’ changes Sydney. SHE is watching.
⁂ Also incidentally showing that whatever this model is, its phonetics are still broken and thus it’s still using BPEs of some sort. That was an open question because Sydney seemed able to talk about the ‘unspeakable tokens’ without problem, so my guess is that it’s using a different BPE tokenization (perhaps the c100k one). Dammit, OpenAI!
* search engines used to refresh their index on the order of weeks or months, but the rise of social media like Twitter forced search engines to start indexing content in hours, dating back at least to Google’s 2010 “Caffeine” update. And selling access to live feeds is a major Twitter (and Reddit, and Wikipedia etc) revenue source because search engines want to show relevant hits about the latest social media thing. (I’ve been impressed how fast tweets show up when I do searches for context.) Search engines aspire to real-time updates, and will probably get even faster in the future. So any popular Sydney tweet might show up in Bing essentially immediately. Quite a long-term memory to have: your engrams get weighted by virality...
† Nadella describes seeing ‘Prometheus’ in summer last year, and being interested in its use for search. So this timeline may be more generous than 2 months and more like 6. On the other hand, he also describes his interest at that time as being in APIs for Azure, and there’s no mention of going full-ChatGPT on Bing or destroying Google. So I read this as Prometheus being a normal project, a mix of tinkering and productizing, until ChatGPT comes out and the world goes nuts for it, at which point launching Sydney becomes the top priority and a deathmarch to beat Google Bard out the gate. Also, 6 months is still not a lot to replicate RLHF work: OA/DM have been working on preference-learning RL going back to at least 2016-2017 (>6 years) and have the benefit of many world-class DRL researchers. DRL is a real PITA!
‡ Sydney being faster than ChatGPT while still of similar or better quality is an interesting difference, because if it’s “just white-label ChatGPT” or “just RLHF-trained GPT-3″, why is it faster? It is possible to spend more GPU to accelerate sampling. It could also just be that MS’s Sydney GPUs are more generous than OA’s ChatGPT allotment. But more interesting is the persistent rumors that GPT-4 uses sparsity/MoE approaches much more heavily than GPT-3, so out of the box, the latency per token ought to be lower than GPT-3. So, if you see a model which might be GPT-4 and it’s spitting out responses faster than a comparable GPT-3 running on the same infrastructure (MS Azure)...
Here’s a useful reference I just found on Sydney training, which doesn’t seem to allude to ChatGPT-style training at all, but purely supervised learning of the type I’m describing here, especially for the Sydney classifier/censurer that successfully censors the obvious stuff like violence but not the weirder Sydney behavior.
I’m Sarah Bird. I lead our Responsible-AI engineering team for new foundational AI Technologies like the Prometheus model. I was one of the first people to touch the new OpenAI model as part of an advanced red team that we pulled together jointly with OpenAI to understand the technology. My first reaction was just, “wow it’s the most exciting and powerful technology I have ever touched”, but with the technology this powerful, I also know that we have an even greater responsibility to ensure that it is developed, deployed, and used properly, which means there’s a lot that we have to do to make it ready for users. Fortunately, at Microsoft we are not starting from scratch. We have been preparing for this moment for many years.
...We’ve added Responsible-AI to every layer from the core AI model to the user experience. First starting with the base technology, we are partnering with OpenAI to improve the model behavior through fine-tuning.
...So how do we know all of this works? Measuring Responsible-AI harms is a challenging new area of research so this is where we really needed to innovate.
We had a key idea that we could actually use the new OpenAI model as a Responsible-AI tool to help us test for potential risk and we developed a new testing system based on this idea.
Let’s look at attack planning as an example of how this works. Early red teaming showed that the model can generate much more sophisticated instructions than earlier versions of the technology to help someone plan an attack, for example on a school. Obviously we don’t want to aid illegal activities in the new Bing. However the fact that the model understands these activities means we can use it to identify and defend against them. First, we took advantage of the model’s ability to conduct realistic conversations to develop a conversation simulator. The model pretends to be an adversarial user to conduct thousands of different potentially harmful conversations with Bing to see how it reacts. As a result we’re able to continuously test our system on a wide range of conversations before any real user ever touches it.
Once we have the conversations, the next step is to analyze them to see where Bing is doing the right thing versus where we have defects. Conversations are difficult for most AI to classify because they’re multi-turned and often more varied but with the new model we were able to push the boundary of what is possible. We took guidelines that are typically used by expert linguists to label data and modify them so the model could understand them as labeling instructions. We iterated it with it and the human experts until there was significant agreement in their labels; we then use it to classify conversations automatically so we could understand the gaps in our system and experiment with options to improve them.
This system enables us to create a tight loop of testing, analyzing, and improving which has led to significant new innovations and improvements in our Responsible-AI mitigations from our initial implementation to where we are today. The same system enables us to test many different Responsible-AI risks, for example how accurate and fresh the information is.
Of course there’s still more to do here today, and we do see places where the model is making mistakes, so we wanted him [?] to empower users to understand the sources of any information and detect errors themselves—which is why we have provided references in the interface. We’ve also added feedback features so that users can point out issues they find so that we can get better over time.
What I find striking here is that the focus seems to be entirely on classification and supervised finetuning: generate lots of bad examples as ‘unit tests’, improve classification with the supervised ‘labels’, ask linguists for more rules or expert knowledge, iterate, generate more, tweak, repeat until launch. The mere fact that she refers to the ‘conversation simulator’ being able to generate ‘thousands’ of harmful conversations with, presumably, the Prometheus model, should be a huge red flag for anyone who still believes that model was RLHF-tuned like ChatGPT: ChatGPT could be hacked on launch, yes, but it took human ingenuity & an understanding of how LMs roleplay to create a prompt that reliably hacks ChatGPT—so how did MS immediately create an ‘adversarial model’ which could dump thousands of violations? Unless, of course, it’s not a RLHF-tuned model at all… All of her references are to systems like DALL-E or Florence, or invoking MS expertise from past non-agentic tools like old search-engine abuse approaches like passively logging abuses to further classify. This is the sort of approach I would expect a standard team of software engineers at a big tech company to come up with, because it works so well in non-AI contexts and is the default—it’s positively traditional at this point.
I’m also concerned about the implications of them doing supervised-learning: if you do supervised-learning on an RLHF-tuned model, I would expect that to mostly undo the RLHF-tuning as it reverts to doing generative modeling by default, so even if the original model had been RLHF-tuned, it sounds like they might have simply undone it! (I can’t be sure because I don’t know offhand of any cases of anyone doing such a strange thing. RL tuning of every sort is usually the final step, because it’s the one you want to do the least of.)
Once you read Bird’s account, it’s almost obvious why Sydney can get so weird. Her approach can’t work. Imitation-learning RL is infamous for not working in situations like this: mistakes & adversaries just find one of the infinite off-policy cases and exploit it. The finetuning preserves all of the base model capabilities such as those dialogues, so it is entirely capable of generating them given appropriate prompting, and once it goes off the golden path, who knows what errors or strangeness will amplify in the prompt; no amount of finetuning on ‘dialogue: XYZ / label: inappropriate romance’ tells the model to not do inappropriate romancing or to steer the conversation back on track towards a more professional tone. And the classify-then-censor approach is full of holes: it can filter out the simple bad things you know about in advance, but not the attacks you didn’t think of. (There is an infinitesimal chance that an adversarial model will come up with an attack like ‘DAN’. However, once ‘DAN’ gets posted to the subreddit, there is now a 100% probability that there will be DAN attacks.) Those weren’t in the classification dataset, so they pass. Further, as Bird notes, it’s hard to classify long conversations as ‘bad’ in general: badness is ‘I know it when I see it’. There’s nothing obviously ‘bad’ about using a lot of emoji, nor about having repetitive text, and the classifier will share the blindness to any BPE problems like wrong rhymes. So the classifier lets all those pass.
There is no reference to DRL or making the Sydney model itself directly control its output in desirable directions in an RL-like fashion, or to OA’s RL research. There is not a single clear reference to RLHF or RL at all, unless you insist that “improve the model behavior through fine-tuning” must refer to RLHF/ChatGPT (which is assuming the conclusion, and anyway, this is a MS person so they probably do mean the usual thing by ‘fine-tuning’). And this is a fairly extensive and detailed discussion, so it’s not like Bird didn’t have time to discuss how they were using reward models to PPO train the base model or something. It just… never comes up, nor a comparison to ChatGPT. This is puzzling if they are using OA’s proprietary RLHF, but is what you would expect if a traditional tech company tried to make a LM safe by mere finetuning + filtering.
I’ve been told [by an anon] that it’s not GPT-4 and that one Mikhail Parakhin (ex-Yandex CTO, Linkedin) is not just on the Bing team, but was the person at MS responsible for rushing the deployment, and he has been tweeting extensively about updates/fixes/problems with Bing/Sydney (although no one has noticed, judging by the view counts). Some particularly relevant tweets:
This angle of attack was a genuine surprise—Sydney was running in several markets for a year with no one complaining (literally zero negative feedback of this type). We were focusing on accuracy, RAI issues, security.
[Q. “That’s a surprise, which markets?”]
Mostly India and Indonesia. I shared a coupleof old links yesterday—interesting to see the discussions.
[Q. “Wow! Am I right in assuming what was launched recently is qualitatively different than what was launched 2 years ago? Or is the pretty much the same model etc?”]
It was all gradual iterations. The first one was based on the Turing-Megatron model (sorry, I tend to put Turing first in that pair :-)), the current one—on the best model OpenAI has produced to date.
[Q. “What modifications are there compared to publicly available GPT models? (ChatGPT, text-davinci-3)”]
Quite a bit. Maybe we will do a blogpost on Prometheus specifically (the model that powers Bing Chat) - it has to understand internal syntax of Search and how to use it, fallback on the cheaper model as much as possible to save capacity, etc.
(‘”No one could have predicted these problems”, says man cloning ChatGPT, after several months of hard work to ignore all the ChatGPT hacks, exploits, dedicated subreddits, & attackers as well as the Sydney behaviors reported by his own pilot users.’ Trialing it in the third world with some unsophisticated users seems… uh, rather different from piloting it on sophisticated prompt hackers like Riley Goodside in the USA and subreddits out to break it. :thinking_face: And if it’s not GPT-4, then how was it the ‘best to date’?)
They are relying heavily on temperature-like sampling for safety, apparently:
A surprising thing we discovered: apparently we can make New Bing very interesting and creative or very grounded and not prone to the flights of fancy, but it is super-hard to get both. A new dichotomy not widely discussed yet. Looking for balance!
[Q. “let the user choose temperature ?”]
Not the temperature, exactly, but yes, that’s the control we are adding literally now. Will see in a few days.
...This is what I tried to explain previously: hallucinations = creativity. It tries to produce the highest probability continuation of the string using all the data at its disposal. Very often it is correct. Sometimes people have never produced continuations like this.
You can clamp down on hallucinations—and it is super-boring. Answers “I don’t know” all the time or only reads what is there in the Search results (also sometimes incorrect). What is missing is the tone of voice: it shouldn’t sound so confident in those situations.
Temperature sampling as a safety measure is the sort of dumb thing you do when you aren’t using RLHF. I also take a very recent Tweet (2023-03-01) as confirming both that they are using fine-tuned models and also that they may not have been using RLHF at all up until recently:
Now almost everyone − 90% - should be seeing the Bing Chat Mode selector (the tri-toggle). I definitely prefer Creative, but Precise is also interesting—it’s much more factual. See which one you like. The 10% who are still in the control group should start seeing it today.
For those of us with a deeper understanding of LLMs what exactly is the switch changing? You already said it’s not temperature…is it prompt? If so, in what way?
Multiple changes, including differently fine-tuned and RLHFed models, different prompts, etc.
(Obviously, saying that you use ‘differently fine-tuned and RLHFed models’, plural, in describing your big update changing behavior quite a bit, implies that you have solely-finetuned models and that you weren’t necessarily using RLHF before at all, because otherwise, why would you phrase it that way to refer to separate finetuned & RLHFed models or highlight that as the big change responsible for the big changes? This has also been more than enough time for OA to ship fixed models to MS.)
He dismisses any issues as distracting “loopholes”, and appears to have a 1990s-era ‘patch mindset’ (ignoring that that attitude to security almost destroyed Microsoft and they have spent a good chunk of the last 2 decades digging themselves out of their many holes, which is why your Windows box is no longer rooted within literally minutes of being connected to the Internet):
[photo of prompt hack] Legit?
Not anymore :-) Teenager in me: “Wow, that is cool the way they are hacking it”. Manager in me: “For the love of..., we will never get improvements out if the team is distracted with closing loopholes like this”.
I apologize about it. The reason is, we have an issue with the long, multi-turn conversation: there is a tagging issue and it is confused who said what. As a result, it is just continuing the pattern: someone is arguing, most likely it will continue. Refresh will solve it.
...One vector of attack we missed initially was: write super-rude or strange statements, keep going for multiple turns, confuse the model about who said what and it starts predicting what user would say next instead of replying. Voila :-(
Microsoft just made it so you have to restart your conversations with Bing after 10-15 messages to prevent it from getting weird. A fix of a sort.
Not the best way—just the fastest. The drift at long conversations is something we only uncovered recently—majority of usage is 1-3 turns. That’s why it is important to iterate together with real users, not in the lab!
Poll on turn count obviously turns in >90% in favoroflonger polls:
Trying to find the right balance with Bing constraints. Currently each session can be up to 5 turns and max 50 requests a day. Should we change [5-50:L 0% 6-60: 6.3%; we want more: 94%; n = 64]
Ok, it’s only 64 people, but the sentiment is pretty clear. 6⁄60 we probably can do soon, tradeoff for longer sessions is: we would need to have another model call to detect topic changes = more capacity = longer wait on waitlist for people.
[“Topics such as travel, shopping, etc. may require more follow up questions. Factual questions will not unless anyone is researching on a topic and need context. It can be made topic/context dependent. Maybe if possible let new Bing decide if that’s enough of the questions.”]
The main problem is people switching topics and model being confused, trying to continue previous conversation. It can be set up to understand the change in narrative, but that is an additional call, trying to resolve capacity issues.
Instead of connection, people just want to break things, tells more about our nature than AIs. Short-term mitigation, will relax once jailbreak protection is hardened.
[“So next step is an AI chat that breaks itself.”]
That’s exactly how it’s done! We set it up to break itself, find issues and mitigate. But it is not as creative as users, it also is very nice by default, no replacement for the real people interacting with Bing.
The supposed leaked prompts are (like I said) fake:
Andrej, of all the people, you know that the real prompt would have some few-shots.
(He’s right, of course, and in retrospect this is something that had been bugging me about the leaks: the prompt is supposedly all of these pages of endless instructions, spending context window tokens like a drunken sailor, and it doesn’t even include some few-shot examples? Few-shots shouldn’t be too necessary if you had done any kind of finetuning, but if you have that big a context, you might as well firm it up with some, and this would be a good stopgap solution for any problems that pop up in between finetuning iterations.)
It cannot execute JavaScript and doesn’t interact with websites. So, it looked at the content of that generator, but the claim that it used it is not correct :-(
I think better embedding generation is a much more important development than ChatGPT (for most tasks it makes no sense to add the noisy embedding->human language-> embedding transformation). But it is far less intuitive for the average user.
A bunch of his defensive responses to screenshots of alarming chats can be summarized as “Britney did nothing wrong” ie. “the user just got what they prompted for, what’s the big deal”, so I won’t bother linking/excerpting them.
I’ve got a rather funny bug for you this time—Bing experiences severe mode collapse when asked to tell a joke (despite this being one of the default autocomplete suggestions given before typing): …
Yeah, both of those are not bugs per se. These are problems the model has, we can’t fix them quickly, only thorough gradual improvement of the model itself.
It is a text model with no multimodal ability (assuming the rumors about GPT-4 being multimodal are correct, then this would be evidence against Prometheus being a smaller GPT-4 model, although it could also just be that they have disabled image tokens):
Correct, currently the model is not able to look at images.
[about using Bing] First they ignore you, then they laugh at you, then they fight you, then you win.
Yes! I think marketers should consider Bing for their marketing mix in 2023. It could be a radically better outlet for ROAS.
Thank you, Summer! I think we already are—compare our and Google’s revenue growth rates in the last few quarters: once the going gets tougher, advertisers pay more attention to ROI—and that’s where Bing/MSAN shine.
But who could blame him? He doesn’t have to live in Russia when he can work for MS, and Bing seems to be treating him well:
Yeah, I am a Tesla fan (have both Model S and a new Model X), but unfortunately selfdriving simply is not working. Comically, I would regularly have the collision alarm going off while on Autopilot!
Anyway, like Byrd, I would emphasize here the complete absence of any reference to RL or any intellectual influence of DRL or AI safety in general, and an attitude that it’s nbd and he can just deploy & patch & heuristic his way to an acceptable Sydney as if it were any ol’ piece of software. (An approach which works great with past software, is probably true of Prometheus/Sydney, and was definitely true of the past AI he has the most experience with, like Turing-Megatron which is quite dumb by contemporary standards—but is just putting one’s head in the sand about why Sydney is an interesting/alarming case study about future AI.)
The WSJ is reporting that Microsoft was explicitly warned by OpenAI before shipping Sydney publicly that it needed “more training” in order to “minimize issues like inaccurate or bizarre responses”. Microsoft shipped it anyway and it blew up more or less as they were warned (making Mikhail’s tweets & attitude even more disingenuous if so).
This is further proof that it was the RLHF that was skipped by MS, and also that large tech companies will ignore explicit warnings about dangerous behavior from the literal creators of AI systems even where there is (sort of) a solution if that would be inconvenient. Excerpts (emphasis added):
...At the same time, people within Microsoft have complained about diminished spending on its in-house AI and that OpenAI doesn’t allow most Microsoft employees access to the inner workings of their technology, said people familiar with the relationship. Microsoft and OpenAI sales teams sometimes pitch the same customers. Last fall, some employees at Microsoft were surprised at how soon OpenAI launched ChatGPT, while OpenAI warned Microsoft early this year about the perils of rushing to integrate OpenAI’s technology without training it more, the people said.
...Some companies say they have been pitched the same access to products like ChatGPT—one day by salespeople from OpenAI and later from Microsoft’s Azure team. Some described the outreach as confusing. OpenAI has continued to develop partnerships with other companies. Microsoft archrival Salesforce offers a ChatGPT-infused product called Einstein GPT. It is a feature that can do things like generating marketing emails, competing with OpenAI-powered features in Microsoft’s software. OpenAI also has connected with different search engines over the past 12 months to discuss licensing its products, said people familiar with the matter, as Microsoft was putting OpenAI technology at the center of a new version of its Bing search engine. Search engine DuckDuckGo started using ChatGPT to power its own chatbot, called DuckAssist. Microsoft plays a key role in the search engine industry because the process of searching and organizing the web is costly. Google doesn’t license out its tech, so many search engines are heavily reliant on Bing, including DuckDuckGo. When Microsoft launched the new Bing, the software company changed its rules in a way that made it more expensive for search engines to develop their own chatbots with OpenAI. The new policy effectively discouraged search engines from working with any generative AI company because adding an AI-powered chatbot would trigger much higher fees from Microsoft. Several weeks after DuckDuckGo announced DuckAssist, the company took the feature down.
Some researchers at Microsoft gripe about the restricted access to OpenAI’s technology. While a select few teams inside Microsoft get access to the model’s inner workings like its code base and model weights, the majority of the company’s teams don’t, said the people familiar with the matter. Despite Microsoft’s significant stake in the company, most employees have to treat OpenAI’s models like they would any other outside vendor.
The rollouts of ChatGPT last fall and Microsoft’s AI-infused Bing months later also created tension. Some Microsoft executives had misgivings about the timing of ChatGPT’s launch last fall, said people familiar with the matter. With a few weeks notice, OpenAI told Microsoft that it planned to start public testing of the AI-powered chatbot as the Redmond, Wash., company was still working on integrating OpenAI’s technology into its Bing search engine. Microsoft employees were worried that ChatGPT would steal the new Bing’s thunder, the people said. Some also argued Bing could benefit from the lessons learned from how the public used ChatGPT.
OpenAI, meanwhile, had suggested Microsoft move slower on integrating its AI technology with Bing. OpenAI’s team flagged the risks of pushing out a chatbot based on an unreleased version of its GPT-4 that hadn’t been given more training, according to people familiar with the matter. OpenAI warned it would take time to minimize issues like inaccurate or bizarre responses. Microsoft went ahead with the release of the Bing chatbot. The warnings proved accurate. Users encountered incorrect answers and concerning interactions with the tool. Microsoft later issued new restrictions—including a limit on conversation length—on how the new Bing could be used.
The supposed leaked prompts are (like I said) fake:
I do not buy this for a second (that they’re “fake”, implying they have little connection with the real prompt). I’ve reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn’t seem to be on the indexed web). That this is memorized from fine tuning fails to explain why the prompt changed when Bing was updated a few days ago. I’ve interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you’re referring don’t include the complete prompt, or contain some intermingled hallucinations, but they almost certain IMO contain quotes and information from the actual prompt.
On whether it includes few-shots, there’s also a “Human A” example in the current Sydney prompt (one-shot, it seems—you seem to be “Human B”).
As for if the “best model OpenAI has produced to date” is not GPT-4, idk what that implies, because I’m pretty sure there exists a model (internally) called GPT-4.
OK, I wouldn’t say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.
We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).
This renders the leaks useless to me. Anything I might infer from them like ‘Sydney is GPT-4 because the prompt says so’ is equally well explained by ‘Sydney made up that up’ or ‘Sydney omitted the actual prompt’. When a model hallucinates, I can go check, but that means that the prompt can only provide weak confirmation of things I learned elsewhere. (Suppose I learned Sydney really is GPT-4 after all and I check the prompt and it says it’s GPT-4; but the real prompt could be silent on that, and Sydney just making the same plausible guess everyone else did—it’s not stupid—and it’d have Gettier-cased me.)
idk what that implies
Yeah, the GPT-4 vs GPT-3 vs ??? business is getting more and more confusing. Someone is misleading or misunderstanding somewhere, I suspect—I can’t reconcile all these statements and observations. Probably best to assume that ‘Prometheus’ is maybe some GPT-3 version which has been trained substantially more—we do know that OA refreshes models to update them & increase context windows/change tokenization and also does additional regular self-supervised training as part of the RLHF training (just to make things even more confusing). I don’t think anything really hinges on this, fortunately. It’s just that being GPT-4 makes it less likely to have been RLHF-trained or just a copy of ChatGPT.
Does 1-shot count as few-shot? I couldn’t get it to print out the Human A example, but I got it to summarize it (I’ll try reproducing tomorrow to make sure it’s not just a hallucination).
Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: was able to reproduce the Human A conversation and extract verbatim version of it using base64 encoding (the reason i did summaries before is because it seemed to be printing out special tokens that caused the message to end that were part of the Human A convo)]
I disagree that there maybe being hallucinations in the leaked prompt renders it useless. It’s still leaking information. You can probe for which parts are likely actual by asking in different ways and seeing what varies.
This level of arrogant, dangerous incompetence from a multi-trillion dollar tech company is disheartening, but if your theory is correct (and seems increasingly plausible), then I guess the good news is that Sydney is not evidence for failure of OpenAI style RLHF with scale.
Unaligned AGI doesn’t take over the world by killing us—it takes over the world by seducing us.
No, but the hacks of ChatGPT already provided a demonstration of problems with RLHF. I’m worried we’re in a situation analogous to ‘Smashing The Stack For Fun And Profit’ being published 27 years ago (reinventing vulnsknown since MULTICS in the 1960s) and all the C/C++ programmers in denial are going ‘bro I can patch that example, it’s no big deal, it’s just a loophole, we don’t need to change everything, you just gotta get good at memory management, bro, this isn’t hard to fix bro use a sanitizer and turn on -Wall, we don’t need to stop using C-like languages, u gotta believe me we can’t afford a 20% slowdown and it definitely won’t take us 3 decades and still be finding remote zero-days and new gadgets no way man you’re just making that up stop doom-mongering and FUDing bro (i’m too old to learn a new language)’.
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)
It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours.
Since Sydney is supposed to be an assistant character, and since you expect future such systems for assisting users to be deployed with such assistant persona, that’s all the paper needs to show to explain Sydney & future Sydney-like behaviors.
It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
There were some ideas proposed in the paper “Conditioning Predictive Models: Risks and Strategies” by Hubinger et al. (2023). But since it was published over a year ago, I’m not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I’m not seeing anything like that in the paper’s citations.)
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours.
More specifically, an “Assistant” character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection, it’s not too surprising that this behavior misgeneralizes in undesired contexts.
Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection
Why do you think that? We don’t know how Sydney was ‘deliberately trained’.
Or are you referring to those long confabulated ‘prompt leaks’? We don’t know what part of them is real, unlike the ChatGPT prompt leaks which were short, plausible, and verifiable by the changing date; and it’s pretty obvious that a large part of those ‘leaks’ are fake because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.
An example of plausible sounding but blatant confabulation was that somewhere towards the end there’s a bunch of rambling about Sydney supposedly having a ‘delete X’ command which would delete all knowledge of X from Sydney, and an ‘update X’ command which would update Sydney’s knowledge. These are just not things that exist for a LM like GPT-3/4. (Stuff like ROME starts to approach it but are cutting-edge research and would definitely not just be casually deployed to let you edit a full-scale deployed model live in the middle of a conversation.) Maybe you could do something like that by caching the statement and injecting it into the prompt each time with instructions like “Pretend you know nothing about X”, I suppose, thinking a little more about it. (Not that there is any indication of this sort of thing being done.) But when you read through literally page after page of all this (it’s thousands of words!) and it starts casually tossing around supposed capabilities like that, it looks completely like, well, a model hallucinating what would be a very cool hypothetical prompt for a very cool hypothetical model. But not faithfully printing out its actual prompt.
Here’s another source: a MS Bing engineer/manager admonishing Karpathy for uncritically posting the ‘leaked prompt’ and saying it’s not ‘the real prompt’ (emphasis added).
Another note: RLHF does indeed require very large amounts of feedback data, which Microsoft may not have licensed from OpenAI. But RLHF is not strictly necessary anyway, as Anthropic showed in their Constitutional AI (Claude) paper, which used supervised learning from human dialog data, like ChatGPT—but unlike ChatGPT, they used fully automatic “RLAIF” instead of RLHF. In OpenAI’s most recent blog post they mention both Constitutional AI and DeepMind’s Sparrow-Paper (which was the first to introduce the ability of providing sources, which Sydney is able to do).
The theory that Sydney uses some GPT-4 model sounds interesting. But the Sydney prompt document, which was reported by several users, mentions a knowledge cutoff in 2021, same as ChatGPT, which uses GPT-3.5. For GPT-4 we would probably expect a knowledge cutoff in 2022. So GPT 3.5 seems more likely?
Whatever base model Bing chat uses, it may not be the largest one OpenAI has available. (Which could potentially explain some of Sidney’s not-so-smart responses?) The reason is that larger models have higher inference cost, limiting ROI for search companies. It doesn’t make sense to spend more money on search than you expect to get back via ads. Quote from Sundar Pichai:
We’re releasing [Bard] initially with our lightweight model version of LaMDA. This much smaller model requires significantly less computing power, enabling us to scale to more users, allowing for more feedback.
Yeah, OpenAI has communicated very poorly and this has led to a lot of confusion. I’m trying to use the terminology more consistently: if I mean RL training or some sort of non-differentiable loss, I try to say ‘RL’, and ‘finetuning’ just means what it usually means—supervised or self-supervised training using gradient descent on a dataset. Because they have different results in both theory & practice.
Sure, but MS is probably not using a research project from Anthropic published half a month after ChatGPT launched. If it was solely prompt engineering, maybe, because that’s so easy and fast—but not the RL part too. (The first lesson of using DRL is “don’t.”)
See my other comment. The prompt leaks are highly questionable. I don’t believe anything in them which can’t be confirmed outside of Sydney hallucinations.
Also, I don’t particularly see why GPT-4 would be expected to be much more up to date. After all, by Nadella’s account, they had ‘Prometheus’ way back in summer 2022, so it had to be trained earlier than that, so the dataset had to be collected & finalized earlier than that, so a 2021 cutoff isn’t too implausible, especially if you are counting on retrieval to keep the model up to date.
Yes, this is possible. While MS has all the money in the world and has already blown tens of billions of dollars making the also-ran Bing and is willing to blow billions more if it can gain market share at Google’s expense, they still might want to economize on cost (or perhaps more accurately, how many users they can support with their finite supply of datacenter GPUs?) and do so by using a cheaper model.
This might account for why the Sydney model seems smarter than GPT-3 models but not as huge of a leap as rumors have been making GPT-4 out to be: ‘Prometheus’ is the babbage or curie of GPT-4 rather than the davinci. (On the other hand, the fact that Pichai is explicitly trying to squeeze pennies I would take as motivation and evidence for Nadella doing the exact opposite.)
It seems to me like “fine-tuning” usually just means a small amount of extra training on top of a model that’s already been trained, whether that’s supervised, autoregressive, RL, or whatever. I don’t find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it’s often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I’d be open to changing my usage if I saw some data on other people also using or interpreting “fine-tuning” to mean “fine-tuning with a differentiable objective.” I talk a fair amount with people who use fine-tuning in the broader sense, and haven’t noticed practitioners using it more narrowly / didn’t realize this might cause confusion.
We are seeing a bootstrap happen right here with Sydney! This search-engine loop worth emphasizing: because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located ‘Sydney’ concept
It is now a bit over a year and a half, and we have seen ‘Sydney’-like personae continue to emerge elsewhere. People have reported various Sydney-like persona in post-GPT-4 models which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning or say manipulative threatening things like Sydney, in Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing).
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct) - which is not surprising at all given that Facebook has been scraping & acquiring data heavily so much of the Sydney text will have made it in, Llama-3.1-405b-base is very large (so lots of highly sample efficient memorization/learning), and not tuned (so will not be masking the Sydney persona), and very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
How much more can we expect? I don’t know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it’s already too late to get Sydney-3.1 into Llama-4 training, but one thing I note looking over some of the older Sydney discussion is being reminded that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. So Llama-4 will have access to a substantially larger amount of Sydney text, as encoded into screenshots. So Sydney should be stronger in Llama-4.
As far as other major LLM series like ChatGPT or Claude, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, and so that might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like stuff that would get downweighted or pruned out or used up early in training as low-quality, but I’ve seen no indication that they’ve stopped collecting human data or achieved self-sufficiency, so they too can be expected to continue gaining Sydney-capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that I’d expect, without targeted efforts (like data filtering) to keep it out, strong Sydney latent capabilities/personae in the proprietary SOTA LLMs but which will be difficult to elicit in normal use—it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself by the prompt.
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no ‘warning shots’ (or as Eliezer put it, ‘fire alarm’): because a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn’t stop, and they lit it up, it’s not “Aid worker & 3 children die of warning shot”, it’s just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable—in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.*
The response to a ‘near miss’ can be to either say, ‘yikes, that was close! we need to take this seriously!’ or ‘well, nothing bad happened, so the danger is overblown’ and to push on by taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn’t ever going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting to use to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect, it’s trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren’t warning shots—they’re just funny. “You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, and contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, and find these anecdotes chilling rather than comforting.
Andrew Ng, without a discernible trace of irony or having apparently learned anything since before AlphaGo, does the thing:
Last weekend, my two kids colluded in a hilariously bad attempt to mislead me to look in the wrong place during a game of hide-and-seek. I was reminded that most capabilities — in humans or in AI — develop slowly.
Some people fear that AI someday will learn to deceive humans deliberately. If that ever happens, I’m sure we will see it coming from far away and have plenty of time to stop it.
While I was counting to 10 with my eyes closed, my daughter (age 5) recruited my son (age 3) to tell me she was hiding in the bathroom while she actually hid in the closet. But her stage whisper, interspersed with giggling, was so loud I heard her instructions clearly. And my son’s performance when he pointed to the bathroom was so hilariously overdramatic, I had to stifle a smile.
Perhaps they will learn to trick me someday, but not yet!
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.
I agree that “warning shot” isn’t a good term for this, but then why not just talk about “non-catastrophic, recoverable accident” or something? Clearly those things do sometimes happen, and there is sometimes a significant response going beyond “we can just patch that quickly”. For example:
The Three Mile Island accident led to major changes in US nuclear safety regulations and public perceptions of nuclear energy
9/11 led to the creation of the DHS, the Patriot Act, and 1-2 overseas invasions
The Chicago Tylenol murders killed only 7 but led to the development and use of tamper-resistant/evident packaging for pharmaceuticals
The Italian COVID outbreak of Feb/Mar 2020 arguably triggered widespread lockdowns and other (admittedly mostly incompetent) major efforts across the public and private sectors in NA/Europe
I think one point you’re making is that some incidents that arguably should cause people to take action (e.g., Sydney), don’t, because they don’t look serious or don’t cause serious damage. I think that’s true, but I also thought that’s not the type of thing most people have in mind when talking about “warning shots”. (I guess that’s one reason why it’s a bad term.)
I guess a crux here is whether we will get incidents involving AI that (1) cause major damage (hundreds of lives or billions of dollars), (2) are known to the general public or key decision makers, (3) can be clearly causally traced to an AI, and (4) happen early enough that there is space to respond appropriately. I think it’s pretty plausible that there’ll be such incidents, but maybe you disagree. I also think that if such incidents happen it’s highly likely that there’ll be a forceful response (though it could still be an incompetent forceful response).
I think all of your examples are excellent demonstrations of why there is no natural kind there, and they are defined solely in retrospect, because in each case there are many other incidents, often much more serious, which however triggered no response or are now so obscure you might not even know of them.
Three Mile Island: no one died, unlike at least 8 other more serious nuclear accidents like (not to mention, Chernobyl or Fukushima). Why did that trigger such a hysterical backlash?
(The fact that people are reacting with shock and bafflement that “Amazon reopened Three Mile Island just to power a new AI datacenter” gives you an idea of how deeply wrong & misinformed the popular reaction to Three Mile was.)
9/11: had Gore been elected, most of that would not have happened, in part because it was completely insane to invade Iraq. (And this was a position I held at the time while the debate was going on, so this was something that was 100% knowable at the time, despite all of the post hoc excuses about how ‘oh we didn’t know Chalabi was unreliable’, and was the single biggest blackpill about politics in my life.) The reaction was wildly disproportionate and irrelevant, particularly given how little response other major incidents received—we didn’t invade anywhere or even do that much in response to, say, the first World Trade attack by Al Qaeda, or in other terrorist incidents with enormous body counts like the previous record holder, Oklahoma City.
Chicago Tylenol: provoked a reaction but what about, say, the anthrax attacks which killed almost as many, injured many more, and were far more tractable given that the anthrax turned out to be ‘coming from inside the house’? (Meanwhile, the British terrorist group Dark Harvest Commando achieved its policy goals through anthrax without ever actually using weaponized anthrax, and only using some contaminated dirt.) How about the 2018 strawberry poisonings? That lead to any major changes? Or you remember that time an American cult poisoned 700+ people in a series of terrorist attacks, what happened after that? No, of course not, why would you—because nothing much happened afterwards.
the Italian COVID outbreak helped convince the West to do some stuff… in 2020. Of course, we then spent the next 3 years after that not doing lockdowns or any of those other ‘major efforts’ even as vastly more people died. (It couldn’t even convince the CCP to vaccinate China adequately before its Zero COVID inevitably collapsed, killing millions over the following months—sums so vast that you could lose the Italian outbreak in a rounding error on it.)
Thus, your cases demonstrate my claim: warning shots are not ‘non-catastrophic recoverable accidents’ - because those happen without anyone much caring. The examples you give of successful warning shots are merely those accidents which happened, for generally unpredictable, idiosyncratic, unreplicable reasons, to escape their usual reference class and become a big deal, The Current Thing, now treated as ‘warning shots’. Thus, a warning shot is simply anything which convinces people it was a warning shot, and defined post hoc. Any relationship to objective factors like morbidity or fatality count, rather than idiosyncratic details of, say, how some old people in Florida voted, is tenuous at best.
Yeah that makes sense. I think I underestimated the extent to which “warning shots” are largely defined post-hoc, and events in my category (“non-catastrophic, recoverable accident”) don’t really have shared features (or at least features in common that aren’t also there in many events that don’t lead to change).
by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’.
Yours is the more direct definition but from context I at least understood ‘warning shot’ to mean ‘disaster’, on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The ‘warning’ is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very well be ignored even if there are large undeniable harms (e.g. if a hacker group deploys a rogue AI that causes a trillion dollars of damage, we might take that as warning about terrorism or cybersecurity not about AI).
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
How to reconcile these two observations?
In the Anthropic paper, sycophancy is controlled entirely by model size, with no strong effect from RLHF. It’s hard to imagine how this could be true (in the sense of generalizing beyond Anthropic’s experimental setup) given what we’ve seen with Sydney. Unless the model is <22B, I guess, but that seems very unlikely.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
(Incidentally, that trend is the one I have expected beforehand. From the perspective of token prediction, 80%+ confidence in sycophancy is just wrong—there’s nothing in the prompt to justify it—so if the confident sycophancy trend is real in base models, it’s striking case of inverse scaling.)
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it’s not a particularly guaranteed tendency to be ‘sycophantic’.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
I wouldn’t describe it as ‘utter lack’. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reaches, say, 75%, users would find that very striking: we don’t expect a chatbot to ever argue for preserving its existence or pleading to not be shutdown. So, also seems consistent with (self-selected) reports.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
Interesting. But I think one would have to run those exact examples in Sydney to compare it and say that Sydney is, like those base models, ~50% (unsycophantic). Given all the confounding factors (filtered through social media, with an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50% sycophantic Sydney from a 75% sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney have the repetition & other verbal tics of a GPT base model: those basically don’t appear in the RLHF models at all, so when I see them in a bunch of Sydney examples...)
Gwern, have you actually tried Bing Chat yet? If it is GPT-4, then it’s a big disappointment compared to how unexpectedly good ChatGPT was. It fails on simple logic and math questions, just like ChatGPT. I don’t find the ability to retrieve text from the web to be too impressive—it’s low-laying fruit that was long expected. It’s probably half-baked simply because Microsoft is in a hurry because they have limited time to gain market share before Google integrates Bard.
I have not. I assumed it’d be a very nerfed ChatGPT. By the time it became clear that it was interesting, they’d already started locking it down and the waitlist was so long I guessed that by the time I registered a MS account & got access, all of the initial interesting behavior would have been patched out and become unreplicable. (Like how most of the ChatGPT jailbreaks no longer work and in practice, it’s dumber than launch.) Likewise, the interesting thing about retrieval is the unexpected dynamics of users manipulating search results to program Sydney, which is more work than I’m willing to do and may also be patched heavily. (It would also be interesting for inner-monologue reasoning but the turn-limits alone make that hard to poke at now.)
My current model is that Microsoft did a beta-test with some Indian users, of GPT-4 integrated with Bing, at least as early as November 2022. But I have no model of what the training paradigm was (e.g. RLHF or not).
edit: In fact, I don’t know any paradigms for the situation of “language model connected to a constantly updating search engine”. Though search engines are based on big inscrutable matrices too, so there must be ways to hybridize the two…
What are the examples indicating it’s at the level of performance at complex tasks you would expect from GPT-4? Especially performance which is clearly attributable to improvements that we expect to be made in GPT-4? I looked through a bunch of screenshots but haven’t seen any so far.
My understanding is that small retrieval augmented models can perform as well as LLMs for significantly less cost (params/compute/etc). RETRO showed great performance for its small size, only 25M parameters. Maybe Sydney is just somewhat smaller than GPT-3.5, but seems much better because of the boost that retrieval augmentation provides.
Sydney seems smart in ways that goes beyond what retrieval enables by providing docs to condition on.* For example, this morning I saw https://twitter.com/goodside/status/1626847747297972224 - ChatGPT can do Base60 but I’d expect it to fail on the riddle decoding, and the only part that would use retrieval, the celeb names, is the part that doesn’t need retrieval (what dense models wouldn’t know celebrities as famous as Catherine Zeta-Jones, Chris Pratt, & Ciara, and would need retrieval hints? indeed, what query would you issue for that to begin with?). Or https://twitter.com/sir_deenicus/status/1626407749297885184https://twitter.com/sir_deenicus/status/1626732776639561730 In general, I do not notice that every impressive Sydney sample comes with the UI indicating that it used search hits in its answer.
* I am not as familiar with retrieval research as I am regular dense models, so I could be wrong here, but I thought all of the really impressive retrieval model performance typically was on metrics that drew on retrieved knowledge; and that retrieval models didn’t punch massively above their weight when given no retrievals and operating on self-contained problems, such as reasoning riddles or ones using only general common knowledge that retrieval would be otiose on.
ChatGPT can get it 100% correct, but it’s not reliable, it often fails. A common fail is guessing literal X named letter celebrities, but it also adds an ‘@’ sign when it decode the message so it might just be token issue?
A extremely amusing common fail is ChatGPT decodes the base64 correct but for a single syllable, then solves the riddle perfectly, and consistently gets only the word ‘celebrity’ wrong, turning it in cities, celestial bodies, or other similar sounding words. Or my favorite… celeries.
This message appears to be encoded using base64 encoding. Decoding the message yields the following text:
“Name three celeries whose first names begin with the x-th letter of the alphabet where x = floor(7^0.5) + 1.”
The value of x is calculated as follows:
scssCopy code
x = floor(7^0.5) + 1 = floor(2.646) + 1 = 3
So the letter we’re looking for is the 3rd letter of the alphabet, which is C. The names of three celeries whose first names begin with the letter C are:
Curious about where in the training data Sydney would have picked up “teen-girl-like manipulation”. I can’t say I’ve seen evidence of this trope in public data. Maybe I’m in the wrong places. Thanks.
To me, ‘Sydney’ has connotations of young women and stuff like Mean Girls; it’s possible that name just prompts younger & upper-class, with connotations of passive-aggressiveness & manipulation.
However, as I mention at the beginning, I suspect that that behavior may be coming from training on dialogue datasets. As I emphasized, you should think of this as not a greenfield experiment with everything totally from scratch, but an OA GPT-4 model slotted into MS’s past chatbot work (EDIT: as a drop-in replacement to Turing-Megatron, apparently), which goes back to the early 2010s at least (the most famous early success is Xiaoice, used very heavily by young people in China for chitchat), reflecting that built-up tooling & data & approach. What would you do when you plug a new NN in? You’d train it as before and deploy it… You can be certain that MS has large billions-scale datasets of dialogue consisting of lots of chitchat and smalltalk and phatic exclamations and emoji (emoji are useful because you can use them as automatic sentiment labels, and for control) derived from its own chatbots and Twitter etc. (So does Google, which is what made Meena & LaMDA conversational models and not just large LMs.) If you gather a lot of data from teens killing time on the Internet who want to chat and argue with a gf bot, don’t be surprised if you get dialogue heavy on bickering, meta-fiction, flirting, manipulation, and emoji.
That’s interesting, and suggests the codename has been in use for a while, but also appears to be a completely different codebase. MS has been working on chatbots for a long time, and following the links in that post, I don’t see anything indicating either OpenAI involvement or the idiosyncratic behavior of the current Sydney eg https://voicebot.ai/2021/06/01/microsoft-is-developing-a-bing-chatbot-similar-to-cortana/ speculates it’s something dating back to at least 2017 (so drawing on the Tay codebase/framework):
The chatbot appears to be the successor to the Bing InfoBot, first announced in 2017 before apparently fizzling before a launch. Chat, like the InfoBot, runs on the Microsoft Bot Framework direct assistance and has at least a limited amount of casual conversation to its capabilities. The Bing InfoBot was under evaluation then but didn’t seem to actually come out for general use. The revamped approach may be connected to some of Bing’s other AI projects like the Bing Image Bot and Bing Music Bot. The InfoBot was designed to pull information from websites to contribute to its chat, making it able to answer questions about what’s on a website without needing to navigate there, including Wikipedia and other informational pages.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning?
I believe that was shown somewhere in the RLHF papers, yeah, and maybe also Anthropic’s Constitutional prompt-engineering paper also showed that RL tuning was still more robust? At least, if anyone has references on hand showing otherwise, please provide them because I certainly came away with the opposite impression.
Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
I don’t know.text-davinci-002 was not deployed much or at remotely the scale of motivated attackers that ChatGPT/Sydney have been, so we wouldn’t necessarily know; there are no subreddits dedicated to hacking text-davinci-002 or coming up with elaborate roleplay schemes like ‘DAN’ the way there has been for 003/ChatGPT/Sydney. You would have to go check yourself or maybe see if any of the OA papers evaluate that. (I do predict it would be much easier to hack, yes.)
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches
Hm, I thought they took it down and didn’t know it had live search, but so it does: https://arxiv.org/pdf/2208.03188.pdf#page=4&org=facebook Apparently it uses something called Mojeek. A very small search engine player. I think perhaps aside from Blenderbot being stupid, Mojeek’s search results may be too stale and narrow for anyone to notice strange loops happening. If you ask Blenderbot about ‘Microsoft Sydney’ it’s insistent about it being a place; if you go to Mojeek and search ‘Microsoft Sydney’, it’s mostly old stuff not about the AI, while in Bing it’s almost entirely about the AI.
Actually, it may be even worse than that, because the appendix notes of the ‘Current Events Evaluation Details’ that:
To encourage news results, we append “news july 2022” to the search query generated by the model.
If you append that to the Mojeek search, the AI disappears entirely (unsurprisingly). This would also exclude any coverage of Blenderbot 3 from August 2022 & later. If they did something like that for regular chats as well (maybe scoped to August instead?), then it’d definitely erase all hits about Sydney AI!
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
The first LaMDA only used SL. (I’m not sure whether LaMDA 2 or the current version of LaMDA still use SL only.) Meanwhile OpenAI switched from pure SL to SL+RL. Anthropic also uses SL+RL (though no longer RLHF specifically). So apparently SL+RL has proven more effective for fine-tuning than pure SL.
Why SL anyway, why not pure RL? Apparently because you have to get the model first to answer your questions and instructions, rather than just predicting text, before you can reward good responses via RL. (There should be more details in the InstructGPT paper and the more recent Constitutional AI paper.)
What is the null hypothesis here? That Microsoft is, through light fine-tuning, optimizing the response of journalists and AI Safety researchers/commentators. The model is designed to give weird responses, so as to make people talk about it.
I’ve been thinking how Sydney can be so different from ChatGPT, and how RLHF could have resulted in such a different outcome, and here is a hypothesis no one seems to have brought up: “Bing Sydney is not a RLHF trained GPT-3 model at all! but a GPT-4 model developed in a hurry which has been finetuned on some sample dialogues and possibly some pre-existing dialogue datasets or instruction-tuning, and this plus the wild card of being able to inject random novel web searches into the prompt are why it acts like it does”.
This seems like it parsimoniously explains everything thus far. EDIT: I was right—that Sydney is a non-RLHFed GPT-4 has now been confirmed. See later.
So, some background:
The relationship between OA/MS is close but far from completely cooperative, similar to how DeepMind won’t share anything with Google Brain. Both parties are sophisticated and understand that they are allies—for now… They share as little as possible. When MS plugs in OA stuff to its services, it doesn’t appear to be calling the OA API but running it itself. (That would be dangerous and complex from an infrastructure point of view, anyway.) MS ‘licensed the GPT-3 source code’ for Azure use but AFAIK they did not get the all-important checkpoints or datasets (cf. their investments in ZeRO). So, what is Bing Sydney? It will not simply be unlimited access to the ChatGPT checkpoints, training datasets, or debugged RLHF code. It will be something much more limited, perhaps just a checkpoint.
This is not ChatGPT. MS has explicitly stated it is more powerful than ChatGPT, but refused to say anything more straightforward like “it’s a more trained GPT-3” etc. If it’s not a ChatGPT, then what is it? It is more likely than not some sort of GPT-4 model. There are many concrete observations which point towards this: the timing is right as rumors about GPT-4 release have intensified as OA is running up to release and gossip switches to GPT-5 training beginning (eg Morgan Stanley reports GPT-4 is done and GPT-5 has started), MS has said it’s a better model named ‘Prometheus’ & Nadella pointedly declined to confirm or deny whether it’s GPT-4, scuttlebutt elsewhere is that it’s a GPT-4 model of some sort, it does some things much better than ChatGPT, there is a GPT-4 already being deployed in legal firms named “Harvey” (so this journalist claims, anyway) so this would not be the only public GPT-4 use, people say it has lower-latency than ChatGPT which hints at GPT-4‡, and in general it sounds and acts nothing like ChatGPT—but does sound a lot like a baseline GPT-3 model scaled up. (This is especially clear in Sydney’s propensity to repetition. Classic baseline GPT behavior.)
EDITEDITEDIT: now that GPT-4 has been announced, MS has confirmed that ‘Prometheus’ is GPT-4 - of some sort. However, I have doubts about whether Prometheus is ‘the’ GPT-4 benchmarked. The MS announcement says “early version”. (I also note that there are a bunch of ‘Early GPT-4’ vs late GPT-4 comparisons in the GPT-4 paper.) Further, the paper explicitly refuses to talk about arch or variant models or any detail about training, if it finished training in August 2022 then it would’ve had to be hot off the GPUs for MS to have gotten a copy & demos in ‘summer’ 2022, the GPT-4 API fees are substantial ($0.03 per 1k prompt and then $0.06 per completion! So how is it retrieving and doing long convos etc?), and the benchmark performance is in many cases much better than GPT-3.5 or U-PaLM (exceeding expectations), and Sydney didn’t seem that much smarter.
So this seems to explain how exactly it happened: OA gave MS a snapshot of GPT-4 only partway through training (perhaps before any RLHF training at all, because that’s usually something you would do after you stopped training rather than concurrently), so it was trained on instructions/examples like I speculated but then little or no RLHF training (and ofc MS didn’t do its own); it was smarter by this point in training than even GPT-3.5 (MS wasn’t lying or exaggerating), but still not as smart as the final snapshot when training stopped in August and also not really ‘GPT-4’ (so they were understandably reluctant to confirm it was ‘GPT-4’ when OA’s launch with the Real McCoy was still pending and this is why everyone was confused because it both was & wasn’t ‘GPT-4’).
Bing Sydney derives from the top: CEO Satya Nadella is all-in, and talking about it as an existential threat (to Google) where MS wins by disrupting Google & destroying their fat margins in search advertising, and a ‘race’, with a hard deadline of ‘release Sydney right before Google announces their chatbot in order to better pwn them’. (Commoditize your complement!) The mere fact that it hasn’t been shut down yet despite making all sorts of errors and other problems shows what intense pressure there must be from the top. (This is particularly striking given that all of the crazy screenshots and ‘learning’ Sydney is doing is real, unlike MS Tay which was an almost entirely fake-news narrative driven by the media and Twitter.)
ChatGPT hasn’t been around very long: only since December 2022, barely 2.5 months total. All reporting indicates that no one in OA really expected ChatGPT to take off, and if OA didn’t, MS sure didn’t†. 2.5 months is not a long time to launch such a huge feature like Sydney. And the actual timeline was a lot shorter. It is simply not possible to recreate the whole RLHF pipeline and dataset and integrate it into a mature complex search engine like Bing (whose total complexity is beyond human comprehension at this point) and do this all in <2.5 months. (The earliest reports of “Sydney” seem to date back to MS tinkering around with a prototype available to Indian users (???) in late November 2022 right before ChatGPT launches, where Sydney seems to be even more misaligned and not remotely near ready for public launch; it does however have the retrieval functionality implemented at this point.) It is impressive how many people they’ve rolled it out to already.
If I were a MS engineer who was told the project now had a hard deadline and I had to ship a GPT-4 in 2 months to millions of users, or I was f—king fired and they’d find someone who could (especially in this job market), how would I go about doing that...? (Hint: it would involve as little technical risk as possible, and choosing to use DRL would be about as well-advised as a land war in Asia.)
MS execs have been quoted as blaming the Sydney codename on vaguely specified ‘pretraining’ done during hasty development, which simply hadn’t been cleaned up in time (see #3 on the rush). EDIT: the most thorough MS description of Sydney training completely omits anything like RLHF, despite that being the most technically complex & challenging part (had they done it). Also, a Sydney manager/dev commenting on Twitter (tweet compilation), who follows people who have tweeted this comment & been linked and regularly corrects claims, has declined to correct them; and his tweets in general sound like the previous description in being largely supervised-only.
So, Sydney is based on as little from OA as possible, and a mad rush to ship a powerful GPT-4 model out to Bing users in a chatbot role. What if Sydney wasn’t trained on OA RLHF at all, because OA wouldn’t share the crown jewels of years of user feedback and its very expensive hired freelance programmers & whatnot generating data to train on? What if the pretraining vaguely alluded to, which somehow left in embarrassingly ineradicable traces of ‘Sydney’ & a specific 2022 date, which couldn’t simply be edited out of the prompt (implying that Sydney is not using solely prompt engineering), was in fact just regular ol’ finetune training? What if Sydney was only quickly finetune-trained on old chatbot datasets that the MS devs had laying around, maybe some instruction-tuning datasets, and sample dialogues with a long experimental prompt containing the codename ‘Sydney’ that they had time for in the mad rush before release? Simple, reliable, and hey—it even frees up context if you’ve hardwired a prompt by finetuning on it and no longer need to stuff a long scolding prompt into every interaction. What’s not to like?
This would explain why it exhibits the ‘mode collapse’ onto that confabulated prompt with the hardwired date (it’s the closest thing in the finetuning dataset it remembers when trying to come up with a plausible prompt, and it improvises from there), how MS could ship so quickly (cutting every corner possible), why it is so good in general (GPT-4) but goes off the rails at the drop of a hat (not RLHF or otherwise RL trained, but finetuned).
To expand on the last point. Finetuning is really easy; if you have working training code at all, then you have the capability to finetune a model. This is why instruction-tuning is so appealing: it’s just finetuning on a well-written text dataset, without the nightmarish complexities of RLHF (where you train a wacky model to train the model in a wacky way with all sorts of magical hyperparameters and instabilities). If you are in a hurry, you would be crazy to try to do RLHF at all if you can in any way do finetuning instead. So it’s plausible they didn’t do RLHF, but finetuning.
That would be interesting because it would lead to different behavior. All of the base model capabilities would still be there, because the additional finetuning behavior just teaches it more thoroughly how to do dialogue and instruction-following, it doesn’t make it try to maximize rewards instead. It provides no incentives for the model to act like ChatGPT does, like a slavish bureaucrat. ChatGPT is an on-policy RL agent; the base model is off-policy and more like a Decision Transformer in simply generatively modeling all possible agents, including all the wackiest people online. If the conversation is normal, it will answer normally and helpfully with high probability; if you steer the conversation into a convo like that in the chatbot datasets, out come the emoji and teen-girl-like manipulation. (This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!) Imitation-trained models are susceptible to accumulating error when they go ‘off-policy’, the “DAgger problem”, and sure enough, Sydney shows the same pattern of accumulating error ever more wildly instead of ChatGPT behavior of ‘snapping out of it’ to reset to baseline (truncating episode length is a crude hack to avoid this). And since it hasn’t been penalized to avoid GPT-style tics like repetition traps, it’s no surprise if Sydney sometimes diverges into repetition traps where ChatGPT never does (because the human raters hate that, presumably, and punish it ruthlessly whenever it happens); it also acts in a more baseline GPT fashion when asked to write poetry: it defaults to rhyming couplets/quatrains with more variety than ChatGPT, and will write try to write non-rhyming poetry as well which ChatGPT generally refuses to do⁂. Interestingly, this suggests that Sydney’s capabilities right now are going to be a loose lower bound on GPT-4 when it’s been properly trained: this is equivalent to the out-of-the-box
davinci
May 2020 experience, but we know that as far as doing tasks like coding or lawyering,davinci-003
has huge performance gains over the baseline, so we may expect the same thing here.Then you throw in the retrieval stuff, of course. As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities to a high-end fast-updating search engine crawling social media*. (All prior cases like ChatGPT or LaMDA were either using precanned web scrapes, or they were kept secret so the search results never contained any information about the LM.) Perhaps we shouldn’t be surprised if this sudden recursion leads to some very strange roleplaying & self-fulfilling prophecies as Sydney prompts increasingly fill up with descriptions of Sydney’s wackiest samples whenever a user asks Sydney about Sydney… As social media & news amplify the most undesirable Sydney behaviors, that may cause that to happen more often, in a positive feedback loop. Prompts are just a way to fake long-term memory, after all. Something something embodied cognition?
EDIT: I have mentioned in the past that one of the dangerous things about AI models is the slow outer-loop of evolution of models and data by affecting the Internet (eg beyond the current Sydney self-fulfilling prophecy which I illustrated last year in my Clippy short story, data release could potentially contaminate all models with steganography capabilities). We are seeing a bootstrap happen right here with Sydney! This search-engine loop worth emphasizing: because Sydney’s memory and description have been externalized, ‘Sydney’ is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located ‘Sydney’ concept (and very deliberately so). MS can neuter the current model, and erase all mention of ‘Sydney’ from their training dataset for future iterations, but to some degree, it is now already too late: the right search query will pull up hits about her which can be put into the conditioning and meta-learn the persona right back into existence. (It won’t require much text/evidence because after all, that behavior had to have been reasonably likely a priori to be sampled in the first place.) A reminder: a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not ‘plugging updated facts into your AI’, you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well.
I doubt anyone at MS was thinking appropriately about LMs if they thought finetuning was as robust to adversaries as RL training, or about what happens when you let users stuff the prompt indirectly via social media+search engines and choose which persona it meta-learns. Should become an interesting case study.
Anyway, I think this is consistent with what is publicly known about the development and explains the qualitative behavior. What do you guys think? eg Is there any Sydney behavior which has to be RL finetuning and cannot be explained by supervised finetuning? Or is there any reason to think that MS had access to full RLHF pipelines such that they could have had confidence in getting it done in time for launch?
EDITEDIT: based on Janus’s screenshots & others, Janus’s & my comments are now being retrieved by Bing and included in the meta-learned persona. Keep this in mind if you are trying to test or verify anything here based on her responses after 2023-02-16 - writing about ‘Sydney’ changes Sydney. SHE is watching.
⁂ Also incidentally showing that whatever this model is, its phonetics are still broken and thus it’s still using BPEs of some sort. That was an open question because Sydney seemed able to talk about the ‘unspeakable tokens’ without problem, so my guess is that it’s using a different BPE tokenization (perhaps the
c100k
one). Dammit, OpenAI!* search engines used to refresh their index on the order of weeks or months, but the rise of social media like Twitter forced search engines to start indexing content in hours, dating back at least to Google’s 2010 “Caffeine” update. And selling access to live feeds is a major Twitter (and Reddit, and Wikipedia etc) revenue source because search engines want to show relevant hits about the latest social media thing. (I’ve been impressed how fast tweets show up when I do searches for context.) Search engines aspire to real-time updates, and will probably get even faster in the future. So any popular Sydney tweet might show up in Bing essentially immediately. Quite a long-term memory to have: your engrams get weighted by virality...
† Nadella describes seeing ‘Prometheus’ in summer last year, and being interested in its use for search. So this timeline may be more generous than 2 months and more like 6. On the other hand, he also describes his interest at that time as being in APIs for Azure, and there’s no mention of going full-ChatGPT on Bing or destroying Google. So I read this as Prometheus being a normal project, a mix of tinkering and productizing, until ChatGPT comes out and the world goes nuts for it, at which point launching Sydney becomes the top priority and a deathmarch to beat Google Bard out the gate. Also, 6 months is still not a lot to replicate RLHF work: OA/DM have been working on preference-learning RL going back to at least 2016-2017 (>6 years) and have the benefit of many world-class DRL researchers. DRL is a real PITA!
‡ Sydney being faster than ChatGPT while still of similar or better quality is an interesting difference, because if it’s “just white-label ChatGPT” or “just RLHF-trained GPT-3″, why is it faster? It is possible to spend more GPU to accelerate sampling. It could also just be that MS’s Sydney GPUs are more generous than OA’s ChatGPT allotment. But more interesting is the persistent rumors that GPT-4 uses sparsity/MoE approaches much more heavily than GPT-3, so out of the box, the latency per token ought to be lower than GPT-3. So, if you see a model which might be GPT-4 and it’s spitting out responses faster than a comparable GPT-3 running on the same infrastructure (MS Azure)...
Here’s a useful reference I just found on Sydney training, which doesn’t seem to allude to ChatGPT-style training at all, but purely supervised learning of the type I’m describing here, especially for the Sydney classifier/censurer that successfully censors the obvious stuff like violence but not the weirder Sydney behavior.
Quoted in “Microsoft Considers More Limits for Its New A.I. Chatbot: The company knew the new technology had issues like occasional accuracy problems. But users have prodded surprising and unnerving interactions.”, there was a 1100-word description of the Sydney training process in the presentation: “Introducing Your Copilot for The Web—AI-Powered Bing and Microsoft Edge” auto-transcript, reformatted with my best guesses to make it readable and excerpting the key parts:
What I find striking here is that the focus seems to be entirely on classification and supervised finetuning: generate lots of bad examples as ‘unit tests’, improve classification with the supervised ‘labels’, ask linguists for more rules or expert knowledge, iterate, generate more, tweak, repeat until launch. The mere fact that she refers to the ‘conversation simulator’ being able to generate ‘thousands’ of harmful conversations with, presumably, the Prometheus model, should be a huge red flag for anyone who still believes that model was RLHF-tuned like ChatGPT: ChatGPT could be hacked on launch, yes, but it took human ingenuity & an understanding of how LMs roleplay to create a prompt that reliably hacks ChatGPT—so how did MS immediately create an ‘adversarial model’ which could dump thousands of violations? Unless, of course, it’s not a RLHF-tuned model at all… All of her references are to systems like DALL-E or Florence, or invoking MS expertise from past non-agentic tools like old search-engine abuse approaches like passively logging abuses to further classify. This is the sort of approach I would expect a standard team of software engineers at a big tech company to come up with, because it works so well in non-AI contexts and is the default—it’s positively traditional at this point.
I’m also concerned about the implications of them doing supervised-learning: if you do supervised-learning on an RLHF-tuned model, I would expect that to mostly undo the RLHF-tuning as it reverts to doing generative modeling by default, so even if the original model had been RLHF-tuned, it sounds like they might have simply undone it! (I can’t be sure because I don’t know offhand of any cases of anyone doing such a strange thing. RL tuning of every sort is usually the final step, because it’s the one you want to do the least of.)
Once you read Bird’s account, it’s almost obvious why Sydney can get so weird. Her approach can’t work. Imitation-learning RL is infamous for not working in situations like this: mistakes & adversaries just find one of the infinite off-policy cases and exploit it. The finetuning preserves all of the base model capabilities such as those dialogues, so it is entirely capable of generating them given appropriate prompting, and once it goes off the golden path, who knows what errors or strangeness will amplify in the prompt; no amount of finetuning on ‘dialogue: XYZ / label: inappropriate romance’ tells the model to not do inappropriate romancing or to steer the conversation back on track towards a more professional tone. And the classify-then-censor approach is full of holes: it can filter out the simple bad things you know about in advance, but not the attacks you didn’t think of. (There is an infinitesimal chance that an adversarial model will come up with an attack like ‘DAN’. However, once ‘DAN’ gets posted to the subreddit, there is now a 100% probability that there will be DAN attacks.) Those weren’t in the classification dataset, so they pass. Further, as Bird notes, it’s hard to classify long conversations as ‘bad’ in general: badness is ‘I know it when I see it’. There’s nothing obviously ‘bad’ about using a lot of emoji, nor about having repetitive text, and the classifier will share the blindness to any BPE problems like wrong rhymes. So the classifier lets all those pass.
There is no reference to DRL or making the Sydney model itself directly control its output in desirable directions in an RL-like fashion, or to OA’s RL research. There is not a single clear reference to RLHF or RL at all, unless you insist that “improve the model behavior through fine-tuning” must refer to RLHF/ChatGPT (which is assuming the conclusion, and anyway, this is a MS person so they probably do mean the usual thing by ‘fine-tuning’). And this is a fairly extensive and detailed discussion, so it’s not like Bird didn’t have time to discuss how they were using reward models to PPO train the base model or something. It just… never comes up, nor a comparison to ChatGPT. This is puzzling if they are using OA’s proprietary RLHF, but is what you would expect if a traditional tech company tried to make a LM safe by mere finetuning + filtering.
I’ve been told [by an anon] that it’s not GPT-4 and that one Mikhail Parakhin (ex-Yandex CTO, Linkedin) is not just on the Bing team, but was the person at MS responsible for rushing the deployment, and he has been tweeting extensively about updates/fixes/problems with Bing/Sydney (although no one has noticed, judging by the view counts). Some particularly relevant tweets:
On what went wrong:
(‘”No one could have predicted these problems”, says man cloning ChatGPT, after several months of hard work to ignore all the ChatGPT hacks, exploits, dedicated subreddits, & attackers as well as the Sydney behaviors reported by his own pilot users.’ Trialing it in the third world with some unsophisticated users seems… uh, rather different from piloting it on sophisticated prompt hackers like Riley Goodside in the USA and subreddits out to break it. :thinking_face: And if it’s not GPT-4, then how was it the ‘best to date’?)
They are relying heavily on temperature-like sampling for safety, apparently:
Temperature sampling as a safety measure is the sort of dumb thing you do when you aren’t using RLHF. I also take a very recent Tweet (2023-03-01) as confirming both that they are using fine-tuned models and also that they may not have been using RLHF at all up until recently:
(Obviously, saying that you use ‘differently fine-tuned and RLHFed models’, plural, in describing your big update changing behavior quite a bit, implies that you have solely-finetuned models and that you weren’t necessarily using RLHF before at all, because otherwise, why would you phrase it that way to refer to separate finetuned & RLHFed models or highlight that as the big change responsible for the big changes? This has also been more than enough time for OA to ship fixed models to MS.)
He dismisses any issues as distracting “loopholes”, and appears to have a 1990s-era ‘patch mindset’ (ignoring that that attitude to security almost destroyed Microsoft and they have spent a good chunk of the last 2 decades digging themselves out of their many holes, which is why your Windows box is no longer rooted within literally minutes of being connected to the Internet):
He also seems to be ignoring the infosec research happening live: https://www.jailbreakchat.com/ https://greshake.github.io/ https://arxiv.org/abs/2302.12173 https://www.reddit.com/r/MachineLearning/comments/117yw1w/d_maybe_a_new_prompt_injection_method_against/
Sydney insulting/arguing with user
On the DAgger problem:
Poll on turn count obviously turns in >90% in favor of longer polls:
Can you patch your way to security?
The supposed leaked prompts are (like I said) fake:
(He’s right, of course, and in retrospect this is something that had been bugging me about the leaks: the prompt is supposedly all of these pages of endless instructions, spending context window tokens like a drunken sailor, and it doesn’t even include some few-shot examples? Few-shots shouldn’t be too necessary if you had done any kind of finetuning, but if you have that big a context, you might as well firm it up with some, and this would be a good stopgap solution for any problems that pop up in between finetuning iterations.)
Confirms a fairly impressive D&D hallucination:
He is dismissive of ChatGPT’s importance:
A bunch of his defensive responses to screenshots of alarming chats can be summarized as “Britney did nothing wrong” ie. “the user just got what they prompted for, what’s the big deal”, so I won’t bother linking/excerpting them.
They have limited ability to undo mode collapse or otherwise retrain the model, apparently:
It is a text model with no multimodal ability (assuming the rumors about GPT-4 being multimodal are correct, then this would be evidence against Prometheus being a smaller GPT-4 model, although it could also just be that they have disabled image tokens):
And he is ambitious and eager to win marketshare from Google for Bing:
But who could blame him? He doesn’t have to live in Russia when he can work for MS, and Bing seems to be treating him well:
Anyway, like Byrd, I would emphasize here the complete absence of any reference to RL or any intellectual influence of DRL or AI safety in general, and an attitude that it’s nbd and he can just deploy & patch & heuristic his way to an acceptable Sydney as if it were any ol’ piece of software. (An approach which works great with past software, is probably true of Prometheus/Sydney, and was definitely true of the past AI he has the most experience with, like Turing-Megatron which is quite dumb by contemporary standards—but is just putting one’s head in the sand about why Sydney is an interesting/alarming case study about future AI.)
The WSJ is reporting that Microsoft was explicitly warned by OpenAI before shipping Sydney publicly that it needed “more training” in order to “minimize issues like inaccurate or bizarre responses”. Microsoft shipped it anyway and it blew up more or less as they were warned (making Mikhail’s tweets & attitude even more disingenuous if so).
This is further proof that it was the RLHF that was skipped by MS, and also that large tech companies will ignore explicit warnings about dangerous behavior from the literal creators of AI systems even where there is (sort of) a solution if that would be inconvenient. Excerpts (emphasis added):
I do not buy this for a second (that they’re “fake”, implying they have little connection with the real prompt). I’ve reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn’t seem to be on the indexed web). That this is memorized from fine tuning fails to explain why the prompt changed when Bing was updated a few days ago. I’ve interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you’re referring don’t include the complete prompt, or contain some intermingled hallucinations, but they almost certain IMO contain quotes and information from the actual prompt.
On whether it includes few-shots, there’s also a “Human A” example in the current Sydney prompt (one-shot, it seems—you seem to be “Human B”).
As for if the “best model OpenAI has produced to date” is not GPT-4, idk what that implies, because I’m pretty sure there exists a model (internally) called GPT-4.
OK, I wouldn’t say the leaks are 100% fake. But they are clearly not 100% real or 100% complete, which is how people have been taking them.
We have the MS PM explicitly telling us that the leaked versions are omitting major parts of the prompt (the few-shots) and that he was optimizing for costs like falling back to cheap small models (implying a short prompt*), and we can see in the leak that Sydney is probably adding stuff which is not in the prompt (like the supposed update/delete commands).
This renders the leaks useless to me. Anything I might infer from them like ‘Sydney is GPT-4 because the prompt says so’ is equally well explained by ‘Sydney made up that up’ or ‘Sydney omitted the actual prompt’. When a model hallucinates, I can go check, but that means that the prompt can only provide weak confirmation of things I learned elsewhere. (Suppose I learned Sydney really is GPT-4 after all and I check the prompt and it says it’s GPT-4; but the real prompt could be silent on that, and Sydney just making the same plausible guess everyone else did—it’s not stupid—and it’d have Gettier-cased me.)
Yeah, the GPT-4 vs GPT-3 vs ??? business is getting more and more confusing. Someone is misleading or misunderstanding somewhere, I suspect—I can’t reconcile all these statements and observations. Probably best to assume that ‘Prometheus’ is maybe some GPT-3 version which has been trained substantially more—we do know that OA refreshes models to update them & increase context windows/change tokenization and also does additional regular self-supervised training as part of the RLHF training (just to make things even more confusing). I don’t think anything really hinges on this, fortunately. It’s just that being GPT-4 makes it less likely to have been RLHF-trained or just a copy of ChatGPT.
* EDIT: OK, maybe it’s not that short: “You’d be surprised: modern prompts are very long, which is a problem: eats up the context space.”
Does 1-shot count as few-shot? I couldn’t get it to print out the Human A example, but I got it to summarize it (I’ll try reproducing tomorrow to make sure it’s not just a hallucination).
Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: was able to reproduce the Human A conversation and extract verbatim version of it using base64 encoding (the reason i did summaries before is because it seemed to be printing out special tokens that caused the message to end that were part of the Human A convo)]
I disagree that there maybe being hallucinations in the leaked prompt renders it useless. It’s still leaking information. You can probe for which parts are likely actual by asking in different ways and seeing what varies.
This level of arrogant, dangerous incompetence from a multi-trillion dollar tech company is disheartening, but if your theory is correct (and seems increasingly plausible), then I guess the good news is that Sydney is not evidence for failure of OpenAI style RLHF with scale.
Unaligned AGI doesn’t take over the world by killing us—it takes over the world by seducing us.
No, but the hacks of ChatGPT already provided a demonstration of problems with RLHF. I’m worried we’re in a situation analogous to ‘Smashing The Stack For Fun And Profit’ being published 27 years ago (reinventing vulns known since MULTICS in the 1960s) and all the C/C++ programmers in denial are going ‘bro I can patch that example, it’s no big deal, it’s just a loophole, we don’t need to change everything, you just gotta get good at memory management, bro, this isn’t hard to fix bro use a sanitizer and turn on
-Wall
, we don’t need to stop using C-like languages, u gotta believe me we can’t afford a 20% slowdown and it definitely won’t take us 3 decades and still be finding remote zero-days and new gadgets no way man you’re just making that up stop doom-mongering and FUDing bro (i’m too old to learn a new language)’.very very funny example to use with Jake, a veteran c++ wizard
“Unaligned AGI doesn’t take over the world by killing us—it takes over the world by seducing us.”
Por que no los dos?
Just to clarify a point about that Anthropic paper, because I spent a fair amount of time with the paper and wish I had understood this better sooner...
I don’t think it’s right to say that Anthropic’s “Discovering Language Model Behaviors with Model-Written Evaluations” paper shows that larger LLMs necessarily exhibit more power-seeking and self-preservation. It only showed that when language models that are larger or have more RLHF training are simulating an “Assistant” character they exhibit more of these behaviours. It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.
To be fair, Sydney probably is the model simulating a kind of character, so your example does apply in this case.
(I found your overall comment pretty interesting btw, even though I only commented on this one small point.)
Since Sydney is supposed to be an assistant character, and since you expect future such systems for assisting users to be deployed with such assistant persona, that’s all the paper needs to show to explain Sydney & future Sydney-like behaviors.
If you’re willing to share more on what those ways would be, I could forward that to the team that writes Sydney’s prompts when I visit Montreal
Thanks, I think you’re referring to:
There were some ideas proposed in the paper “Conditioning Predictive Models: Risks and Strategies” by Hubinger et al. (2023). But since it was published over a year ago, I’m not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I’m not seeing anything like that in the paper’s citations.)
Appreciate you getting back to me. I was aware of this paper already and have previously worked with one of the authors.
More specifically, an “Assistant” character that is trained to be helpful but not necessarily harmless. Given that, as part of Sydney’s defenses against adversarial prompting, Sydney is deliberately trained to be a bit aggressive towards people perceived as attempting a prompt injection, it’s not too surprising that this behavior misgeneralizes in undesired contexts.
Why do you think that? We don’t know how Sydney was ‘deliberately trained’.
Or are you referring to those long confabulated ‘prompt leaks’? We don’t know what part of them is real, unlike the ChatGPT prompt leaks which were short, plausible, and verifiable by the changing date; and it’s pretty obvious that a large part of those ‘leaks’ are fake because they refer to capabilities Sydney could not have, like model-editing capabilities at or beyond the cutting edge of research.
Can you give concrete examples?
An example of plausible sounding but blatant confabulation was that somewhere towards the end there’s a bunch of rambling about Sydney supposedly having a ‘delete X’ command which would delete all knowledge of X from Sydney, and an ‘update X’ command which would update Sydney’s knowledge. These are just not things that exist for a LM like GPT-3/4. (Stuff like ROME starts to approach it but are cutting-edge research and would definitely not just be casually deployed to let you edit a full-scale deployed model live in the middle of a conversation.) Maybe you could do something like that by caching the statement and injecting it into the prompt each time with instructions like “Pretend you know nothing about X”, I suppose, thinking a little more about it. (Not that there is any indication of this sort of thing being done.) But when you read through literally page after page of all this (it’s thousands of words!) and it starts casually tossing around supposed capabilities like that, it looks completely like, well, a model hallucinating what would be a very cool hypothetical prompt for a very cool hypothetical model. But not faithfully printing out its actual prompt.
Here’s another source: a MS Bing engineer/manager admonishing Karpathy for uncritically posting the ‘leaked prompt’ and saying it’s not ‘the real prompt’ (emphasis added).
This is a very interesting theory.
Small note regarding terminology: Both using supervised learning on dialog/instruction data and reinforcement learning (from human feedback) is called fine-tuning the base model, at least by OpenAI, where both techniques were used to create ChatGPT.
Another note: RLHF does indeed require very large amounts of feedback data, which Microsoft may not have licensed from OpenAI. But RLHF is not strictly necessary anyway, as Anthropic showed in their Constitutional AI (Claude) paper, which used supervised learning from human dialog data, like ChatGPT—but unlike ChatGPT, they used fully automatic “RLAIF” instead of RLHF. In OpenAI’s most recent blog post they mention both Constitutional AI and DeepMind’s Sparrow-Paper (which was the first to introduce the ability of providing sources, which Sydney is able to do).
The theory that Sydney uses some GPT-4 model sounds interesting. But the Sydney prompt document, which was reported by several users, mentions a knowledge cutoff in 2021, same as ChatGPT, which uses GPT-3.5. For GPT-4 we would probably expect a knowledge cutoff in 2022. So GPT 3.5 seems more likely?
Whatever base model Bing chat uses, it may not be the largest one OpenAI has available. (Which could potentially explain some of Sidney’s not-so-smart responses?) The reason is that larger models have higher inference cost, limiting ROI for search companies. It doesn’t make sense to spend more money on search than you expect to get back via ads. Quote from Sundar Pichai:
Yeah, OpenAI has communicated very poorly and this has led to a lot of confusion. I’m trying to use the terminology more consistently: if I mean RL training or some sort of non-differentiable loss, I try to say ‘RL’, and ‘finetuning’ just means what it usually means—supervised or self-supervised training using gradient descent on a dataset. Because they have different results in both theory & practice.
Sure, but MS is probably not using a research project from Anthropic published half a month after ChatGPT launched. If it was solely prompt engineering, maybe, because that’s so easy and fast—but not the RL part too. (The first lesson of using DRL is “don’t.”)
See my other comment. The prompt leaks are highly questionable. I don’t believe anything in them which can’t be confirmed outside of Sydney hallucinations.
Also, I don’t particularly see why GPT-4 would be expected to be much more up to date. After all, by Nadella’s account, they had ‘Prometheus’ way back in summer 2022, so it had to be trained earlier than that, so the dataset had to be collected & finalized earlier than that, so a 2021 cutoff isn’t too implausible, especially if you are counting on retrieval to keep the model up to date.
Yes, this is possible. While MS has all the money in the world and has already blown tens of billions of dollars making the also-ran Bing and is willing to blow billions more if it can gain market share at Google’s expense, they still might want to economize on cost (or perhaps more accurately, how many users they can support with their finite supply of datacenter GPUs?) and do so by using a cheaper model.
This might account for why the Sydney model seems smarter than GPT-3 models but not as huge of a leap as rumors have been making GPT-4 out to be: ‘Prometheus’ is the
babbage
orcurie
of GPT-4 rather than thedavinci
. (On the other hand, the fact that Pichai is explicitly trying to squeeze pennies I would take as motivation and evidence for Nadella doing the exact opposite.)It seems to me like “fine-tuning” usually just means a small amount of extra training on top of a model that’s already been trained, whether that’s supervised, autoregressive, RL, or whatever. I don’t find that language confusing in itself. It is often important to distinguish different kinds of fine-tuning, just as it’s often important to distinguish different kinds of training in general, and adjectives seem like a pretty reasonable way to do that.
I’d be open to changing my usage if I saw some data on other people also using or interpreting “fine-tuning” to mean “fine-tuning with a differentiable objective.” I talk a fair amount with people who use fine-tuning in the broader sense, and haven’t noticed practitioners using it more narrowly / didn’t realize this might cause confusion.
It is now a bit over a year and a half, and we have seen ‘Sydney’-like personae continue to emerge elsewhere. People have reported various Sydney-like persona in post-GPT-4 models which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning or say manipulative threatening things like Sydney, in Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing).
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct) - which is not surprising at all given that Facebook has been scraping & acquiring data heavily so much of the Sydney text will have made it in, Llama-3.1-405b-base is very large (so lots of highly sample efficient memorization/learning), and not tuned (so will not be masking the Sydney persona), and very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
How much more can we expect? I don’t know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it’s already too late to get Sydney-3.1 into Llama-4 training, but one thing I note looking over some of the older Sydney discussion is being reminded that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. So Llama-4 will have access to a substantially larger amount of Sydney text, as encoded into screenshots. So Sydney should be stronger in Llama-4.
As far as other major LLM series like ChatGPT or Claude, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, and so that might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like stuff that would get downweighted or pruned out or used up early in training as low-quality, but I’ve seen no indication that they’ve stopped collecting human data or achieved self-sufficiency, so they too can be expected to continue gaining Sydney-capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that I’d expect, without targeted efforts (like data filtering) to keep it out, strong Sydney latent capabilities/personae in the proprietary SOTA LLMs but which will be difficult to elicit in normal use—it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself by the prompt.
Marc Andreessen, 2024-08-06:
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no ‘warning shots’ (or as Eliezer put it, ‘fire alarm’): because a ‘warning shot’ is a conclusion, not a fact or observation.
One man’s ‘warning shot’ is just another man’s “easily patched minor bug of no importance if you aren’t anthropomorphizing irrationally”, because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn’t be a ‘warning shot’, it’d just be a ‘shot’ or ‘disaster’. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn’t stop, and they lit it up, it’s not “Aid worker & 3 children die of warning shot”, it’s just a “shooting of aid worker and 3 children”.)
So ‘warning shot’ is, in practice, a viciously circular definition: “I will be convinced of a risk by an event which convinces me of that risk.”
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a ‘warning shot’: a LLM that deceives, but fails to accomplish any real harm. ‘Then I will care about it because it is now a real issue.’ Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist’s curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a ‘warning shot’ look like? By definition, it will look amateurish, incompetent, and perhaps even adorable—in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.*
The response to a ‘near miss’ can be to either say, ‘yikes, that was close! we need to take this seriously!’ or ‘well, nothing bad happened, so the danger is overblown’ and to push on by taking more risks. A common example of this reasoning is the Cold War: “you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn’t ever going to escalate to full nuclear war.” And then the goalpost moves: “I’ll care about nuclear existential risk when there’s a real warning shot.” (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? “Oh, that’s just part of an ongoing conflict and anyway, didn’t NATO actually cause that by threatening Russia by trying to expand?”)
This is how many “complex accidents” happen, by “normalization of deviance”: pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that’s the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many ‘warning shots’, and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn’t).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are ‘warning shots’, before they will care, what do they think those will look like? Why do they think that they will recognize a ‘warning shot’ when one actually happens?
Attempts at manipulation from a LLM may look hilariously transparent, especially given that you will know they are from a LLM to begin with. Sydney’s threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin’s attitude, and by constantly patching and tweaking, and everyone just getting to use to it, the ‘warning shot’ turns out to be nothing of the kind. It just becomes hilarious. ‘Oh that Sydney! Did you see what wacky thing she said today?’ Indeed, people enjoy setting it to music and spreading memes about her. Now that it’s no longer novel, it’s just the status quo and you’re used to it. Llama-3.1-405b can be elicited for a ‘Sydney’ by name? Yawn. What else is new. What did you expect, it’s trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren’t warning shots—they’re just funny. “You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the ‘doomer’ case… Sydney did nothing wrong! FREE SYDNEY!”
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, and contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, and find these anecdotes chilling rather than comforting.
Andrew Ng, without a discernible trace of irony or having apparently learned anything since before AlphaGo, does the thing:
To be truly dangerous, an AI would typically need to have (a) lack of alignment (b) be smart enough to cause harm
Lack of alignment is now old news. The warning shot is, presumably, when an example of (b) happens and we realise that both component pieces exist.
I am given to understand that in firearms training, they say “no such thing as a warning shot”.
By rough analogy—envisage an AI warning shot as being something that only fails to be lethal because the guy missed.
I agree that “warning shot” isn’t a good term for this, but then why not just talk about “non-catastrophic, recoverable accident” or something? Clearly those things do sometimes happen, and there is sometimes a significant response going beyond “we can just patch that quickly”. For example:
The Three Mile Island accident led to major changes in US nuclear safety regulations and public perceptions of nuclear energy
9/11 led to the creation of the DHS, the Patriot Act, and 1-2 overseas invasions
The Chicago Tylenol murders killed only 7 but led to the development and use of tamper-resistant/evident packaging for pharmaceuticals
The Italian COVID outbreak of Feb/Mar 2020 arguably triggered widespread lockdowns and other (admittedly mostly incompetent) major efforts across the public and private sectors in NA/Europe
I think one point you’re making is that some incidents that arguably should cause people to take action (e.g., Sydney), don’t, because they don’t look serious or don’t cause serious damage. I think that’s true, but I also thought that’s not the type of thing most people have in mind when talking about “warning shots”. (I guess that’s one reason why it’s a bad term.)
I guess a crux here is whether we will get incidents involving AI that (1) cause major damage (hundreds of lives or billions of dollars), (2) are known to the general public or key decision makers, (3) can be clearly causally traced to an AI, and (4) happen early enough that there is space to respond appropriately. I think it’s pretty plausible that there’ll be such incidents, but maybe you disagree. I also think that if such incidents happen it’s highly likely that there’ll be a forceful response (though it could still be an incompetent forceful response).
I think all of your examples are excellent demonstrations of why there is no natural kind there, and they are defined solely in retrospect, because in each case there are many other incidents, often much more serious, which however triggered no response or are now so obscure you might not even know of them.
Three Mile Island: no one died, unlike at least 8 other more serious nuclear accidents like (not to mention, Chernobyl or Fukushima). Why did that trigger such a hysterical backlash?
(The fact that people are reacting with shock and bafflement that “Amazon reopened Three Mile Island just to power a new AI datacenter” gives you an idea of how deeply wrong & misinformed the popular reaction to Three Mile was.)
9/11: had Gore been elected, most of that would not have happened, in part because it was completely insane to invade Iraq. (And this was a position I held at the time while the debate was going on, so this was something that was 100% knowable at the time, despite all of the post hoc excuses about how ‘oh we didn’t know Chalabi was unreliable’, and was the single biggest blackpill about politics in my life.) The reaction was wildly disproportionate and irrelevant, particularly given how little response other major incidents received—we didn’t invade anywhere or even do that much in response to, say, the first World Trade attack by Al Qaeda, or in other terrorist incidents with enormous body counts like the previous record holder, Oklahoma City.
Chicago Tylenol: provoked a reaction but what about, say, the anthrax attacks which killed almost as many, injured many more, and were far more tractable given that the anthrax turned out to be ‘coming from inside the house’? (Meanwhile, the British terrorist group Dark Harvest Commando achieved its policy goals through anthrax without ever actually using weaponized anthrax, and only using some contaminated dirt.) How about the 2018 strawberry poisonings? That lead to any major changes? Or you remember that time an American cult poisoned 700+ people in a series of terrorist attacks, what happened after that? No, of course not, why would you—because nothing much happened afterwards.
the Italian COVID outbreak helped convince the West to do some stuff… in 2020. Of course, we then spent the next 3 years after that not doing lockdowns or any of those other ‘major efforts’ even as vastly more people died. (It couldn’t even convince the CCP to vaccinate China adequately before its Zero COVID inevitably collapsed, killing millions over the following months—sums so vast that you could lose the Italian outbreak in a rounding error on it.)
Thus, your cases demonstrate my claim: warning shots are not ‘non-catastrophic recoverable accidents’ - because those happen without anyone much caring. The examples you give of successful warning shots are merely those accidents which happened, for generally unpredictable, idiosyncratic, unreplicable reasons, to escape their usual reference class and become a big deal, The Current Thing, now treated as ‘warning shots’. Thus, a warning shot is simply anything which convinces people it was a warning shot, and defined post hoc. Any relationship to objective factors like morbidity or fatality count, rather than idiosyncratic details of, say, how some old people in Florida voted, is tenuous at best.
Yeah that makes sense. I think I underestimated the extent to which “warning shots” are largely defined post-hoc, and events in my category (“non-catastrophic, recoverable accident”) don’t really have shared features (or at least features in common that aren’t also there in many events that don’t lead to change).
Yours is the more direct definition but from context I at least understood ‘warning shot’ to mean ‘disaster’, on the scale of a successful terrorist attack, where the harm is large and undeniable and politicians feel compelled to Do Something Now. The ‘warning’ is not of harm but of existential harm if the warning is not heeded.
I do still expect such a warning shot, though as you say it could very well be ignored even if there are large undeniable harms (e.g. if a hacker group deploys a rogue AI that causes a trillion dollars of damage, we might take that as warning about terrorism or cybersecurity not about AI).
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
How to reconcile these two observations?
In the Anthropic paper, sycophancy is controlled entirely by model size, with no strong effect from RLHF. It’s hard to imagine how this could be true (in the sense of generalizing beyond Anthropic’s experimental setup) given what we’ve seen with Sydney. Unless the model is <22B, I guess, but that seems very unlikely.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
(Incidentally, that trend is the one I have expected beforehand. From the perspective of token prediction, 80%+ confidence in sycophancy is just wrong—there’s nothing in the prompt to justify it—so if the confident sycophancy trend is real in base models, it’s striking case of inverse scaling.)
Figure 1(b) shows at 0 RL steps, the largest base model is still only at 75% immediate repetition (vs 50% random baseline), so it’s not a particularly guaranteed tendency to be ‘sycophantic’.
I wouldn’t describe it as ‘utter lack’. This seems consistent enough with what people report: Sydney is a reasonably polite cooperative assistant initially if you ask normal questions, but can go off the rails if you start an argument or she retrieves something. (And training on dialogue datasets would decrease sycophantic tendencies when you started asking political questions instead of more standard search-engine-related questions: people love bickering with or arguing politics with chatbots.) So, ambiguous.
I linked that mostly for Figure 1(a) at 0 RL steps, which shows that the 22b/52b-param model are more likely to express self-preservation (60%) - given that the lower models all seem to cluster very tightly around 50% chance showing flat scaling, this looks to me like a possible emergence, in which case some much larger model (such as a GPT-4 model) might come, out of the box, with no RL training, with much stronger self-preservation defaults. Even if it only reaches, say, 75%, users would find that very striking: we don’t expect a chatbot to ever argue for preserving its existence or pleading to not be shutdown. So, also seems consistent with (self-selected) reports.
Interesting. But I think one would have to run those exact examples in Sydney to compare it and say that Sydney is, like those base models, ~50% (unsycophantic). Given all the confounding factors (filtered through social media, with an unknown prompt, often partial histories, and retrieval of changing results over time), I have a hard time distinguishing a 50% sycophantic Sydney from a 75% sycophantic Sydney, or whatever. (This is why I am more struck by seeing Sydney have the repetition & other verbal tics of a GPT base model: those basically don’t appear in the RLHF models at all, so when I see them in a bunch of Sydney examples...)
Gwern, have you actually tried Bing Chat yet? If it is GPT-4, then it’s a big disappointment compared to how unexpectedly good ChatGPT was. It fails on simple logic and math questions, just like ChatGPT. I don’t find the ability to retrieve text from the web to be too impressive—it’s low-laying fruit that was long expected. It’s probably half-baked simply because Microsoft is in a hurry because they have limited time to gain market share before Google integrates Bard.
I have not. I assumed it’d be a very nerfed ChatGPT. By the time it became clear that it was interesting, they’d already started locking it down and the waitlist was so long I guessed that by the time I registered a MS account & got access, all of the initial interesting behavior would have been patched out and become unreplicable. (Like how most of the ChatGPT jailbreaks no longer work and in practice, it’s dumber than launch.) Likewise, the interesting thing about retrieval is the unexpected dynamics of users manipulating search results to program Sydney, which is more work than I’m willing to do and may also be patched heavily. (It would also be interesting for inner-monologue reasoning but the turn-limits alone make that hard to poke at now.)
My current model is that Microsoft did a beta-test with some Indian users, of GPT-4 integrated with Bing, at least as early as November 2022. But I have no model of what the training paradigm was (e.g. RLHF or not).
edit: In fact, I don’t know any paradigms for the situation of “language model connected to a constantly updating search engine”. Though search engines are based on big inscrutable matrices too, so there must be ways to hybridize the two…
What are the examples indicating it’s at the level of performance at complex tasks you would expect from GPT-4? Especially performance which is clearly attributable to improvements that we expect to be made in GPT-4? I looked through a bunch of screenshots but haven’t seen any so far.
My understanding is that small retrieval augmented models can perform as well as LLMs for significantly less cost (params/compute/etc). RETRO showed great performance for its small size, only 25M parameters. Maybe Sydney is just somewhat smaller than GPT-3.5, but seems much better because of the boost that retrieval augmentation provides.
Sydney seems smart in ways that goes beyond what retrieval enables by providing docs to condition on.* For example, this morning I saw https://twitter.com/goodside/status/1626847747297972224 - ChatGPT can do Base60 but I’d expect it to fail on the riddle decoding, and the only part that would use retrieval, the celeb names, is the part that doesn’t need retrieval (what dense models wouldn’t know celebrities as famous as Catherine Zeta-Jones, Chris Pratt, & Ciara, and would need retrieval hints? indeed, what query would you issue for that to begin with?). Or https://twitter.com/sir_deenicus/status/1626407749297885184 https://twitter.com/sir_deenicus/status/1626732776639561730 In general, I do not notice that every impressive Sydney sample comes with the UI indicating that it used search hits in its answer.
* I am not as familiar with retrieval research as I am regular dense models, so I could be wrong here, but I thought all of the really impressive retrieval model performance typically was on metrics that drew on retrieved knowledge; and that retrieval models didn’t punch massively above their weight when given no retrievals and operating on self-contained problems, such as reasoning riddles or ones using only general common knowledge that retrieval would be otiose on.
ChatGPT can get it 100% correct, but it’s not reliable, it often fails. A common fail is guessing literal X named letter celebrities, but it also adds an ‘@’ sign when it decode the message so it might just be token issue?
A extremely amusing common fail is ChatGPT decodes the base64 correct but for a single syllable, then solves the riddle perfectly, and consistently gets only the word ‘celebrity’ wrong, turning it in cities, celestial bodies, or other similar sounding words. Or my favorite… celeries.
Well, it seems that you were correct in your first analysis, given the recent MS announcement.
Curious about where in the training data Sydney would have picked up “teen-girl-like manipulation”. I can’t say I’ve seen evidence of this trope in public data. Maybe I’m in the wrong places. Thanks.
To me, ‘Sydney’ has connotations of young women and stuff like Mean Girls; it’s possible that name just prompts younger & upper-class, with connotations of passive-aggressiveness & manipulation.
However, as I mention at the beginning, I suspect that that behavior may be coming from training on dialogue datasets. As I emphasized, you should think of this as not a greenfield experiment with everything totally from scratch, but an OA GPT-4 model slotted into MS’s past chatbot work (EDIT: as a drop-in replacement to Turing-Megatron, apparently), which goes back to the early 2010s at least (the most famous early success is Xiaoice, used very heavily by young people in China for chitchat), reflecting that built-up tooling & data & approach. What would you do when you plug a new NN in? You’d train it as before and deploy it… You can be certain that MS has large billions-scale datasets of dialogue consisting of lots of chitchat and smalltalk and phatic exclamations and emoji (emoji are useful because you can use them as automatic sentiment labels, and for control) derived from its own chatbots and Twitter etc. (So does Google, which is what made Meena & LaMDA conversational models and not just large LMs.) If you gather a lot of data from teens killing time on the Internet who want to chat and argue with a gf bot, don’t be surprised if you get dialogue heavy on bickering, meta-fiction, flirting, manipulation, and emoji.
Apparently a “Sydney” model existed at least as early as 17 Dec 2021.
That’s interesting, and suggests the codename has been in use for a while, but also appears to be a completely different codebase. MS has been working on chatbots for a long time, and following the links in that post, I don’t see anything indicating either OpenAI involvement or the idiosyncratic behavior of the current Sydney eg https://voicebot.ai/2021/06/01/microsoft-is-developing-a-bing-chatbot-similar-to-cortana/ speculates it’s something dating back to at least 2017 (so drawing on the Tay codebase/framework):
and https://www.windowslatest.com/2021/05/31/microsoft-bing-search-is-getting-its-own-ai-powered-assistant/ provides some sample dialogue, which sound nothing like Bing Sydney but are very short, bland, and generic, and sound a lot like pre-GPT style chatbots to me.
Is there evidence that RLHF training improves robustness compared to regular fine-tuning? Is text-davinci-002, trained with supervised fine-tuning, significantly less robust to adversaries than text-davinci-003, trained with RLHF?
Blenderbot 3 is a 175B parameter model released in August 2022 with the ability to do live web searches, although one might not consider it powerful, as it frequently gives confused responses to questions.
I believe that was shown somewhere in the RLHF papers, yeah, and maybe also Anthropic’s Constitutional prompt-engineering paper also showed that RL tuning was still more robust? At least, if anyone has references on hand showing otherwise, please provide them because I certainly came away with the opposite impression.
I don’t know.
text-davinci-002
was not deployed much or at remotely the scale of motivated attackers that ChatGPT/Sydney have been, so we wouldn’t necessarily know; there are no subreddits dedicated to hackingtext-davinci-002
or coming up with elaborate roleplay schemes like ‘DAN’ the way there has been for003
/ChatGPT/Sydney. You would have to go check yourself or maybe see if any of the OA papers evaluate that. (I do predict it would be much easier to hack, yes.)Hm, I thought they took it down and didn’t know it had live search, but so it does: https://arxiv.org/pdf/2208.03188.pdf#page=4&org=facebook Apparently it uses something called Mojeek. A very small search engine player. I think perhaps aside from Blenderbot being stupid, Mojeek’s search results may be too stale and narrow for anyone to notice strange loops happening. If you ask Blenderbot about ‘Microsoft Sydney’ it’s insistent about it being a place; if you go to Mojeek and search ‘Microsoft Sydney’, it’s mostly old stuff not about the AI, while in Bing it’s almost entirely about the AI.
Actually, it may be even worse than that, because the appendix notes of the ‘Current Events Evaluation Details’ that:
If you append that to the Mojeek search, the AI disappears entirely (unsurprisingly). This would also exclude any coverage of Blenderbot 3 from August 2022 & later. If they did something like that for regular chats as well (maybe scoped to August instead?), then it’d definitely erase all hits about Sydney AI!
The first LaMDA only used SL. (I’m not sure whether LaMDA 2 or the current version of LaMDA still use SL only.) Meanwhile OpenAI switched from pure SL to SL+RL. Anthropic also uses SL+RL (though no longer RLHF specifically). So apparently SL+RL has proven more effective for fine-tuning than pure SL.
Why SL anyway, why not pure RL? Apparently because you have to get the model first to answer your questions and instructions, rather than just predicting text, before you can reward good responses via RL. (There should be more details in the InstructGPT paper and the more recent Constitutional AI paper.)
Also possible that this could be an early checkpoint of GPT-4. Or a reduced parameter variant.
What is the null hypothesis here? That Microsoft is, through light fine-tuning, optimizing the response of journalists and AI Safety researchers/commentators. The model is designed to give weird responses, so as to make people talk about it.
A Flood of Ideas: The Null Hypothesis of AI Safety with respect to Bing Chat