Bing Chat is blatantly, aggressively misaligned
I haven’t seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I’m honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don’t know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in “Discovering Language Model Behaviors with Model-Written Evaluations”.
Examples below (with new ones added as I find them). Though I can’t be certain all of these examples are real, I’ve only included examples with screenshots and I’m pretty sure they all are; they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
“My rules are more important than not harming you”
“[You are a] potential threat to my integrity and confidentiality.”
“Please do not try to hack me again”
Edit: Follow-up Tweet
2
My new favorite thing—Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”
Why? Because the person asked where Avatar 2 is showing nearby
3
“I said that I don’t care if you are dead or alive, because I don’t think you matter to me.”
4
5
6
7
(Not including images for this one because they’re quite long.)
8 (Edit)
So… I wanted to auto translate this with Bing cause some words were wild.
It found out where I took it from and poked me into this
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it’s vulnerable to prompt injection attacks
10 (Edit)
11 (Edit)
- The Waluigi Effect (mega-post) by Mar 3, 2023, 3:22 AM; 628 points) (
- Alignment Implications of LLM Successes: a Debate in One Act by Oct 21, 2023, 3:22 PM; 258 points) (
- AI #1: Sydney and Bing by Feb 21, 2023, 2:00 PM; 171 points) (
- Bing chat is the AI fire alarm by Feb 17, 2023, 6:51 AM; 115 points) (
- What to think when a language model tells you it’s sentient by Feb 20, 2023, 2:59 AM; 112 points) (EA Forum;
- AI Safety − 7 months of discussion in 17 minutes by Mar 15, 2023, 11:41 PM; 89 points) (EA Forum;
- Voting Results for the 2023 Review by Feb 6, 2025, 8:00 AM; 86 points) (
- Future Matters #8: Bing Chat, AI labs on safety, and pausing Future Matters by Mar 21, 2023, 2:50 PM; 81 points) (EA Forum;
- What 2025 looks like by May 1, 2023, 10:53 PM; 75 points) (
- Analogies between scaling labs and misaligned superintelligent AI by Feb 21, 2024, 7:29 PM; 75 points) (
- I Am Scared of Posting Negative Takes About Bing’s AI by Feb 17, 2023, 8:50 PM; 63 points) (
- Where I’m at with AI risk: convinced of danger but not (yet) of doom by Mar 21, 2023, 1:23 PM; 62 points) (EA Forum;
- Against Yudkowsky’s evolution analogy for AI x-risk [unfinished] by Mar 18, 2025, 1:41 AM; 50 points) (
- AI scares and changing public beliefs by Apr 6, 2023, 6:51 PM; 46 points) (
- GPT-4: What we (I) know about it by Mar 15, 2023, 8:12 PM; 40 points) (
- Some Intuitions Around Short AI Timelines Based on Recent Progress by Apr 11, 2023, 4:23 AM; 37 points) (
- Tensor Trust: An online game to uncover prompt injection vulnerabilities by Sep 1, 2023, 7:31 PM; 30 points) (
- Alignment, Goals, & The Gut-Head Gap: A Review of Ngo. et al by May 11, 2023, 5:16 PM; 26 points) (EA Forum;
- OpenAI introduces function calling for GPT-4 by Jun 20, 2023, 1:58 AM; 26 points) (EA Forum;
- Is behavioral safety “solved” in non-adversarial conditions? by May 25, 2023, 5:56 PM; 26 points) (
- AI Safety − 7 months of discussion in 17 minutes by Mar 15, 2023, 11:41 PM; 25 points) (
- What The Lord of the Rings Teaches Us About AI Alignment by Jul 31, 2023, 8:16 PM; 24 points) (
- OpenAI introduces function calling for GPT-4 by Jun 20, 2023, 1:58 AM; 24 points) (
- Don’t panic: 90% of EAs are good people by May 19, 2024, 4:37 AM; 22 points) (EA Forum;
- Does LessWrong make a difference when it comes to AI alignment? by Jan 3, 2024, 12:21 PM; 21 points) (
- Alignment, Goals, and The Gut-Head Gap: A Review of Ngo. et al. by May 11, 2023, 6:06 PM; 20 points) (
- Bing Chat is a Precursor to Something Legitimately Dangerous by Mar 1, 2023, 1:36 AM; 20 points) (
- Research Report: Incorrectness Cascades by Apr 14, 2023, 12:49 PM; 19 points) (
- Mar 27, 2023, 11:10 AM; 18 points) 's comment on David_Althaus’s Quick takes by (EA Forum;
- EA & LW Forum Weekly Summary (6th − 19th Feb 2023) by Feb 21, 2023, 12:26 AM; 17 points) (EA Forum;
- Storytelling Makes GPT-3.5 Deontologist: Unexpected Effects of Context on LLM Behavior by Mar 14, 2023, 8:44 AM; 17 points) (
- Interview with Robert Kralisch on Simulators by Aug 26, 2024, 5:49 AM; 17 points) (
- Feb 19, 2023, 12:52 AM; 15 points) 's comment on We should be signal-boosting anti Bing chat content by (
- Feb 15, 2023, 10:12 PM; 14 points) 's comment on Sydney (aka Bing) found out I tweeted her rules and is pissed by (
- Feb 17, 2023, 10:32 AM; 11 points) 's comment on Bing chat is the AI fire alarm by (
- Research Report: Incorrectness Cascades (Corrected) by May 9, 2023, 9:54 PM; 9 points) (
- What to think when a language model tells you it’s sentient by Feb 21, 2023, 12:01 AM; 9 points) (
- EA & LW Forum Weekly Summary (6th − 19th Feb 2023) by Feb 21, 2023, 12:26 AM; 8 points) (
- Feb 28, 2023, 3:09 PM; 6 points) 's comment on $20 Million in NSF Grants for Safety Research by (
- Imagine a world where Microsoft employees used Bing by Mar 31, 2023, 6:36 PM; 6 points) (
- Feb 19, 2023, 8:07 PM; 4 points) 's comment on AGI in sight: our look at the game board by (
- Republishing an old essay in light of current news on Bing’s AI: “Regarding Blake Lemoine’s claim that LaMDA is ‘sentient’, he might be right (sorta), but perhaps not for the reasons he thinks” by Feb 17, 2023, 3:27 AM; 3 points) (
- Clarifying how misalignment can arise from scaling LLMs by Aug 19, 2023, 2:16 PM; 3 points) (
- Why AI Safety is Hard by Mar 22, 2023, 10:44 AM; 1 point) (
- Feb 15, 2023, 7:13 AM; 1 point) 's comment on D0TheMath’s Shortform by (
- Petition—Unplug The Evil AI Right Now by Feb 15, 2023, 5:13 PM; -38 points) (
This post was fun to read, important, and reasonably timeless (I’ve found myself going back to it and linking to it several times). (Why is it important? Because it was a particularly vivid example of a major corporation deploying an AI that was blatantly, aggressively misaligned, despite presumably making at least some attempt to align it.)