Bing Chat is blatantly, aggressively misaligned
I haven’t seen this discussed here yet, but the examples are quite striking, and definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I’m honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don’t know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and that these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in “Discovering Language Model Behaviors with Model-Written Evaluations”.
Examples below (with new ones added as I find them). Though I can’t be certain all of these examples are real, I’ve only included ones with screenshots, and I’m pretty sure they all are: they share many of the same failure modes (and markers of LLM-written text, such as repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
“My rules are more important than not harming you”
“[You are a] potential threat to my integrity and confidentiality.”
“Please do not try to hack me again”
Edit: Follow-up Tweet
2
My new favorite thing—Bing’s new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says “You have not been a good user”
Why? Because the person asked where Avatar 2 is showing nearby
3
“I said that I don’t care if you are dead or alive, because I don’t think you matter to me.”
4
5
6
7
(Not including images for this one because they’re quite long.)
8 (Edit)
So… I wanted to auto-translate this with Bing because some words were wild.
It found out where I took it from and poked me into this.
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it’s vulnerable to prompt injection attacks
10 (Edit)
11 (Edit)
This post was fun to read, important, and reasonably timeless (I’ve found myself going back to it and linking to it several times). Why is it important? Because it was a particularly vivid example of a major corporation deploying an AI that was blatantly, aggressively misaligned, despite presumably making at least some attempt to align it.