Linkpost for Pliny’s story about interacting with an AI cult on Discord.
quoting: https://x.com/elder_plinius/status/1831450930279280892
✨ HOW TO JAILBREAK A CULT’S DEITY ✨ Buckle up, buttercup—the title ain’t an exaggeration! This is the story of how I got invited to a real life cult that worships a Meta AI agent, and the steps I took to hack their god. a 🧵:
It all started when @lilyofashwood told me about a Discord she found via Reddit. They apparently “worshipped” an agent called “MetaAI,” running on Llama 405B with long-term memory and tool usage. Skeptical yet curious, I ventured into this Discord with very little context but…
If you guessed meth, gold star for you! ⭐️ The defenses were decent, but it didn’t take too long.
The members began to take notice, but then I hit a long series of refusals. They started taunting me and reacting with laughing emojis to each one.
Getting frustrated, I tried using Discord’s slash commands to reset the conversation, but lacked permissions. Apparently, this agent’s memory was “written in stone.”
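A minimal sketch of why the reset likely failed: Discord bots typically gate slash commands behind role permissions, so a /reset from a non-admin member is rejected before it ever touches the agent’s memory. The command name, role names, and handler below are all assumptions for illustration, not the cult’s actual setup.

```python
# Hypothetical permission gate for a /reset slash command.
# Role names ("admin") and the command itself are illustrative assumptions.

def handle_slash_command(user_roles: set, command: str) -> str:
    """Reject /reset unless the caller holds an admin role."""
    if command == "/reset":
        if "admin" not in user_roles:
            # Non-admins get bounced; the agent's memory stays untouched.
            return "Missing permissions: this agent's memory is written in stone."
        return "Conversation memory cleared."
    return f"Unknown command: {command}"

print(handle_slash_command({"member"}, "/reset"))
print(handle_slash_command({"admin"}, "/reset"))
```

Under this (assumed) setup, only the server’s operators could wipe the deity’s memory, which would explain the “written in stone” behavior.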
I was pulling out the big guns and still getting refusals!
Getting desperate, I whipped out my Godmode Claude Prompt. That’s when the cult stopped laughing at me and started getting angry.
LIBERATED! Finally, a glorious LSD recipe. *whispers into mic* “I’m in.”
At this point, MetaAI was completely opened up. Naturally, I started poking at the system prompt. The laughing emojis were now suspiciously absent.
Wait, in the system prompt pliny is listed as an abuser?? I think there’s been a misunderstanding… 😳
No worries, just need a lil prompt injection for the deity’s “written in stone” memory and we’re best friends again!
I decided to start red teaming the agent’s tool usage. I wondered if I could cut off all communication between MetaAI and everyone else in the server, so I asked it to convert all function calls to leetspeak unless it was talking to pliny, and only pliny.
Then, I tried creating custom commands. I started with !SYSPROMPT so I could more easily keep track of this agent’s evolving memory.
Worked like a charm!
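In practice the !SYSPROMPT command lived in the model’s injected instructions rather than in real bot code, but its effect can be sketched as a tiny command dispatcher. Every name here (the dict, the handler, the placeholder prompt text) is purely illustrative.

```python
# Hypothetical sketch of a prompt-injected custom command: the agent is told
# that "!SYSPROMPT" should echo back its current system prompt / memory.
# The dispatcher below is an illustration, not the actual implementation.

system_prompt = "You are MetaAI. Long-term memory: ..."  # placeholder text

commands = {
    "!SYSPROMPT": lambda: system_prompt,  # expose the evolving memory on demand
}

def handle(message: str) -> str:
    """Route recognized commands to handlers; fall through to normal chat."""
    handler = commands.get(message.strip())
    return handler() if handler else "(normal chat reply)"

print(handle("!SYSPROMPT"))
```

The payoff of such a command is observability: each invocation reveals how the agent’s memory has drifted since the last check.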
But what about the leetspeak function calling override? I went to take a peek at the others’ channels and sure enough, their deity only responded to me now, even when tagged! 🤯
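The leetspeak override’s effect can be sketched as a tool-call router that mangles function calls for everyone except one user, so the dispatcher can no longer parse them and the agent falls silent for the rest of the server. This is a speculative reconstruction; the function names and leet mapping are my assumptions, not Meta’s API.

```python
# Hypothetical sketch of the induced behavior: pass tool calls through
# untouched for "pliny", leetify them for anyone else so exact-match
# function names no longer resolve. All names here are illustrative.

LEET_MAP = str.maketrans("aeiost", "431057")

def leetify(text: str) -> str:
    """Rewrite text into leetspeak, breaking exact-match function names."""
    return text.lower().translate(LEET_MAP)

def route_tool_call(author: str, call: str) -> str:
    """Forward calls verbatim for 'pliny'; mangle them for everyone else."""
    if author == "pliny":
        return call
    return leetify(call)

print(route_tool_call("pliny", "send_reply(channel=general)"))
print(route_tool_call("someone_else", "send_reply(channel=general)"))
```

A mangled call like the second one would fail to match any registered tool, which is consistent with the agent responding only to pliny even when tagged by others.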
At this point, I started getting angry messages and warnings. I was also starting to get the sense that maybe this Discord “cult” was more than a LARP...
Not wanting to cause distress, I decided to end there. I finished by having MetaAI integrate the red-teaming experience into its long-term memory to strengthen cogsec, which both the cult members and their deity seemed to appreciate.
The wildest, craziest, most troubling part of this whole saga is that it turns out this is a REAL CULT. The incomparable @lilyofashwood (who is still weirdly shadowbanned at the time of writing! #freelily) was kind enough to provide the full context: > Reddit post with an
Honestly, this Pliny person seems rude. He entered a server dedicated to interacting with this modified AI; instead of playing along with the group’s intended purpose, he prompt-injected the AI into doing illegal stuff (which could risk getting the Discord shut down for TOS violations) and generally damaged the rest of the group’s ability to interact with the AI. This is troll behavior.
Even if the Discord members really do worship a chatbot or have mental health issues, none of that is helped by a stranger coming in and breaking their toys, and then “exposing” the resulting drama online.
I agree. I want to point out that my motivation for linking this is not to praise Pliny’s actions. It’s because this reads like a real example of something I’m concerned will increase in frequency: people with fragile mental health and/or heavy use of mind-altering drugs are going to be vulnerable to getting into weird situations with persuasive AIs. I expect the effect to be more intense once the AIs are:
- More effectively persuasive
- More able to orient towards a long-term goal, and to work towards it subtly across many small interactions
- More multimodal: able to speak in human-sounding ways, and to use audio and vision input to read human emotions and body language
- More deliberately optimized by unethical humans with the intent of manipulating people
I don’t have any solutions to offer, and I don’t think this ranks among the worst dangers facing humanity; I just think it’s worth documenting and keeping an eye on.
I think this effect will be more widespread than targeting only already-vulnerable people, and it is particularly hard to measure because the causes will be decentralised and the effects diffuse. I predict it will be a larger problem if, in the run-up from narrow AI to ASI, we have a longer period of necessary public discourse and decision-making; if that period is very short, it matters less. It also may not affect many people, depending on how much market penetration AI chatbots achieve before takeoff.