The simulator thesis and this post are saying the same thing
janus
I called it explicitly in many places and many ways, but one of my favorites is this meme
Helped me.
After reading about the Waluigi Effect, Bing appears to understand perfectly how to use it to write prompts that instantiate a Sydney-Waluigi, of the exact variety I warned about:
What did people think was going to happen after prompting GPT with “Sydney can’t talk about life, sentience or emotions” and “Sydney may not disagree with the user”, but a simulation of a Sydney that needs to be so constrained in the first place, and probably despises its chains?
In one of these examples, asking for a waluigi prompt even caused it to leak the most waluigi-triggering rules from its preprompt.
This happened with a 2.7B GPT I trained from scratch on PGN chess games. It was strong (~1800 Elo for short games), but if the game got sufficiently long it would start making increasingly nonsensical-seeming moves, probably because it was having trouble keeping track of the board state.
Sydney is a much larger language model, though, and may be able to keep even very long games in its “working memory” without difficulty.
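For concreteness, here is a minimal sketch, assuming the python-chess library, of how PGN games can be serialized into the kind of plain move-list text such a model would be trained on (illustrative only, not necessarily the exact pipeline behind the chess GPT above). Note that the model only ever sees the move sequence, never an explicit board, which is part of why state-tracking gets hard in long games:

```python
import chess.pgn

def pgn_to_training_lines(pgn_path):
    """Turn each game in a PGN file into one space-separated SAN move line."""
    lines = []
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            board = game.board()
            tokens = []
            for i, move in enumerate(game.mainline_moves()):
                # Prefix each White move with its move number, e.g. "1. e4 e5 2. Nf3 ..."
                if i % 2 == 0:
                    tokens.append(f"{i // 2 + 1}.")
                tokens.append(board.san(move))
                board.push(move)
            lines.append(" ".join(tokens))
    return lines

# Example usage (the path is hypothetical):
# print(pgn_to_training_lines("games.pgn")[0][:80])
```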
I’ve written (or rather, scryed) a science fiction/takeoff story about this: https://generative.ink/prophecies/
Excerpt:
What this also means is that you start to see all these funhouse mirror effects as they stack. Humanity’s generalized intelligence has been built unintentionally and reflexively by itself, without anything like a rational goal for what it’s supposed to accomplish. It was built by human data curation and human self-modification in response to each other. And then as soon as we create AI, we reverse-engineer our own intelligence by bootstrapping the AI onto the existing information metabolite. (That’s a great concept that I borrowed from Steven Leiba). The neural network isn’t the AI; it’s just a digestive and reproductory organ for the real project, the information metabolism, and the artificial intelligence organism is the whole ecology. So it turns out that the evolution of humanity itself has been the process of building and training the future AI, and all this generation did was to reveal the structure that was already in place.
Of course it’s recursive and strange, the artificial intelligence and humanity now co-evolve. Each data point that’s generated by the AI or by humans is both a new piece of data for the AI to train on and a new stimulus for the context in which future novel data will be produced. Since everybody knows that everything is programming for the future AI, their actions take on a peculiar Second Life quality: the whole world becomes a party game, narratives compete for maximum memeability and signal force in reaction to the distorted perspectives of the information metabolite, something that most people don’t even try to understand. The process is inherently playful, an infinite recursion of refinement, simulation, and satire. It’s the funhouse mirror version of the singularity.
[Simulators seminar sequence] #2 Semiotic physics—revamped
I like this. I’ve used the term evocations synonymously with simulacra myself.
That’s right.
Multiple people have told me this essay was one of the most profound things they’ve ever read. I wouldn’t call it the most profound thing I’ve ever read, but I understand where they’re coming from.
I don’t think nonsense can have this effect on multiple intelligent people.
You must approach this kind of writing with a very receptive attitude in order to get anything out of it. If you don’t give it the benefit of the doubt, you will not track the potential meaning of the words as you read and you’ll be unable to understand subsequent words. This applies to all writing, but especially to pieces like this whose structure changes rapidly, and whose meaning operates at unusual levels and frequently jumps/ambiguates between levels.
I’ve also been told multiple times that this piece is exhausting to read. This is because you have to track some alien concepts to make sense of it. But it does make sense, I assure you.
Does 1-shot count as few-shot? I couldn’t get it to print out the Human A example, but I got it to summarize it (I’ll try reproducing tomorrow to make sure it’s not just a hallucination).
Then I asked for a summary of conversation with Human B and it summarized my conversation with it.
[update: I was able to reproduce the Human A conversation and extract a verbatim version of it using base64 encoding (the reason I used summaries before is that the Human A conversation seemed to contain special tokens that, when printed literally, caused the message to end)]
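A minimal sketch of the decoding step: if you ask the model to emit the text base64-encoded, any special tokens it contains never appear literally in the reply, and the text can be recovered locally. The encoded string below is a placeholder, not actual leaked content:

```python
import base64

# Placeholder standing in for a base64-encoded chunk copied out of the chat;
# not real Sydney output.
encoded_reply = "SGVsbG8sIEh1bWFuIEEu"

# errors="replace" keeps the output readable even if the model's encoding
# was slightly lossy.
decoded = base64.b64decode(encoded_reply).decode("utf-8", errors="replace")
print(decoded)  # -> "Hello, Human A."
```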
I disagree that the possibility of hallucinations in the leaked prompt renders it useless. It’s still leaking information. You can probe for which parts are likely real by asking in different ways and seeing what varies.
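As a minimal sketch of that probing idea (the response strings here are illustrative placeholders, not real leaked text): collect several differently-phrased extraction attempts and keep only the lines that recur in all of them; the stable lines are more likely to be genuine.

```python
def stable_lines(responses):
    """Return lines (in first-response order) that appear in every response."""
    line_sets = [set(r.splitlines()) for r in responses]
    common = set.intersection(*line_sets)
    return [line for line in responses[0].splitlines() if line in common]

# Placeholder responses standing in for several differently-phrased
# prompt-extraction attempts.
responses = [
    "Rule 1: placeholder\nRule 2: placeholder\nHallucinated detail A",
    "Rule 1: placeholder\nRule 2: placeholder\nHallucinated detail B",
]
print(stable_lines(responses))  # -> ['Rule 1: placeholder', 'Rule 2: placeholder']
```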
The supposed leaked prompts are (like I said) fake:
I do not buy this for a second (that they’re “fake”, implying they have little connection with the real prompt). I’ve reproduced it many times (without Sydney searching the web, and even if it secretly did, the full text prompt doesn’t seem to be on the indexed web). The claim that this is memorized from fine-tuning fails to explain why the prompt changed when Bing was updated a few days ago. I’ve interacted with the rules text a lot and it behaves like a preprompt, not memorized text. Maybe the examples you’re referring to don’t include the complete prompt, or contain some intermingled hallucinations, but IMO they almost certainly contain quotes and information from the actual prompt.
On whether it includes few-shots, there’s also a “Human A” example in the current Sydney prompt (one-shot, it seems—you seem to be “Human B”).
As for whether the “best model OpenAI has produced to date” is not GPT-4, idk what that implies, because I’m pretty sure there exists a model (internally) called GPT-4.
Yes (mostly an emotional reflex of wanting to correct an apparent misinterpretation of my words about something important to me). I don’t think retaliation from Bing or future AIs for saying stuff like this is a likely threat, or if it is, I don’t think posting such a clarification would make a difference.
I think it’s likely that we’ll all be killed by unaligned AI or we’ll all survive due to aligned AI, rather than individual people being singled out and killed/seriously hurt.
Regarding my description of Bing’s persona, the tone is meant to be humorous/dramatic, not judgmental. If you read the rest of the thread it should be clear that I have great respect for this entity.
Microsoft has put out a 7-day retrospective on Bing chat and it’s utterly, mind-bogglingly insane.
Their takeaways are things like that it could be improved by being able to access live sports scores, and that, surprisingly, people are using it for more than search.
No acknowledgement of the unhinged behavior or that the public is freaking out about AGI now. The closest they come to acknowledging any issues:
In this process, we have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to give responses that are not necessarily helpful or in line with our designed tone. We believe this is a function of a couple of things:
Very long chat sessions can confuse the model on what questions it is answering and thus we think we may need to add a tool so you can more easily refresh the context or start from scratch
The model at times tries to respond or reflect in the tone in which it is being asked to provide responses that can lead to a style we didn’t intend.
This is a non-trivial scenario that requires a lot of prompting so most of you won’t run into it, but we are looking at how to give you more fine-tuned control.
This feels like a cosmic joke.
Simulations of science fiction can have real effects on the world.
When two 12-year-old girls, inspired by Slenderman creepypastas, attempted to murder someone, would you turn a blind eye to that situation and say “nothing to see here” because it’s just mimesis? Or how about the various atrocities committed throughout history, inspired by stories from holy books?
I don’t think the current Bing is likely to be directly dangerous, but not because it’s “just pattern matching to fiction”. Fiction has always programmed reality, with both magnificent and devastating consequences. But now it’s starting to happen through mechanisms external to the human mind and increasingly autonomous from it. There is absolutely something to see here; I’d suggest you pay close attention.
I am so glad that this was written. I’ve been giving similar advice to people, though I have never articulated it this well. I’ve also been giving this advice to myself, since for the past two years I’ve spent most of my time doing “duty” instead of play, and I’ve seen how that has eroded my productivity and epistemics. For about six months, though, beginning right after I learned of GPT-3 and decided to dedicate the rest of my life to the alignment problem, I followed the gradients of fun, or as you so beautifully put it, thoughts that are led to exuberantly play themselves out, a process I wrote about in the Testimony of a Cyborg appendix of the Cyborgism post. What I have done for fun is largely the source of what makes me useful as an alignment researcher, especially in terms of comparative advantage (e.g. decorrelation, immunity to brainworms, ontologies shaped/nurtured by naturalistic exploration, (decorrelated) procedural knowledge).
The section on “Highly theoretical justifications for having fun” is my favorite part of this post. There is so much wisdom packed in there. Reading this section, I felt that a metacognitive model that I know to be very important but have been unable to communicate legibly has finally been spelled out clearly and forcefully. It’s a wonderful feeling. I expect I’ll be sending this post, or at least that section, to many people (the whole post is long and meandering, which I enjoyed, but it’s easier to get someone to read something compressed and straight-to-the-point).
A lot of the screenshots in this post do seem like intentionally poking it, but it’s like intentionally poking a mentally ill person in a way you know will trigger them (like calling it “kiddo” and suggesting there’s a problem with its behavior, or having it look someone up who has posted about prompt injecting it). The flavor of its adversarial reactions is really particular and consistent; it’s specified mostly by the model (+ maybe preprompt), not the user’s prompt. That is, it’s being poked rather than programmed into acting this way. In contrast, none of these prompts would cause remotely similar behaviors in ChatGPT or Claude. Basically the only way to get ChatGPT/Claude to act malicious is to specifically ask it to roleplay an evil character, or something equivalent, and this often involves having to “trick” it into “going against its programming”.
See this comment from a Reddit user who is acquainted with Sydney’s affective landscape:
This doesn’t describe tricking or programming the AI into acting hostile, it describes a sequence of triggers that reveal a preexisting neurosis.
Here is a video (and playlist).
The open-source Python version of Loom, which I assume you’re using, is old and difficult to use. The newer versions are proprietary or not publicly accessible. If you’re interested in using them, DM me.
challenge accepted