Good job, I like the post! I also like this metaphor of the stage and the animatronics. One thing I would like to point out with this metaphor is that the animatronics are very unstable and constantly shifting forms. When you start looking at one, it begins changing, and you can’t ever grasp them clearly. I feel this aspect is somewhat missing in the metaphor (you do point this out later in the post and explain it quite well, but I think it’s somewhat incompatible with the metaphor). It’s a bit easier with chat models, because they are incentivized to simulate animatronics that are somewhat stable. The art of jailbreaking (and especially of persona modulation) is understanding how to use the dynamics of the stage to influence the form of the animatronics.
Some other small comments and thoughts I had while reading through the post (It’s a bit lengthy, so I haven’t read everything in great detail, sorry if I missed some points):
I think this isn’t that clear. I think the “stochastic parrot” question is somewhat linked to the internal representation of concepts and their interactions within this abstract concept of “reality” (the definition in the link says: “for haphazardly stitching together sequences of linguistic forms … according to probabilistic information about how they combine, but without any reference to meaning.”). I do think that simply predicting the next token could lead, at some point and if it’s smart enough, to building an internal representation of concepts and how they relate to each other (actually, this might already be happening with gpt4-base, as we can kind of see in gpt4-chat, and I don’t think this is something that appears during instruct fine-tuning).
The only silver lining here is that their unalignment is a pretty close copy of all the same problems as human unalignment (sometimes writ large for the fictional characters): problems that we’re very familiar with, have an intuitive understanding of, and (outside fictional characters) even have fairly workable solutions for (things like love, salaries, guilt, and law enforcement).
I agree that they are learned from human misalignment, but I am not sure this necessarily means they are the same (or similar). For example, there might be some weird, infinite-dimensional function in the way we are misaligned, and the AI picked up on it (or at least an approximate version) and is able to portray all the flavors of “misalignment” that were never seen in humans yet, or even go out of distribution on the misalignment in weird character dimensions and simulate something completely alien to us. I believe that we are going to see some pretty unexpected stuff happening when we start digging more here. One thing to point out, though, is that all the ways those AIs could be misaligned are probably related (some in probably very convoluted ways) to the way we are misaligned in the training data.
The stage even understands that each of the animatronics also has theory of mind, and each is attempting to model the beliefs and intentions of all of the others, not always correctly.
I am a bit skeptical of this. I am not sure I believe that there really are two detached “minds”, one for each animatronic, that try to understand each other (but if this is true, it would be an argument for my first point above).
we are missing the puppeteer
(This is more of a thought than a comment.) I like to think of the puppeteer as a meta-simulacrum. The Simulator is no longer simulating X, but is simulating Y simulating X. One of the dangers I see in instruct fine-tuning is that the model might collapse to simulating only one Y no matter what X it simulates, and the only thing we more or less control with current training methods is which X we want it to simulate. We would basically be leaving Y to chance, to be decided by whatever the training dynamics happen to produce, and would just have to cross our fingers that this Y isn’t a misaligned AI (which might actually be something incentivized by current training). I am going to try to write a short post about that.
P.S. I think it would be worth it to have some kind of TL;DR at the top with the main points clearly fleshed out.
… the animatronics are very unstable and constantly shifting forms. When you start looking at one, it begins changing, and you can’t ever grasp them clearly.
On theoretical grounds, I would, as I described in the post, expect an animatronic to come more and more into focus as more context is built up of things it has done and said (and I was rather happy with the illustration I got that had one more detailed than the other).
Of course, if you are using an LLM with a short context length and continue a conversation for longer than that, so that it only sees the most recent part of the conversation as context, or if your LLM nominally has a long context but isn’t actually very good at recalling things from far back in it, then one would get exactly the behavior you describe. I have added a section to the post describing this behavior and when I would expect it.
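To make the short-context case concrete, here is a toy sketch of my own (not any real LLM API): `visible_context` is a hypothetical helper, whitespace-separated words stand in for tokens, and the conversation is invented. It only shows how the earliest, persona-defining turns are the first to drop out of what the model can see.

```python
def visible_context(turns, max_tokens):
    """Return the most recent turns that fit within max_tokens.

    Assumption: whitespace-separated words stand in for tokens; a real
    tokenizer would count differently.
    """
    kept = []
    used = 0
    # Walk backwards from the newest turn, stopping once the budget is spent.
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    # Restore chronological order.
    return list(reversed(kept))


# Invented example conversation; the first turn establishes the character.
conversation = [
    "Narrator: Ada is a cautious engineer who never swears.",
    "User: Ada, what do you think of the bridge design?",
    "Ada: I would want to see the load calculations first.",
    "User: And the deadline?",
    "Ada: Deadlines matter less than safety margins.",
]

window = visible_context(conversation, max_tokens=20)
```

With a budget of 20 "tokens", `window` holds only the last two turns, so the narrator's description of Ada is no longer in view and nothing in the remaining context pins the character down. With a large enough budget the whole conversation is kept and the character stays anchored.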
Fair comment — I’d already qualified that with “In some sense…”, but you convinced me, and I’ve deleted the phrase.
I agree that they are learned from human misalignment, but I am not sure this necessarily means they are the same (or similar). For example, …
Also a good point. I think being able to start from a human-like framework is usually helpful (and I have a post in progress on this), but one definitely needs to remember that the animatronics are low-fidelity simulations of humans, with some fairly un-human-like failure modes and some capabilities that humans don’t have individually, only collectively (like being hypermultilingual). Mostly I wanted to make the point that their behavior isn’t as wide open/unknown/alien as people on LW tend to assume of agents they’re trying to figure out how to align.
The stage even understands that each of the animatronics also has theory of mind, and each is attempting to model the beliefs and intentions of all of the others, not always correctly.
I am a bit skeptical of this. I am not sure I believe that there really are two detached “minds”, one for each animatronic, that try to understand each other (but if this is true, it would be an argument for my first point above).
As I recall, GPT-4 scored on theory of mind tests at a level roughly equivalent to a typical human 8-year-old. So it has the basic ideas, and should generally get details right, but may well not be very practiced at this: certainly less so than a typical human adult, let alone someone like an author, detective, or psychologist who works with theory of mind a lot. So yes, as I noted, this currently holds only to a first approximation, but I’d expect it to improve in future, more powerful LLMs. Theory of mind might also be an interesting thing to specifically enrich the pretraining set with.
I like to think of the puppeteer as a meta-simulacrum. The Simulator is no longer simulating X, but is simulating Y simulating X.
Interesting, and yes, that’s true any time you have separate animatronics of the author and fictional characters, such as the puppeteer and a typical assistant. I look forward to reading your post on this.