… the animatronics are very unstable and constantly shifting forms. When you start looking at one, it begins changing, and you can never grasp it clearly.
On theoretical grounds, I would, as I described in the post, expect an animatronic to come more and more into focus as more context is built up of things it has done and said (and I was rather happy with the illustration I got that had one more detailed than the other).
Of course, if you are using an LLM that has a short context length and continuing a conversation for longer than that, so that it only recalls the most recent part of the conversation as context, or if your LLM nominally has a long context but isn’t actually very good at remembering things from some way back in a long context, then one would get exactly the behavior you describe. I have added a section to the post describing this behavior and when it is to be expected.
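To make the mechanism concrete, here’s a minimal sketch (all names hypothetical, with a toy character-based limit standing in for a real token budget) of the naive sliding-window prompt construction that produces this drift: once the conversation outgrows the window, the early messages that established the character are silently dropped, and the model must re-infer the animatronic from recent turns alone.

```python
# Toy stand-in for a model's context limit (real systems count tokens).
MAX_CONTEXT_CHARS = 200

def build_prompt(history):
    """Keep only the most recent messages that fit in the window."""
    kept, used = [], 0
    for msg in reversed(history):
        if used + len(msg) > MAX_CONTEXT_CHARS:
            break
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))

history = [
    "Narrator: Ada is a cautious, soft-spoken botanist.",  # early characterization
]
history += [f"Turn {i}: small talk about the weather." for i in range(10)]

prompt = build_prompt(history)
# The establishing line now falls outside the window, so the character
# must be re-inferred from recent turns only -- hence the instability.
print(history[0] in prompt)
```

Under this sketch’s assumptions, the establishing line is gone from the prompt after enough turns, which is exactly the “never comes into focus” behavior described above.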
Fair comment — I’d already qualified that with “In some sense…”, but you convinced me, and I’ve deleted the phrase.
I agree that they are learned from human misalignment, but I am not sure this necessarily means they are the same (or similar). For example, …
Also a good point. I think being able to start from a human-like framework is usually helpful (and I have a post I’m working on about this), but one definitely needs to remember that the animatronics are low-fidelity simulations of humans, with some fairly un-human-like failure modes and some capabilities that humans don’t have individually, only collectively (like being hypermultilingual). Mostly I wanted to make the point that their behavior isn’t as wide-open/unknown/alien as people on LW tend to assume of agents they’re trying to figure out how to align.
The stage even understands that each of the animatronics also has theory of mind, and each is attempting to model the beliefs and intentions of all of the others, not always correctly.
I am a bit skeptical of this. I am not sure I believe that there really are two detached “minds”, one for each animatronic, that try to understand each other (but if this is true, it would be an argument for my first point above).
As I recall, GPT-4 scored on theory of mind tests at a level roughly equivalent to a typical human 8-year-old. So it has the basic ideas, and should generally get details right, but may well not be very practiced at this — certainly less so than a typical human adult, let alone someone like an author, detective, or psychologist who works with theory of mind a lot. So yes, as I noted, this currently holds only to a first approximation, but I’d expect it to improve in future, more powerful LLMs. Theory of mind might also be an interesting thing to try to specifically enrich the pretraining set with.
I like to think of the puppeteer as a meta-simulacrum. The Simulator is no longer simulating X, but is simulating Y simulating X.
Interesting, and yes, that’s true any time you have separate animatronics of the author and fictional characters, such as the puppeteer and a typical assistant. I look forward to reading your post on this.