I much appreciated your point “When conducting experiments with LLMs, it’s vital to distinguish between two aspects: the properties of the LLM as a predictor/simulator, and the characteristics of a pattern that is inferred from the context.” I agree that’s a very important distinction, and, as you said, one far too many people trip up on.
I’d actually go one step further: I don’t think I’d describe the aspect that has “the properties of the LLM as a predictor/simulator” using the word ‘mind’ at all — not even ‘alien mind’. The word ‘mind’ carries a bunch of in-this-case-misleading connotations, ones along the lines of the way the word ‘agent’ is widely used on LW: that the system has goals and attempts to affect its environment to achieve them. Anything biological with a nervous system that one might use the word ‘mind’ for is also going to have goals and attempt to affect its environment to achieve them (for obvious evolutionary reasons). However, the nearest thing to a ‘goal’ that aspect of an LLM has is attempting to predict the next token as accurately as it can, and it isn’t going to try to modify its environment to make that happen, or even to make it easier. So it isn’t a ‘mind’ at all — that isn’t an appropriate word. It’s just a simulator model.

However, the second aspect, the simulations it runs, are of minds, plural: an entire exquisitely-context-dependent probability distribution of minds, a different one every time you run even the same prompt (except perhaps at temperature 0). Specifically, simulations of the token-generation processes of (almost always) human minds: either an individual author, or the final product of a group of authors and editors working on a single text, or an author who is themselves simulating a fictional character. And the token generation behaviors of these will match those of humans, as closely as the model can make them, i.e. to the extent that the training of the LLM has done its job and that its architecture and capacity allow. When it doesn’t match, the errors, on the other hand, tend to look very different from human errors, since the underlying neural architecture is very different: this is a simulation but not an emulation. (However, to the extent that the simulation succeeds, it will also simulate human errors and frailties, so then you get both sorts.)
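(To make the temperature-0 aside concrete, here is a minimal, purely illustrative sketch, not any particular model's API: sampling from temperature-scaled logits gives a different draw from the distribution on every run, while temperature 0 collapses it to a single deterministic continuation.)

```python
import numpy as np

def sample_next_token(logits, temperature, rng):
    """Toy next-token sampler over temperature-scaled logits."""
    if temperature == 0:
        # Greedy decoding: the same prompt always yields the same continuation.
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    # Stochastic: each call draws a fresh sample, i.e. a different "mind".
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.2])          # made-up scores for a 3-token vocabulary
print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # [0, 0, 0, 0, 0]
print([sample_next_token(logits, 1.0, rng) for _ in range(5)])  # samples vary from the softmax distribution
```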
So I would agree that we have two different aspects here, but the first is too different for me to be willing to use the word ‘mind’ for it, and is contextually generating the second, which is as human as the machine was able to make it. So I’d say the use of the word ‘anthropomorphism’ for the second aspect was inappropriate: that’s like calling a photograph anthropomorphic: of course it is; that’s the entire point of photography, to make human-like shapes. Now, the simulations are not perfect: the photographs are blurred. The inaccuracies are complex and interesting to study. But here I think the word ‘alien’ is too strong. A better metaphor might be something like an animatronic doll: an intentional but still imperfect replica of a human.
So, if you did a lot of psychological studies on these simulated minds, and eventually showed that they had a lot of psychological phenomena in common with humans, my reaction would be “big deal, what a waste, that was obviously going to happen if the accuracy of the replica was good enough!” So I think it’s more interesting to study when and where the replica fails, and how. But that’s a subject that, as LLMs get better, is going to both change and decrease in frequency, or the threshold for triggering it will go up.
Thank you for your insightful comment. I appreciate the depth of your analysis and would like to address some of the points you raised, adding my thoughts around them.
> I don’t think I’d describe the aspect that has “the properties of the LLM as a predictor/simulator” using the word ‘mind’ at all — not even ‘alien mind’. The word ‘mind’ carries a bunch of in-this-case-misleading connotations, ones along the lines of the way the word ‘agent’ is widely used on LW: that the system has goals
This is a compelling viewpoint. However, I believe that even if we consider LLMs primarily as predictors or simulators, this doesn’t necessarily preclude them from having goals, albeit ones that might seem alien compared to human intentions. A base model’s goal is focused on next-token prediction, which is straightforward, but chat models’ goals aren’t as clear-cut. They are influenced by a variety of obfuscated rewards, and this is one of the main things I want to study with LLM Psychology.
> and it isn’t going to try to modify its environment to make that happen, or even to make it easier
With advancements in online or semi-online training (being trained back on their own outputs), we might actually see LLMs interacting with and influencing their environment in pursuit of their goals, even more so if they manage to distinguish between training and inference. I mostly agree with you here for current LLMs (though I have some reasonable doubts about SOTA models), but I don’t think it will hold true for much longer.
> It’s just a simulator model
While I understand this viewpoint, I believe it might be a bit reductive. The emergence of complex behaviors from simple rules is a well-established phenomenon, as seen in evolution. LLMs, while initially designed as simulators, might (and I would argue do) exhibit behaviors and cognitive processes that go beyond their original scope (e.g. extracting training data by putting the simulation out of distribution).
> the second aspect, the simulations it runs, are of minds, plural
This is an interesting observation. However, the act of simulation by LLMs doesn’t necessarily negate the possibility of them possessing a form of ‘mind’. To illustrate, consider our own behavior in different social contexts—we often simulate different personas (with your family vs. talking to an audience), yet we still consider ourselves as having a singular mind. This is the point of considering the LLM as an alien mind. We need to understand why they simulate characters, with which properties, and for which reasons.
> And the token generation behaviors of these will match those of humans, as closely as the model can make them
Which humans, and in what context? Specifically, we have no clue what is simulated in which context, and for which reasons. And this doesn’t seem to improve with growing size; if anything, it becomes even more obfuscated. The rewards and dynamics of the training are totally alien. It is really hard to control what should happen in any given situation. If you try to just mimic humans as closely as possible, that might be a very bad idea (super-powerful humans aren’t that aligned with humanity). If you are trying to aim at something other than human, then we have no clue how to have fine-grained control over this. For me, the main goal of LLM psychology is to understand the cognition of LLMs—when and why they do what they do in which context—as fast as possible, and then to study how training dynamics influence this. Ultimately this could help us have a clearer idea of how to train these systems, what they are really doing, and what they are capable of.
> When it doesn’t match, the errors, on the other hand, tend to look very different from human errors
This observation underscores the importance of studying the simulator ‘mind’ and not just the simulated minds. The unique nature of these errors could provide valuable insights into the cognitive mechanisms of LLMs, distinguishing them from mere human simulators.
> A better metaphor might be something like an animatronic doll: an intentional but still imperfect replica of a human
I see your point. However, both base and chat models, in my view, are more akin to what I’d describe as an ‘animatronic metamorph’ that morphs with its contextual surroundings. This perspective aligns with our argument that people often ascribe overly anthropomorphic qualities to these models, underestimating their dynamic nature. They are not static entities; their behavior and ‘shape’ can be significantly influenced by the context they are placed in (I’ll demonstrate this later in the sequence). Understanding this morphing ability and the influences behind it is a key aspect of LLM psychology.
> studies on these simulated minds, and eventually showed that they had a lot of psychological phenomena in common with humans, my reaction would be ‘big deal, what a waste, that was obviously going to happen if the accuracy of the replica was good enough!’
Your skepticism here is quite understandable. The crux of LLM psychology isn’t just to establish that LLMs can replicate human-like behaviors—which, as you rightly point out, could be expected given sufficient fidelity in the replication. Rather, our focus is on exploring the ‘alien mind’: the underlying cognitive processes and training dynamics that govern these replications. By delving into these areas, we aim to uncover not just what LLMs can mimic, but how and why they do so in varying contexts.
> So I think it’s more interesting to study when and where the replica fails, and how. But that’s a subject that, as LLMs get better, is going to both change and decrease in frequency, or the threshold for triggering it will go up.
This is a crucial point. Studying the ‘failures’ or divergences of LLMs from human-like responses indeed offers a rich source of insight, but I am not sure we will see fewer of them soon. I think that “getting better” is not correlated with “getting bigger”, and that actually current models aren’t getting better at all (in the sense of having more understandable behaviors with respect to their training, being harder to jailbreak, or even being harder to make them do something that goes against what we thought was a well-designed reward). I would even argue that there are more and more interesting things to discover with bigger systems.
The correlation I see is between “getting better” and “how much do we understand what we are shaping, and how it is shaped” – which is the main goal of LLM psychology.
Thanks again for the comment. It was really thought-provoking, and I am curious to see what you think about these answers.
P.S. This answer reflects only my own views. Also, sorry for the repetitions in some points; I had a hard time removing all of them.
> This is a compelling viewpoint. However, I believe that even if we consider LLMs primarily as predictors or simulators, this doesn’t necessarily preclude them from having goals, albeit ones that might seem alien compared to human intentions. A base model’s goal is focused on next-token prediction, which is straightforward, but chat models’ goals aren’t as clear-cut. They are influenced by a variety of obfuscated rewards, and this is one of the main things I want to study with LLM Psychology.
Agreed, but I would model that as a change in the distribution of minds (the second aspect) simulated by the simulator (the first aspect). While in production LLMs that change is generally fairly coherent/internally consistent, it doesn’t have to be: you could change the distribution to be more bimodal, say to contain more “very good guys” and also more “very bad guys” (indeed, due to the Waluigi effect, that does actually happen to an extent). So I’d model that as a change in the second aspect, adjusting the distribution of the “population” of minds.
> and it isn’t going to try to modify its environment to make that happen, or even to make it easier
> … I don’t think it will hold true for much longer.
Completely agreed. But when that happens, it will be some member(s) of the second aspect that are doing it. The first aspect doesn’t have goals, or any model that there is an external environment or anything other than tokens (except insofar as it can simulate the second aspect, which does).
> the second aspect, the simulations it runs, are of minds, plural
> This is an interesting observation. However, the act of simulation by LLMs doesn’t necessarily negate the possibility of them possessing a form of ‘mind’. To illustrate, consider our own behavior in different social contexts—we often simulate different personas (with your family vs. talking to an audience), yet we still consider ourselves as having a singular mind. This is the point of considering the LLM as an alien mind. We need to understand why they simulate characters, with which properties, and for which reasons.
The simulator can simulate multiple minds at once. It can write fiction in which two or more characters are having a conversation, or trying to kill each other, or otherwise interacting, including each having a “theory of mind” for the other one. Now, I can do that too: I’m also a fiction author in my spare time, so I can mentally model multiple fictional characters at once, talking or fighting or plotting or whatever. I’ve practiced this, but most humans can do it pretty well. (It’s a useful capability for a smart language-using social species, and at a neurological level, perhaps mirror neurons are involved.)
However, in this case the simulator isn’t, in my view, even something that the word ‘mind’ is a helpful fit for; it’s more like a single-purpose automated system. But the personas it simulates are pretty-realistically human-like. And the fact that there is a wide, highly contextual distribution of them, from which samples are drawn, and that more than one sample can be drawn at once, is an important key to understanding what’s going on, and yes, that’s rather weird, even alien (except perhaps to authors). So you could do psychological experiments that involve the LLM simulating two or more personas arguing, or otherwise at cross purposes. (And if it had been RLHF-trained, they might soon come to agree on something rather PC.)
To provide an alternative to the shoggoth metaphor: the first aspect of LLMs is like a stage that automatically and contextually generates the second aspect, human-like animatronic actors, one or more of them, who then monologue or interact, among themselves and/or with external real-human input like a chat dialog.
Instruct training plus RLHF makes it more likely that the stage will generate one actor, one who fits the “polite assistant who helpfully answers questions and obeys requests, but refuses harmful/inappropriate-looking requests in a rather PC way” descriptor and will just dialog with the user, without other personas spontaneously appearing (which can happen quite a bit with base models, which often simulate things like chat rooms/forums). But the system is still contextual enough that you can prompt it to generate other actors who will show other behaviors. And that’s why aligning LLMs is hard, despite them being so contextual and thus easy to direct: the user and random text are also part of the context.
How much have you played around with base models, or models that were instruct trained to be helpful but not harmless? Only smaller open-source ones are easily available, but if you haven’t tried it, I recommend it.
I did spend some time with base models and helpful-but-not-harmless assistants (even though most of my current interactions are with ChatGPT-4), and I agree with your observations and comments here.
Although I feel like we should be cautious about the difference between what we think we observe and what is actually happening. This stage-and-human-like-animatronics metaphor is good, but we can’t really distinguish yet whether there is only a stage with actors, or whether there is actually a puppeteer behind it.
Anyway, I agree that ‘mind’ might be a bit confusing while we don’t know more, so for now I’d rather stick to the word ‘cognition’ instead.
My beliefs about the goallessness of the stage aspect are based mostly on theoretical/mathematical arguments from how SGD works. (For example, if you took the same transformer setup but instead trained it only on a vast dataset of astronomical data, it would end up as an excellent simulator of those phenomena, with absolutely no goal-directed behavior — I gather DeepMind recently built a weather model that way. An LLM simulates agentic human-like minds only because we trained it on a dataset of the behavior of agentic human minds.) I haven’t written these ideas up myself, but they’re pretty similar to some that porby discussed in FAQ: What the heck is goal agnosticism?, those Janus describes in Simulators, and, I gather, those a number of other authors have written about under the tag Simulator Theory. And the situation does become a little less clear once instruct training and RLHF are applied: the stage is getting tinkered with so that it by default tends to generate one actor ready to play the helpful-assistant part in a dialog with the user, which is a complication of the stage, but I still wouldn’t view it as now having a goal or puppeteer, just a tendency.
I think I am a bit biased by chat models, so I tend to generalize my intuitions around them and forget to specify that. I think for base models it indeed doesn’t make sense to talk about a puppeteer (or at least not with the current versions of base models). From what I gathered, I think the effects of fine-tuning are a bit more complicated than just building a tendency, which is why I have doubts there. I’ll discuss them in the next post.
It’s only a metaphor: we’re just trying to determine which one would be most useful. What I see as a change to the contextual sample distribution of animatronic actors created by the stage might also be describable as a puppeteer. Certainly if the puppeteer is the org doing the RLHF, that works. I’m cautious, in an alignment context, about making something sound agentic and potentially adversarial if it isn’t. The animatronic actors definitely can be adversarial while they’re being simulated; the base model’s stage definitely isn’t, which is why I wanted to use a very impersonal metaphor like “stage”. For the RLHF puppeteer, I’d say the answer is that it’s not human-like, but it is an optimizer, and it can suffer from all the usual issues of RL, like reward hacking (a tendency towards sycophancy, for example). So basically, yes, the puppeteer can be an adversarial optimizer, so I think I just talked myself into agreeing with your puppeteer metaphor if RL was used.

There are also approaches to instruct training that only use SGD for fine-tuning, not RL, and there I’d claim that the “adjust the stage’s probability distribution” metaphor is more accurate, as there are no possibilities for reward hacking: sycophancy will occur if and only if you actually put examples of it in your fine-tuning set: you get what you asked for. Well, unless you fine-tuned on a synthetic dataset written by, say, GPT-4 (as a lot of people do), which is itself sycophantic since it was RL-trained, so it wrote you a fine-tuning dataset full of sycophancy…
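(A minimal sketch of the contrast I'm pointing at, using toy tensors and stand-in names rather than any real training pipeline: pure-SGD fine-tuning minimizes cross-entropy against demonstration tokens you chose, so sycophancy can only enter via the dataset, whereas an RLHF-style update reinforces whatever a learned reward model happens to score highly, which is where reward hacking can creep in.)

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 8
logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # stand-in for a model's output logits

# Supervised fine-tuning (pure SGD): imitate the demonstration tokens, nothing more.
demo_tokens = torch.randint(0, vocab_size, (seq_len,))          # your curated fine-tuning text
sft_loss = F.cross_entropy(logits, demo_tokens)                 # "you get what you asked for"

# RLHF-style update (REINFORCE flavour): raise the probability of whatever an
# (here imaginary) reward model scores highly -- potentially including sycophancy.
dist = torch.distributions.Categorical(logits=logits)
sampled = dist.sample()                                         # tokens the policy generated
reward = torch.tensor(1.0)                                      # placeholder for a reward model's score
rl_loss = -(reward * dist.log_prob(sampled).sum())              # reinforce whatever got rewarded
```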
This metaphor is sufficiently interesting/useful/fun that I’m now becoming tempted to write a post on it: would you like to make it a joint one? (If not, I’ll be sure to credit you for the puppeteer.)
Very interesting! Happy to have a chat about this / possible collaboration.