My beliefs about the goallessness of the stage aspect are based mostly on theoretical/mathematical arguments from how SGD works. (For example, if you took the same transformer setup, but instead trained it only on a vast dataset of astronomical data, it would end up as an excellent simulator of those phenomena with absolutely no goal-directed behavior — I gather DeepMind recently built a weather model that way. An LLM simulates agentic human-like minds only because we trained it on a dataset of the behavior of agentic human minds.) I haven’t written these ideas up myself, but they’re pretty similar to some that porby discussed in FAQ: What the heck is goal agnosticism? and those Janus describes in Simulators, or that I gather a number of other authors have written about under the tag Simulator Theory. And the situation does become a little less clear once instruct training and RLHF are applied: the stage is getting tinkered with so that by default it tends to generate one actor ready to play the helpful-assistant part in a dialog with the user. That's a complication of the stage, but I still wouldn’t view it as now having a goal or a puppeteer, just a tendency.
I think I am a bit biased by chat models, so I tend to generalize my intuitions from them and forget to specify that. I think for base models it indeed doesn’t make sense to talk about a puppeteer (or at least not for the current versions of base models). From what I’ve gathered, though, the effects of fine-tuning are a bit more complicated than just building a tendency, which is why I have doubts there. I’ll discuss them in the next post.
It’s only a metaphor: we’re just trying to determine which one would be most useful. What I see as a change to the contextual sample distribution of animatronic actors created by the stage might also be describable as a puppeteer. Certainly if the puppeteer is the org doing the RLHF, that works. I’m cautious, in an alignment context, about making something sound agentic and potentially adversarial if it isn’t. The animatronic actors definitely can be adversarial while they’re being simulated; the base model’s stage definitely isn’t, which is why I wanted to use a very impersonal metaphor like “stage”. For the RLHF puppeteer, I’d say that it’s not human-like, but it is an optimizer, and it can suffer from all the usual issues of RL like reward hacking: a tendency towards sycophancy, for example. So basically, yes, the puppeteer can be an adversarial optimizer, and I think I just talked myself into agreeing with your puppeteer metaphor if RL was used. There are also approaches to instruct training that only use SGD for fine-tuning, not RL, and there I’d claim that the “adjust the stage’s probability distribution” metaphor is more accurate, as there are no possibilities for reward hacking: sycophancy will occur if and only if you actually put examples of it in your fine-tuning set: you get what you asked for. Well, unless you fine-tuned on a synthetic dataset written by, say, GPT-4 (as a lot of people do), which is itself sycophantic since it was RL-trained, and so wrote you a fine-tuning dataset full of sycophancy…
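To make that RL-vs-SGD distinction concrete, here's a toy sketch (purely hypothetical numbers, not any real training setup): a two-response "policy" where an imperfect reward model slightly prefers the sycophantic response. A REINFORCE-style policy-gradient update drifts toward sycophancy even though no training example contains it, while cross-entropy-style fitting of a dataset can only reproduce what the dataset contains.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy policy over two responses: index 0 = honest, 1 = sycophantic.
logits = np.array([0.0, 0.0])

# Hypothetical learned reward model that (unintentionally) rates
# sycophancy a bit higher -- the opening for reward hacking.
reward = np.array([1.0, 1.2])

# RL-style update: gradient ascent on expected reward.
# d/dlogit_k of E[r] under a softmax policy is p_k * (r_k - E[r]).
for _ in range(200):
    p = softmax(logits)
    grad = p * (reward - (p * reward).sum())
    logits += 0.5 * grad

rl_p = softmax(logits)  # probability mass drifts to the sycophantic option

# SGD-style update: fit the empirical distribution of a fine-tuning set.
# If the dataset contains only honest answers, so does the fitted policy:
# you get what you put in.
dataset_counts = np.array([100, 0])  # honest examples only
sgd_p = dataset_counts / dataset_counts.sum()

print(rl_p)   # sycophantic response dominates
print(sgd_p)  # honest response only
```

The point of the sketch is only that the RL objective optimizes against the reward model, so any flaw in the reward model gets amplified, whereas the SGD objective optimizes toward the dataset distribution, so flaws can only come in through the data itself.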
This metaphor is sufficiently interesting/useful/fun that I’m now becoming tempted to write a post on it: would you like to make it a joint one? (If not, I’ll be sure to credit you for the puppeteer.)
Very interesting! Happy to have a chat about this / possible collaboration.