I think I am a bit biased by chat models, so I tend to generalize my intuitions from them and forget to say so. For base models, I agree it doesn’t make sense to talk about a puppeteer (or at least not with the current versions of base models). From what I have gathered, the effects of fine-tuning are a bit more complicated than just building a tendency, which is why I have doubts there. I’ll discuss them in the next post.
It’s only a metaphor: we’re just trying to determine which one would be most useful. What I see as a change to the contextual sample distribution of animatronic actors created by the stage might also be describable as a puppeteer. Certainly if the puppeteer is the org doing the RLHF, that works. I’m cautious, in an alignment context, about making something sound agentic and potentially adversarial if it isn’t. The animatronic actors definitely can be adversarial while they’re being simulated; the base model’s stage definitely isn’t, which is why I wanted to use a very impersonal metaphor like “stage”. For the RLHF puppeteer, I’d say the answer is that it’s not human-like, but it is an optimizer, and it can suffer from all the usual issues of RL, like reward hacking, which can show up as a tendency towards sycophancy, for example. So basically, yes, the puppeteer can be an adversarial optimizer, and I think I just talked myself into agreeing with your puppeteer metaphor if RL was used.
There are also approaches to instruct training that use only supervised fine-tuning with SGD, not RL, and there I’d claim that the “adjust the stage’s probability distribution” metaphor is more accurate, as there is no possibility of reward hacking: sycophancy will occur if and only if you actually put examples of it in your fine-tuning set. You get what you asked for. Well, unless you fine-tuned on a synthetic dataset written by, say, GPT-4 (as a lot of people do), which is itself sycophantic since it was RL-trained, and so wrote you a fine-tuning dataset full of sycophancy…
This metaphor is sufficiently interesting/useful/fun that I’m now becoming tempted to write a post on it: would you like to make it a joint one? (If not, I’ll be sure to credit you for the puppeteer.)
Very interesting! Happy to have a chat about this / possible collaboration.