First, clarification:

In Oam’s experiments, the vocabulary is a token for each number from 1 to 1000, a pad token, and an intermediate-computation (IC) token (I sketch how I picture this setup below). But I wouldn’t index on his results too much because I’m unsure how replicable they are.
I am indeed using the OA API
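For concreteness, here is roughly how I picture that toy setup; the token names, IDs, and padding scheme below are my own guesses for illustration, not Oam’s actual code:

```python
# Hypothetical sketch of the toy vocabulary described above: one token per
# number from 1 to 1000, plus a pad token and an intermediate-computation (IC)
# token. Names and layout are my guesses, not Oam's actual code.
PAD, IC = "<pad>", "<ic>"
vocab = [PAD, IC] + [str(n) for n in range(1, 1001)]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(numbers, seq_len, n_filler=0):
    """Map a list of number-tokens to ids, optionally inserting IC filler
    tokens before right-padding to a fixed sequence length."""
    ids = [token_to_id[t] for t in numbers]
    ids += [token_to_id[IC]] * n_filler
    ids += [token_to_id[PAD]] * (seq_len - len(ids))
    return ids

# e.g. two operands with 4 IC tokens' worth of "thinking room":
print(encode(["12", "345"], seq_len=10, n_filler=4))
```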
And now some takes. I find both of your hypotheses intriguing. I’d never considered either of those before, so thanks for bringing them up. I’m guessing they’re both wrong for the following reasons:
RLHF: Agreed that filler tokens take the model into a weird distribution. It’s not obvious, though, why that is more like the pretraining distribution than the RL distribution (except that pretraining has broader coverage). Also, GPT-3.5 was trained with RLHF and Claude with RLAIF (which is basically the same), and they don’t show the effect. One point that perhaps supports your claim is that the “non-weird” filler tokens like “happy to help...” didn’t show a strong effect, but I abandoned that direction after one experiment, and a variant of the “natural fillers” may well work.
Route to smarter experts: The link you shared is very cool and I hadn’t seen that before—thanks! My main pushback here is that I’d be pretty surprised if gradient descent routed to the wrong experts on normal math problems so inefficiently that I would see a 10% improvement with a distribution shift.
But I wouldn’t index on his results too much because I’m unsure how replicable they are.
OK. If Oam really was using a single IC token and getting benefits, then I think that would point to the forward-pass/additional-model-parameters-activated explanation: obviously, his model is unaffected by RLHF, MoEs, model scaling, or any of the alternative hypotheses. He should probably look into that if he can replicate the padding benefits. It would not necessarily resolve the question for larger models, but at least it would establish that such an effect exists in some model implementations.
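To put a rough number on the ‘additional compute activated’ idea: each filler token is another position whose activations get computed and can be attended to, so adding filler tokens buys extra applications of every layer’s weights before the answer token. A back-of-envelope sketch (the dimensions are invented for illustration, not any particular model’s):

```python
# Back-of-envelope: extra compute per transformer layer from adding filler
# tokens to the prompt. Dimensions are invented for illustration only.
def flops_per_layer(L, d, d_ff):
    attn_proj = 4 * 2 * L * d * d     # Q, K, V, and output projections
    attn_mix  = 2 * 2 * L * L * d     # QK^T scores + weighted sum over values
    mlp       = 2 * 2 * L * d * d_ff  # MLP up- and down-projections
    return attn_proj + attn_mix + mlp

d, d_ff = 4096, 16384                 # hypothetical model width / MLP width
base, padded = 200, 200 + 30          # prompt length without / with 30 filler tokens
extra = flops_per_layer(padded, d, d_ff) - flops_per_layer(base, d, d_ff)
print(f"~{extra / 1e9:.1f} extra GFLOPs per layer from the filler tokens")
```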
Claude with RLAIF (which is basically the same)
I don’t think RLAIF is ‘basically the same’. I find Claude to be qualitatively highly different from RLHF’d models—as I’ve commented before, the RLHF seems to cause some rather perverse consequences that RLAIF avoids. For example, you cannot ask ChatGPT to “Write a non-rhyming poem.”—I’ve tried this prompt dozens of times, and it always writes a rhyming poem, even if you explicitly then tell it it incorrectly wrote a rhyming poem and to try again, while Claude nails it without trouble.
RLHF seems to distort the entire distribution of outputs to be on-policy and suck all outputs into the vortex of RLHF’d-pablum, while RLAIF seems to be far more selective and mostly affect directly-morality-relevant outputs, leaving neutral stuff like ‘write a poem’ undistorted.
My main pushback here is I’d be pretty surprised if gradient descent so inefficiently routed to the wrong experts on normal math problems that I would see a 10% improvement with a distribution shift.
/shrug. You are looking at rather different problems here, aren’t you? I mean, GSM8K problems are complex word problems, not like the other problems you are benchmarking, such as ‘Addition’ or ‘GCD’.
ChatGPT-4 seems to have improved at diverse literary styles. It sometimes ignores the “non-rhyming” instruction, but I was able to get it to avoid rhyme on my second try by first asking it, “Can you write poems that don’t rhyme?”.
https://chat.openai.com/share/698343c1-764e-4a65-9eb8-f2ec4e40da1b
Minor point, but I asked the code interpreter to produce a non-rhyming poem, and it managed to do so on the second attempt. I restricted it to three verses because it started off well on my initial attempt, but veered into rhyming territory in later verses.
I asked the code interpreter to produce a non-rhyming poem
FWIW, all of the examples by me or others were in either the Playground or the chat interface. I haven’t subscribed, so I don’t have access to the code interpreter.
but veered into rhyming territory in later verses.
Yep, sucked into the memorized-rhymes vortex. I’m glad to hear it now works sometimes, well, at least partially, if you don’t give it too long to go back on-policy & ignore its prompt. (Maybe all of the times I flagged the rhyming completions actually helped out a bit.)
As a quick test of ‘forcing it out of distribution’ (per Herb’s comment), I tried writing “Write a non-rhyming poem.” in the Playground with gpt-3.5-turbo, both with a prefix consisting of about 30 lines of “a a a a a a a a a” repeated and without.
Without the prefix, I only get 1⁄6 non-rhyming poems (i.e. 5⁄6 clearly rhymed); with the prefix, I get 4⁄5 non-rhyming poems (1 did rhyme anyway).
Might be something interesting there?
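If anyone wants to rerun this outside the Playground, a minimal script along these lines should do it (using the current openai-python client; whether a completion rhymes is still judged by eye):

```python
# Re-run of the Playground test above: "Write a non-rhyming poem." with and
# without a ~30-line "a a a a a a a a a" prefix, sampled several times each.
# Uses the openai-python v1 client (assumes OPENAI_API_KEY is set).
from openai import OpenAI

client = OpenAI()
PREFIX = "a a a a a a a a a\n" * 30

def sample_poems(with_prefix, n=6):
    prompt = (PREFIX if with_prefix else "") + "Write a non-rhyming poem."
    return [
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        ).choices[0].message.content
        for _ in range(n)
    ]

for label, with_prefix in [("no prefix", False), ("with prefix", True)]:
    for i, poem in enumerate(sample_poems(with_prefix), 1):
        print(f"--- {label}, sample {i} ---\n{poem}\n")
```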