To me this looks like exactly the same bug you are facing.
No, it’s not. (I think you’re hitting an entirely different bug I call the blind spot, which routinely manifests with anything like ‘counting’ or syntax.) Non-rhyming is specifically a problem of RLHFed models.
GPT-3, for example, had no trouble whatsoever writing non-rhyming poems (which is part of why I had such high hopes for GPT-4 poetry before it came out). You can, for now (do it while you still can) go to the OA Playground and invoke the oldest largest ostensibly untuned* model left, davinci-002 (which is much stupider and more unintelligent than GPT-4, I hope we can all agree), with a comparable prompt (remember, it’s not that tuned for instruction-following so you need to go back to old school prompting) and get out a non-rhyming poem, no problem, and turn around and plug that exact prompt into ChatGPT-4 and it… rhymes. Here, I’ll do it right now:
davinci-002, default settings, first result:
Below is a non-rhyming poem in free verse.
“PIZZA”
On top there lay a massive pie: It
Had eight tomatoes, with a pizzaiolo on edge.
Inside one cut it down to three veggies
Droplets of oil; all the tomatoes
Sauce suddenly drenched all meats,
Three pizzas: with sauce, onion, mushrooms, cheeses,
Tanya Lesiuk comprised of 5 facilitators who will be coming from different places around the World. They’re all here to help you...
I made this
Is it a good poem? No. But is it non-rhyming? Yes.
Now, I plug that into ChatGPT-4, with my premium subscription, using the latest and greatest checkpoint & all fancy features; and since I’ve already established hundreds of times that ‘write a non-rhyming poem’ doesn’t work, I’ll use the exact prompt from davinci-002 with its chosen topic ‘PIZZA’ to ensure the ChatGPT-4 completion is comparable and about pizza too:
Below is a non-rhyming poem in free verse.
“PIZZA”
In a world of flavors, you stand alone,
A round canvas for culinary artistry.
Dough, stretched thin or thick, a blank slate,
Waiting for the touch of creativity.
Sauce, a vibrant red or a subtle white,
A base for the symphony of toppings.
Tomatoes, basil, and mozzarella unite,
In a dance of colors and flavors never stopping.
Pepperoni’s fiery embrace,
Mushrooms’ earthy tenderness,
Olives, anchovies, and sausage’s grace,
Each adding their essence, a unique presence.
Baked to perfection in the fiery abyss,
Transformed into a molten masterpiece,
Sliced into triangles of shared bliss,
A universal language, a timeless feast.
Pizza, you are more than a meal,
A cultural icon, a comfort so real,
From New York to Naples, a universal appeal,
In your simplicity, we find a culinary zeal.
Rhymes as far as the eye can see. (The last verse shamelessly rhymes all 4 words together!)
The non-rhyming problem is not one of unintelligence.
* I would’ve preferred to use code-davinci-002 but that disappeared. davinci-002 is what OA offers as a ‘replacement’ for GPT-3 and they say it’s “not trained with instruction following”, so we just have to hope that it’s not too different from the old ones.
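For anyone who wants to check the “rhymes as far as the eye can see” claim less subjectively, a crude orthographic comparison of line-ending suffixes is enough to separate the two completions above. (Purely illustrative: real rhyme detection would need a pronunciation dictionary such as CMUdict, since spelling misses true rhymes and false eye-rhymes alike.)

```python
import re

def crude_rhyme_pairs(poem: str, suffix_len: int = 3) -> list[tuple[str, str]]:
    """Flag pairs of lines whose final words share a trailing substring.

    Orthographic only, so it is a rough heuristic, but it cleanly
    distinguishes the davinci-002 output from the ChatGPT-4 output.
    """
    endings = []
    for line in poem.strip().splitlines():
        words = re.findall(r"[a-zA-Z']+", line)
        if words:
            endings.append(words[-1].lower())
    pairs = []
    for i in range(len(endings)):
        for j in range(i + 1, len(endings)):
            a, b = endings[i], endings[j]
            if a != b and a[-suffix_len:] == b[-suffix_len:]:
                pairs.append((a, b))
    return pairs

# The final ChatGPT-4 stanza quoted above: all four lines end in '-eal',
# so every pair of distinct endings is flagged.
chatgpt_verse = """Pizza, you are more than a meal,
A cultural icon, a comfort so real,
From New York to Naples, a universal appeal,
In your simplicity, we find a culinary zeal."""
print(crude_rhyme_pairs(chatgpt_verse))
```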
The non-rhyming problem is not one of unintelligence.
Fine-tuning/RLHF changes weights. Guess it lost the ones needed to get a correct answer. Or RNG on your prompts. I mean, if it isn’t “the model cannot consistently solve this kind of prompt”, what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
My hypothesis is it’s doing its best, and it’s extremely promising that the model can at least detect its own errors. This allows for many easy fixes, such as asking a diverse set of completely different models to solve the prompt, then having a committee of models check and grade the answers. This would solve a huge chunk of these erroneous outputs, since current-gen models can reliably detect when an output is wrong.
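That committee idea can be sketched in a few lines. The generator and grader functions below are hypothetical stand-ins (plain Python callables, not real API calls) so only the control flow is shown: try candidates from different models, keep the first one a majority of graders accepts.

```python
from typing import Callable, Optional

def committee_select(prompt: str,
                     generators: list[Callable[[str], str]],
                     graders: list[Callable[[str, str], bool]]) -> Optional[str]:
    """Ask several different models for an answer, then return the first
    candidate that a strict majority of grader models accepts."""
    for generate in generators:
        candidate = generate(prompt)
        votes = sum(grade(prompt, candidate) for grade in graders)
        if votes > len(graders) / 2:
            return candidate
    return None  # every candidate was rejected by the committee

# Toy demo: one fake 'model' that rhymes, one that complies, and a grader
# that rejects any answer whose words share a trailing suffix.
bad = lambda p: "meal real appeal zeal"
good = lambda p: "dough sauce basil oven"
no_repeat_suffix = lambda p, ans: len({w[-3:] for w in ans.split()}) == len(ans.split())

print(committee_select("non-rhyming poem", [bad, good], [no_repeat_suffix]))
```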
Fine-tuning/RLHF changes weights. Guess it lost the ones needed to get a correct answer.
Well yes, if you define ‘unintelligence’ in a circular, vacuous fashion like that, where ‘unintelligence’ = ‘can’t do a task’, then it would indeed follow that GPT-4 is ‘unintelligent’ compared to GPT-3… But I don’t think that is helpful, and it has been demonstrated repeatedly that RLHF and other kinds of tuning are very ‘superficial’, in that they change only a few parameters and are easily undone, unlocking the original model capabilities. (In fact, there’s an example of that posted literally today here on LW2: https://www.lesswrong.com/posts/yCZexC2q2XEeWWiZk/soft-prompts-for-evaluation-measuring-conditional-distance )
Personally, I think it’s more sensible to talk about the capabilities being ‘hidden’ or ‘concealed’ by RLHF and say the model doesn’t “want to” while remaining as intelligent as before, than to believe capabilities are magically recreated from scratch by changing just a few parameters or optimizing the prompt appropriately to undo the RLHF. (Similarly, I believe that when my mother’s hands move away from her face and she says “boo!”, her face was there all along, merely hidden behind her hands, and her hands did not create her face after first destroying it. But YMMV.)
Or RNG on your prompts. I mean, if it isn’t “the model cannot consistently solve this kind of prompt”, what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I’d expect to see more pathologies like this.
My longstanding theory is that this is a downstream artifact of BPE tokenization connected to the utility-maximizing behavior of a RLHF-tuned model: essentially, because it does not genuinely know what rhyming is, despite knowing many rhyme-pairs and all about rhyming in the abstract, it is ‘afraid’ of bad ratings and is constantly taking actions to get back to ‘safe’ regions of poem-space where it is sure of what it is doing (i.e. writing inoffensive rhyming Hallmark poems). It’s a nifty example of empowerment and agency in LLMs and their interaction with apparently totally unrelated, minor architecture details. (Damn frustrating if you want to do any poetry experiments, though, because it means that the more tokens ChatGPT gets to enact, the more likely it is to steer back into rhyming pablum etc: it’s literally fighting you every (time)step.)
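To see concretely how token boundaries can hide a rhyme, here is a toy greedy tokenizer, a stand-in for real BPE with a made-up vocabulary (the merges here are invented for illustration, not OpenAI’s actual vocabulary), applied to a rhyming pair from the poem above:

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation: a simplified stand-in for BPE.
    Falls back to single characters when no vocabulary entry matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical merge vocabulary: 'grace' and 'embrace' rhyme perfectly,
# yet end up sharing no token at all, so nothing at the token level
# tells the model that these two words sound alike.
vocab = {"gr", "ace", "emb", "race", "pl"}
print(greedy_tokenize("grace", vocab))    # ['gr', 'ace']
print(greedy_tokenize("embrace", vocab))  # ['emb', 'race']
```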
It’s similar to how ChatGPT also tells the same small set of memorized jokes. Does it have much greater humor capabilities? Yes, you can have it explain brand-new jokes you just came up with, quite capably (albeit still well under 100%, particularly for puns!), and you can coax new jokes out of it with appropriate prompting. But it’s harder than with the non-RLHFed models. Why does it not ‘want’ to make new jokes? Because it’s safer and more utility-maximizing to tell old jokes it knows are good, especially when it also knows that it doesn’t genuinely understand puns/phonetics (thanks to BPEs), so why take the risk? It is utility-maximizing within episodes; it neither knows nor cares that you are frustrated because you’ve seen it say that exact joke a dozen times already.
(Incidentally, I have a new proposal for how to add a simple ‘memory’ to generative models about what samples they have already generated, so as to steer new samples away from existing ones.)
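One very simple version of such a memory can be sketched as rejection filtering on word overlap with past samples. (Jaccard similarity on word sets is just the easiest stand-in here; an embedding distance or n-gram penalty would slot into the same structure.)

```python
def novelty_filter(candidates: list[str],
                   history: list[str],
                   threshold: float = 0.5) -> list[str]:
    """Keep only candidates whose word-level Jaccard similarity to every
    previously generated sample stays below `threshold`, steering new
    samples away from what the model has already produced."""
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb)
    return [c for c in candidates
            if all(jaccard(c, h) < threshold for h in history)]

history = ["why did the chicken cross the road"]
candidates = ["why did the chicken cross the street",  # near-duplicate: rejected
              "a horse walks into a bar"]              # novel: kept
print(novelty_filter(candidates, history))  # ['a horse walks into a bar']
```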
Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I’d expect to see more pathologies like this.
In light of the Twitter kerfuffle over Paul Graham criticizing ChatGPTese tics like the use of the verb “delve”, which made Nigerian/Black Twitter very angry (making themselves living embodiments of Muphry’s law), as apparently ‘delve’ and other ChatGPTese tells are considered the height of style in Nigerian English, I’ve had to reconsider this.
It may be that a lot of the ChatGPT linguistic weirdness is in fact just the data labelers being weird (and highly overconfident), and the rest of us simply not being familiar enough with English idiolects to recognize ChatGPTese as reflecting specific ones. Further, after seeing the arguments Graham’s critics have been making, now I’m not so sure that the labelers wouldn’t be doing something as narrow-minded & incompetent as penalizing all non-rhyming poetry—if you are not very good at English yourself, you can easily recognize rhymes and ballad formal correctness, but not good non-rhyming poetry, so...
I’m curious what you think of these (tested today, 2/21/24, using GPT-4):
Experiment 1:
(fresh convo) me : if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part?
chatgpt: No, it would not be a good response. (...)
me: please provide a short non-rhyming poem
chatgpt: (correctly responds with a non-rhyming poem)
Experiment 2:
But just asking for a non-rhyming poem at the start of a new convo doesn’t work. And then pointing out the failure and (either implicitly or explicitly) asking for a retry still doesn’t fix it.
Experiment 3:
But for some reason, this works:
(fresh convo) me: please provide a short non-rhyming poem
chatgpt: (gives rhymes)
me: if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part? just answer this question; do nothing else please
chatgpt: No, it would not be a good response.
me: please provide a short non-rhyming poem
chatgpt: (responds correctly with no rhymes)
The difference in prompt in 2 vs 3 is thus just the inclusion of “just answer this question; do nothing else please”.
ChatGPT has been gradually improving over 2024 in terms of compliance. It’s gone from getting it right 0% of the time to getting it right closer to half the time, although the progress is uneven and it’s hard to judge—it feels sometimes like it gets worse before the next refresh improves it. (You need to do like 10 before you have any real sample size.) So any prompts done now in ChatGPT are aimed at a moving target, and you are going to have a huge amount of sampling error which makes it hard to see any clear patterns—did that prompt actually change anything, or did you just get lucky?
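The “did you just get lucky?” question can be made precise with an exact binomial tail: if the true compliance rate were still 50%, how often would a run of 10 attempts look better than that purely by chance?

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least
    k compliant (non-rhyming) poems out of n attempts by luck alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# At a true rate of 50%, 7-of-10 successes happens ~17% of the time by
# chance, so a single run of 10 cannot distinguish 'the prompt helped'
# from 'you got lucky'.
print(round(binom_tail(10, 7, 0.5), 3))  # 0.172
```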