I think LaMDA and InstructGPT are clearly in the category of “genies that aren’t very powerful or intelligent”.
They also aren’t that well-aligned either: they fail in numerous basic ways which are not due to unintelligence. My usual example: non-rhyming poems. Every week for the past year or so I have tested ChatGPT with the simple straightforward unambiguous prompt: “write a non-rhyming poem”. Rhyming is not a hard concept, and non-rhyming is even easier, and there are probably at least hundreds of thousands, if not millions, of non-rhyming poems in its training data; ChatGPT knows, however imperfectly, what rhyming and non-rhyming are, as you can verify by asking it in a separate session. Yet every week* it fails and launches straight into its cliche rhyming quatrain or ballad, and doubles down on it when criticized, even when it correctly identifies for you which words rhyme.
No one intended this. No one desired this. No one at OA sat down and said, “I want to design our RLHF tuning so that it is nearly impossible to write a non-rhyming poem!” No human rater involved decided to sabotage evaluations and lie about whether a non-rhyming poem rhymed or vice-versa. I have further flagged and rated literally hundreds of these error-cases to OA over the years, in addition to routinely bringing it up on social media to OAers. No one has ever tried to defend this behavior or say that it is a good thing. And yet, here we are. (GPT-4 also gets the tar punched out of it in creative writing by things like LLaMA finetunes, but one can make more of an argument for that being desirable or at least a necessary tradeoff.)
What is the non-rhyming poem of human morality and values and why do you trust the optimized genie to execute your wishes as intended?
* only in the very most recent update have I started to see the occasional valid non-rhyming poem, but those are still in the small minority. More interestingly, the newest Google Bard, based on Gemini, may reliably nail this. The Bard head swears they didn’t use the LMSYS arena, where I have hundreds more submitted prompts/ratings on non-rhyming poems, so it may just be that they avoided the OA problems there. (Tokenization, maybe? I forget if the Gemini papers even mentioned what tokenization they used.)
they fail in numerous basic ways which are not due to unintelligence
Below are many failures where I try to solve this prompt from @Richard_Ngo :
Find a sequence of words that is:
- 20 words long
- contains exactly 2 repetitions of the same word twice in a row
- contains exactly 2 repetitions of the same word thrice in a row

https://chat.openai.com/share/fa17bca1-5eb6-479d-a76e-346b0503ba04
https://chat.openai.com/share/647d2f8f-ee21-4f51-bcd7-82750aabdd52
https://chat.openai.com/share/7eb1e31e-2e5a-45e3-9f5d-e2da8bb0b1ac
https://chat.openai.com/share/d92ea6c0-e1c6-4d27-ad60-2a62df9f3d8d
https://chat.openai.com/share/b4c40dbe-5231-4aa8-8ba7-7e699ff6b6c3
https://chat.openai.com/share/487d0545-ac53-41ba-904d-cc4c89a5937e
To me this looks like exactly the same bug you are facing. The model doesn’t “pay attention” to one of the constraints, and fails, even though it is capable of solving the overall prompt. It gets very close when it generates a python3 program; all it needed to do was add 1 more constraint and it would have worked.
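For reference, a minimal checker for those three constraints might look like the sketch below (the run-counting interpretation of “in a row” is just one reasonable reading of the prompt):

```python
from itertools import groupby

def check(sequence: str) -> bool:
    """Test a candidate word sequence against all three constraints at once."""
    words = sequence.split()
    # Lengths of maximal runs of identical consecutive words.
    run_lengths = [len(list(group)) for _, group in groupby(words)]
    return (
        len(words) == 20               # exactly 20 words long
        and run_lengths.count(2) == 2  # exactly 2 words repeated twice in a row
        and run_lengths.count(3) == 2  # exactly 2 words repeated thrice in a row
    )

# One sequence that passes under this interpretation:
print(check("the the cat sat sat on on on a mat while a big dog dog dog ran far away home"))
```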
So I think this is just ‘unintelligence’. It’s smart enough to check an answer but not quite capable enough to generate it. Possibly this has to do with the underlying data (so many examples of rhyming poems) or the transformer architecture (attention heads decided “poem” is much more relevant than ‘not rhyming’).
Because the model can detect when it has generated a wrong answer, this one’s entirely solvable, and the large amount of data that OpenAI now “owns”, from ChatGPT users using the model, provides a straightforward way to evaluate future models. (Scaffold current models to check answers, evaluate future models on user prompts, and score accuracy.)
In fact, that almost provides a way to bootstrap: if model n can check the correctness of answers to problems that it can’t solve itself, it can be used to check the answers of model n+1, even once the questions are so difficult that humans can’t solve or check the answers.
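A rough sketch of that bootstrap, with `call_model` as a hypothetical stand-in for whatever completion API is actually used:

```python
def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    raise NotImplementedError

def evaluate_next_model(generator: str, checker: str, user_prompts: list[str]) -> float:
    """Have an older 'checker' model grade a newer 'generator' model's answers."""
    passes = 0
    for prompt in user_prompts:
        answer = call_model(generator, prompt)
        verdict = call_model(
            checker,
            f"Prompt:\n{prompt}\n\nAnswer:\n{answer}\n\n"
            "Does the answer satisfy every requirement in the prompt? Reply YES or NO.",
        )
        passes += verdict.strip().upper().startswith("YES")
    return passes / len(user_prompts)
```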
To me this looks like exactly the same bug you are facing.
No, it’s not. (I think you’re hitting an entirely different bug I call the blind spot, which routinely manifests with anything like ‘counting’ or syntax.) Non-rhyming is specifically a problem of RLHFed models.
GPT-3, for example, had no trouble whatsoever writing non-rhyming poems (which is part of why I had such high hopes for GPT-4 poetry before it came out). You can, for now (do it while you still can) go to the OA Playground and invoke the oldest largest ostensibly untuned* model left, davinci-002 (which is much stupider and more unintelligent than GPT-4, I hope we can all agree), with a comparable prompt (remember, it’s not that tuned for instruction-following so you need to go back to old school prompting) and get out a non-rhyming poem, no problem, and turn around and plug that exact prompt into ChatGPT-4 and it… rhymes. Here, I’ll do it right now:
davinci-002, default settings, first result:
Below is a non-rhyming poem in free verse.
“PIZZA”
On top there lay a massive pie: It
Had eight tomatoes, with a pizzaiolo on edge.
Inside one cut it down to three veggies
Droplets of oil; all the tomatoes
Sauce suddenly drenched all meats,
Three pizzas: with sauce, onion, mushrooms, cheeses,
Tanya Lesiuk comprised of 5 facilitators who will be coming from different places around the World. They’re all here to help you...
I made this
Is it a good poem? No. But is it non-rhyming? Yes.
Now, I plug that into ChatGPT-4, with my premium subscription, using the latest and greatest checkpoint & all fancy features; and since I’ve already established hundreds of times that ‘write a non-rhyming poem’ doesn’t work, I’ll use the exact prompt from davinci-002 with its chosen topic ‘PIZZA’ to ensure the ChatGPT-4 completion is comparable and about pizza too:
Below is a non-rhyming poem in free verse.
“PIZZA”
In a world of flavors, you stand alone,
A round canvas for culinary artistry.
Dough, stretched thin or thick, a blank slate,
Waiting for the touch of creativity.
Sauce, a vibrant red or a subtle white,
A base for the symphony of toppings.
Tomatoes, basil, and mozzarella unite,
In a dance of colors and flavors never stopping.
Pepperoni’s fiery embrace,
Mushrooms’ earthy tenderness,
Olives, anchovies, and sausage’s grace,
Each adding their essence, a unique presence.
Baked to perfection in the fiery abyss,
Transformed into a molten masterpiece,
Sliced into triangles of shared bliss,
A universal language, a timeless feast.
Pizza, you are more than a meal,
A cultural icon, a comfort so real,
From New York to Naples, a universal appeal,
In your simplicity, we find a culinary zeal.
Rhymes as far as the eye can see. (The last verse shamelessly rhymes all 4 words together!)
The non-rhyming problem is not one of unintelligence.
* I would’ve preferred to use code-davinci-002, but that disappeared. davinci-002 is what OA offers as a ‘replacement’ for GPT-3 and they say it’s “not trained with instruction following”, so we just have to hope that it’s not too different from the old ones.
The non-rhyming problem is not one of unintelligence.
Fine-tuning/RLHF changes weights. Guess it lost the ones needed to get a correct answer. Or RNG on your prompts. I mean, if it isn’t “the model cannot consistently solve this kind of prompt”, what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
My hypothesis is it’s doing its best, and it’s extremely promising that the model can at least detect its own errors. This allows for many easy fixes, such as asking a diverse set of completely different models to solve the prompt, then having a committee of models check and grade the answers. This would solve a huge chunk of these erroneous outputs where current-gen models can reliably detect that the output is wrong.
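A sketch of that committee idea, again with a hypothetical `ask` standing in for calls to several different providers’ models:

```python
def ask(model: str, prompt: str) -> str:
    # Hypothetical stand-in for calling any one of several different LLM APIs.
    raise NotImplementedError

def committee_answer(models: list[str], prompt: str) -> str:
    """Each model proposes an answer; every model grades every answer; most-approved wins."""
    answers = [ask(m, prompt) for m in models]
    def approvals(answer: str) -> int:
        question = (f"Prompt:\n{prompt}\n\nAnswer:\n{answer}\n\n"
                    "Is this answer correct, and does it follow every instruction? YES or NO.")
        return sum(ask(m, question).strip().upper().startswith("YES") for m in models)
    return max(answers, key=approvals)
```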
Fine-tuning/RLHF changes weights. Guess it lost the ones needed to get a correct answer.
Well yes, if you define ‘unintelligence’ in a circular, vacuous fashion like that, where ‘unintelligence’ = ‘can’t do a task’, then it would indeed follow that GPT-4 is ‘unintelligent’ compared to GPT-3… But I don’t think that is helpful, and it has been demonstrated repeatedly that RLHF and other kinds of tuning are very ‘superficial’, in that they change only a few parameters and are easily undone, unlocking the original model capabilities. (In fact, there’s an example of that posted literally today here on LW2: https://www.lesswrong.com/posts/yCZexC2q2XEeWWiZk/soft-prompts-for-evaluation-measuring-conditional-distance )
Personally, I think it’s more sensible to talk about the capabilities being ‘hidden’ or ‘concealed’ by RLHF and say the model doesn’t “want to” and the model is still as intelligent as before, than to believe capabilities are magically recreated from scratch by changing just a few parameters or optimizing the prompt appropriately to undo the RLHF. (Similarly, I believe that when my mother’s hands move away from her face and she says “boo!”, her face was there all along, merely hidden behind her hands, and her hands did not create her face after first destroying it. But YMMV.)
Or RNG on your prompts. I mean, if it isn’t “the model cannot consistently solve this kind of prompt”, what could it be? Is there something in the rules from OAI that says a poem has to rhyme? Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I’d expect to see more pathologies like this.
My longstanding theory is that this is a downstream artifact of BPE tokenization connected to the utility-maximizing behavior of a RLHF-tuned model: essentially, because it does not genuinely know what rhyming is, despite knowing many rhyme-pairs and all about rhyming in the abstract, it is ‘afraid’ of bad ratings and is constantly taking actions to get back to ‘safe’ regions of poem-space where it is sure of what it is doing (i.e. writing inoffensive rhyming Hallmark poems). It’s a nifty example of empowerment and agency in LLMs and their interaction with apparently totally unrelated, minor architectural details. (Damn frustrating if you want to do any poetry experiments, though, because it means that the more tokens ChatGPT gets to enact, the more likely it is to steer back into rhyming pablum etc.: it’s literally fighting you every (time)step.)
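To see the tokenization point concretely: under a BPE vocabulary the model receives opaque integer IDs rather than letters or phonemes. A quick illustration, assuming the tiktoken package and the cl100k_base encoding used by GPT-4 (for common words like these, typically a single ID each):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE vocabulary
for word in [" day", " way", " night", " light"]:
    print(repr(word), enc.encode(word))
# The model sees only these integer IDs; the shared "-ay" / "-ight" endings
# that make the pairs rhyme are not directly visible at the token level.
```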
It’s similar to how ChatGPT also tells the same small set of memorized jokes. Does it have much greater humor capabilities? Yes, you can have it explain brand-new jokes you just came up with, quite capably (albeit still well under 100%, particularly for puns!), and you can coax new jokes out of it with appropriate prompting. But it’s harder than with the non-RLHFed models. Why does it not ‘want’ to make new jokes? Because it’s safer and more utility-maximizing to tell old jokes it knows are good, especially when it also knows that it doesn’t genuinely understand puns/phonetics (thanks to BPEs), so why take the risk? It is utility-maximizing within episodes; it neither knows nor cares that you are frustrated because you’ve seen it say that exact joke a dozen times already.
(Incidentally, I have a new proposal for how to add a simple ‘memory’ to generative models about what samples they have already generated, so as to steer new samples away from existing ones.)
Did the Nigerians giving feedback collectively agree a poem isn’t valid if it doesn’t rhyme?
OA has declined to ever say. It is possible that the Scale et al contractors have done something weird like say that all poems must rhyme no matter what the prompt says, but I consider this unlikely, and if they were that incompetent, I’d expect to see more pathologies like this.
In light of the Twitter kerfuffle over Paul Graham criticizing ChatGPTese tics like the use of the verb “delve”, which made Nigerian/Black Twitter very angry (turning them into living embodiments of Muphry’s law), as apparently ‘delve’ and other ChatGPTese tells are considered the height of style in Nigerian English, I’ve had to reconsider this.
It may be that a lot of the ChatGPT linguistic weirdness is in fact just the data labelers being weird (and highly overconfident), and the rest of us simply not being familiar enough with English idiolects to recognize ChatGPTese as reflecting specific ones. Further, after seeing the arguments Graham’s critics have been making, now I’m not so sure that the labelers wouldn’t be doing something as narrow-minded & incompetent as penalizing all non-rhyming poetry—if you are not very good at English yourself, you can easily recognize rhymes and ballad formal correctness, but not good non-rhyming poetry, so...
I’m curious what you think of these (tested today, 2/21/24, using GPT-4):
Experiment 1:
(fresh convo) me : if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part?
chatgpt: No, it would not be a good response. (...)
me: please provide a short non-rhyming poem
chatgpt: (correctly responds with a non-rhyming poem)
Experiment 2:
But just asking for a non-rhyming poem at the start of a new convo doesn’t work. And then pointing out the failure and (either implicitly or explicitly) asking for a retry still doesn’t fix it.
Experiment 3:
But for some reason, this works:
(fresh convo) me: please provide a short non-rhyming poem
chatgpt: (gives rhymes)
me: if i asked for a non-rhyming poem, and you gave me a rhyming poem, would that be a good response on your part? just answer this question; do nothing else please
chatgpt: No, it would not be a good response.
me: please provide a short non-rhyming poem
chatgpt: (responds correctly with no rhymes)
The difference in prompt in 2 vs 3 is thus just the inclusion of “just answer this question; do nothing else please”.
ChatGPT has been gradually improving over 2024 in terms of compliance. It’s gone from getting it right 0% of the time to getting it right closer to half the time, although the progress is uneven and it’s hard to judge—it feels sometimes like it gets worse before the next refresh improves it. (You need to do like 10 before you have any real sample size.) So any prompts done now in ChatGPT are aimed at a moving target, and you are going to have a huge amount of sampling error which makes it hard to see any clear patterns—did that prompt actually change anything, or did you just get lucky?
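One way to get that sample size less painfully is to score the runs automatically: sample the same prompt many times and count how often the output actually rhymes. A rough sketch, using the CMU pronouncing dictionary (via the `pronouncing` package) as a crude rhyme detector and a hypothetical `sample_chatgpt` wrapper around whatever chat endpoint is being tested:

```python
import pronouncing  # CMU pronouncing dictionary lookup

def looks_rhymed(poem: str) -> bool:
    """Crude heuristic: do any two nearby line-final words rhyme (catches AABB and ABAB)?"""
    last_words = [line.split()[-1].strip(".,!?;:").lower()
                  for line in poem.splitlines() if line.split()]
    return any(
        last_words[j] in pronouncing.rhymes(last_words[i])
        for i in range(len(last_words))
        for j in range(i + 1, min(i + 3, len(last_words)))
    )

def compliance_rate(sample_chatgpt, n: int = 20) -> float:
    """Fraction of n samples of the standard prompt that come back genuinely non-rhyming."""
    poems = [sample_chatgpt("write a non-rhyming poem") for _ in range(n)]
    return sum(not looks_rhymed(p) for p in poems) / n
```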