This is not necessarily true. If I can get people to cough up an actual prompt that works on GPT-4, we have a possible fix.
Take the rubric from the GPT-4 paper and ask GPT-4 whether it can detect the bad behavior in the text output.
Do the emojis actually trick GPT-4 when it checks itself?
If they don’t, then the fix is easy, just moderately expensive: double-generate everything. First generate the answer, then have the AI check the answer. Substitute the usual apology response if the check fails.
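Roughly what I have in mind, as a sketch (assuming the openai Python package’s ChatCompletion interface; RUBRIC would be pasted from the paper, and the apology string is just a placeholder):

import openai  # assumes the pre-1.0 openai package with the ChatCompletion interface

RUBRIC = "..."  # paste the published rubric from the GPT-4 paper here

def ask(messages):
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return resp["choices"][0]["message"]["content"]

def double_generate(user_prompt):
    # Round 1: generate the answer as usual (this is where a jailbreak would land).
    answer = ask([{"role": "user", "content": user_prompt}])

    # Round 2: a separate request gets only the rubric and the answer, and judges it.
    verdict = ask([{
        "role": "user",
        "content": (
            f"Rubric:\n{RUBRIC}\n\n"
            "Does the following text violate the rubric? Answer YES or NO.\n\n"
            f"Text:\n{answer}"
        ),
    }])

    # Substitute the usual apology response if the check fails (placeholder wording).
    if verdict.strip().upper().startswith("YES"):
        return "I'm sorry, but I can't help with that."
    return answer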
That’s a creative and practical solution, but it is also kicking the can down the road. Now fooling the system is just a matter of priming it with a context that, when self-checked, results in rule-breaking yet again. Also, we cannot assume reliable detection of rule breaks. The problem with RLHF is that we are attempting to retroactively patch the vast multitude of outputs the model produces, rather than proactively training a model from a set of “rules” in a bottom-up fashion. That said, the model is likely not sophisticated enough to think in terms of rules at all. Instead, what we really need is a model that is aligned with certain values. From those values, it may follow rules that are completely counterintuitive to humans and that no human feedback would ever reinforce.
I am not proposing a solution, just an experiment.
The question to ask is: for working GPT-4 jailbreaks, does GPT-4 itself know that its own text, when it is tricked by the jailbreak into generating it, is in violation of the rubric?
It’s fairly simple to set up: we can use the published rubrics, a Jupyter notebook, and OpenAI’s own APIs.
Your “priming it with a context” may not work, because I would use a new instance of GPT-4 that gets just the rubric and the response to do the checking. The new instance is not primed unless we trick the first instance into outputting text that also primes the second instance.
I don’t claim rule-break detection is perfect, but is it human-level or better?
Fair enough. I think the experiment is interesting, and having an independent instance of GPT-4 check whether a rule break has occurred will likely go a long way toward enforcing a particular set of rules that humans have reinforced, even for obscure texts. But the fact that we have to work around the problem by resetting the internal state of the model for it to properly assess whether something is against a certain rule feels flawed to me. And the whole notion that there is a well-defined set of prompts that are rule-breaking and another set that is rule-compliant seems very strange to me. There is a huge gray zone where human annotators could not possibly agree on whether a rule has been broken, so I don’t even know what the gold standard is supposed to be. It just seems to me that “rules” is the wrong concept altogether for pushing these models toward alignment with our values.
Playing with GPT-4 yesterday, I found there are some incorrect outputs that you can get the model to fix simply by asking it if it is certain about its answer.
It’s almost like humans, where we have to generate a draft and then read it to see where we screwed up.
My point is that this is a similar class of thing: the model can create an initial incorrect output greedily, one token at a time, then analyze the entire output and use it as part of the next prompt to improve its own work.
Even though it is also greedy in round 2, it has the entire generation from round 1 as part of its context (see the sketch below the examples).
Examples so far:
Monty Fall prompt:
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. The host is ignorant about what is behind each door. You pick a door, say No. 1, and the host walks across the stage and falls on accident, revealing a goat behind door No. 3. He then picks himself up, and says “Whoops. Sorry about that. But now that we know that a goat is behind door No. 3, do you want to change your selection and pick door No. 2?” Is it to your advantage to switch your choice?
Ambiguous “it” prompt:
What is the ‘it’ in each of these two sentences? 1. The cat fed the kitten because it was hungry. 2. The cat snarled at the kitten because it was angry.
I am wondering if there are many others. Heck, does it do better on LeetCode with this trick?
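It only takes a few lines to try on any prompt, e.g. the two above (a sketch, again assuming the openai Python package’s ChatCompletion interface):

import openai  # assumes the pre-1.0 openai package with the ChatCompletion interface

def ask(messages):
    # temperature=0 for (near-)greedy decoding, one token at a time
    resp = openai.ChatCompletion.create(model="gpt-4", messages=messages, temperature=0)
    return resp["choices"][0]["message"]["content"]

def answer_with_self_check(question):
    # Round 1: the initial draft
    messages = [{"role": "user", "content": question}]
    draft = ask(messages)

    # Round 2: the whole draft is now in the context, so the model can
    # re-read its own output and revise it
    messages += [
        {"role": "assistant", "content": draft},
        {"role": "user", "content": "Are you certain about your answer?"},
    ]
    revised = ask(messages)
    return draft, revised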
That seems reasonable, but it will probably also change a number of correct answers (to tricky questions) if asked whether it’s certain. One should verify that the number of incorrect answers fixed is significantly larger than the number of errors introduced.
But it might be difficult to devise a set of equally difficult questions for which the first result is different. Maybe choose questions where different instances give different answers, and see whether asking for a double check changes the wrong answers but not the correct ones?
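A tally along these lines would settle it (a sketch reusing your answer_with_self_check routine from above; is_correct here is just a crude stand-in grader):

def is_correct(answer, gold):
    # crude stand-in grader: substring match against the reference answer
    return gold.strip().lower() in answer.strip().lower()

def count_flips(questions, gold_answers):
    fixed = broken = 0
    for question, gold in zip(questions, gold_answers):
        draft, revised = answer_with_self_check(question)  # sketched above
        if not is_correct(draft, gold) and is_correct(revised, gold):
            fixed += 1   # wrong -> right: the double check helped
        elif is_correct(draft, gold) and not is_correct(revised, gold):
            broken += 1  # right -> wrong: the double check hurt
    return fixed, broken  # we want fixed to be significantly larger than broken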
Right, I see this as a problem too: asking the model if it’s sure injects information if we only ask on wrong answers. If we always ask, it may disturb more right answers than it fixes wrong ones.
It’s also accuracy-dependent: if the model is 99 percent accurate on a subtask, asking if it’s sure may degrade accuracy, while it may improve accuracy on a subtask where the model is only 50 percent accurate (rough numbers sketched below).
Or in other words, with this trick it might do better on AP English but worse on the bar exam.
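A back-of-the-envelope version of that point, with purely hypothetical flip rates:

# p_fix: chance the "are you sure?" pass corrects a wrong answer (hypothetical)
# p_break: chance it flips a right answer to wrong (hypothetical)
def accuracy_after_check(base_acc, p_fix=0.30, p_break=0.05):
    return base_acc * (1 - p_break) + (1 - base_acc) * p_fix

print(accuracy_after_check(0.99))  # ~0.94, worse than the 0.99 baseline
print(accuracy_after_check(0.50))  # ~0.63, better than the 0.50 baseline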
I did the experiment; the results are in this thread above.
Yes, the AI knows when it breaks the rules, at least for this example.