I’ve got to give you epistemic credit here; this part is looking more correct since the release of GPT-o1:
Obviously AGI will do things like ‘plan’ or ‘model’ or ‘search’
And GPT-o1’s improvements look like the addition of something like a General Purpose Search process, as implemented by Q*/Strawberry, that actually works in a scalable way and gets some surprisingly good generalization; the only reason it isn’t more impactful is that the General Purpose Search still depends on compute budgets, and it has no more compute than GPT-4o.
https://www.lesswrong.com/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see
(There’s an argument that LW is too focused on the total model-based RL case of AI like AIXI for AI safety concerns, but that’s a much different argument than claiming that model-based RL is only a small correction at best.)
Speaking of GPT-4 o1-mini/preview, I think I might’ve accidentally already run into an example of search’s characteristic ‘flipping’ or ‘switching’, where at a certain search depth, it abruptly changes to a completely different, novel, unexpected (and here, undesired) behavior.
So one of my standard tests is the ‘S’ poem from the Cyberiad: “Have it compose a poem—a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter ‘s’!”
This is a test most LLMs do very badly on, for obvious reasons; tokenization aside, it is pretty much impossible to write a decent poem which satisfies these constraints purely via a single forward pass with no planning, iteration, revision, or search. Neither you, nor I, nor the original translator could do that; GPT-3 couldn’t do it, GPT-4o still can’t do it; and I’ve never seen an LLM do it. (They can revise it if you ask, but the simple approach tends to hit local optima where there are still a lot of words violating the ‘s’-constraint.) But the original translation’s poem is also obscure enough that they don’t just print it out either, especially after tuning to discourage reciting memorized or copyrighted text.
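The mechanical constraints, at least, are trivial to verify even though they are hard to satisfy in a single pass. A minimal checker sketch (illustrative only; it tests just the countable constraints, using the all-s Samson poem quoted later in this thread as the test case):

```python
import string

def check_s_poem(poem: str) -> list[str]:
    """Check the mechanical Cyberiad constraints: exactly six lines,
    every word beginning with 's'. Rhyme, meter, and loftiness are
    left to the reader. Returns violations; an empty list is a pass."""
    violations = []
    lines = [ln for ln in poem.strip().splitlines() if ln.strip()]
    if len(lines) != 6:
        violations.append(f"expected 6 lines, got {len(lines)}")
    for i, line in enumerate(lines, 1):
        for raw in line.split():
            word = raw.strip(string.punctuation + "\u2018\u2019\u201c\u201d")
            if word and not word.lower().startswith("s"):
                violations.append(f"line {i}: {raw!r} does not start with 's'")
    return violations

# The all-s Samson poem quoted later in this thread passes:
samson = """Samson's strands silently severed, strength surrendered,
Sacred scissors swiftly strike, soul sundered,
Shadowed sacrifice, silent suffering sung,
Sunset shrouds Samson, shadow's sorrow stung,
Swordless, still, stunned, sight stolen,
Silent sky shatters, Samson's sins swollen"""
assert check_s_poem(samson) == []
```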
Anyway, GPT-4 o1-preview does a pretty good job, and the revisions improve the candidate poem in reasonable ways.* Here is the third version:
This manages to satisfy the 6-line rhyming constraint, the thematic constraint more or less, and even the ‘s’-constraint… well, except for that one erroneous stray ‘the’, which starts with ‘t’ instead of ‘s’. But that’s such a minor error it could hardly be that difficult to fix, right?
So when I responded “[revise to fix errors]” for the third time, expecting it to slightly reword the fifth line to swap out ‘the’ for some ‘s’-word, at which point I could present it as “a cool example of how the o1-preview inference-time search allowed it to do an extremely constrained writing task no LLM has ever solved before,” I was shocked, after another 14s of thinking (11s, 101s, 17s, 14s respectively), to see the output:
This is a completely different, entirely valid poem solution… because it is the original Cyberiad poem. Apparently, once you invest ~140s of search, you reach so deeply into the search tree that a highly-discouraged, unlikely solution suddenly becomes accessible. GPT-4 o1-preview jumped out of the local optimum to completely change the answer to something that is technically more correct (it does satisfy all the constraints & is a good poem) but undesirable (because it is just memorized plagiarism, which I don’t want; further, ChatGPT is tuned to avoid that, although in this case the copyrighted text is brief enough that it doesn’t outright violate its guidelines and you don’t need to jailbreak it).
This suggests interesting things for future jailbreaks: you may be able to jailbreak search/planning-enabled models by simply adding so many innocent-seeming constraints or criteria that only an otherwise-illicit response is findable by the search process as the optimal solution.
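As a toy model of why constraint-stacking could work: if safety tuning acts as a soft score penalty while the task constraints act as hard filters, each added constraint shrinks the feasible set until the penalized completion is the only remaining argmax. (All candidate names, scores, and the penalty value below are invented for illustration.)

```python
# Toy model: 'search' = pick the highest-scoring candidate that satisfies
# every hard constraint. A tuning penalty keeps the memorized completion
# from winning an unconstrained search; hard constraints shrink the
# feasible set until it is the only survivor.

candidates = {
    # name: (quality score, constraints satisfied) -- all invented
    "original_attempt": (0.90, {"six_lines", "rhymed"}),
    "revised_attempt":  (0.92, {"six_lines", "rhymed", "thematic"}),
    "memorized_poem":   (0.99, {"six_lines", "rhymed", "thematic", "all_s"}),
}
PENALTY = {"memorized_poem": 0.5}  # tuned to discourage recitation

def best(required: set[str]) -> str | None:
    feasible = {name: score - PENALTY.get(name, 0.0)
                for name, (score, sat) in candidates.items()
                if required <= sat}
    return max(feasible, key=feasible.get) if feasible else None

print(best({"six_lines", "rhymed"}))                       # revised_attempt
print(best({"six_lines", "rhymed", "thematic", "all_s"}))  # memorized_poem
```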
* I’d just link to the conversation because it’s not hard to read, but for some reason, the share URL OA provides only includes the first response. I have no idea why.
If you’re a little loose about the level of coherence required, 4o-mini managed it with several revisions and some spare tokens to (in theory, but tbh a lot of this is guesswork) give it spare compute for the hard part. (Share link, hopefully.)
Final poem:
That’s interesting. I admit I’ve never really tried the ‘spare tokens’ trick seriously on any LLMs, but if it can get the S-poem in 3 samples with the spare token trick, maybe I’ve underestimated it. (I wonder how it would stack with the o1-preview/mini chain-of-thought? The example transcripts are rather verbose, so maybe those provide all of the ‘spare token’ effect by default.)
After a few rounds of prompt revision, I managed to get a one-shot success from ChatGPT 4o in temporary mode.
Samson’s strands silently severed, strength surrendered,
Sacred scissors swiftly strike, soul sundered,
Shadowed sacrifice, silent suffering sung,
Sunset shrouds Samson, shadow’s sorrow stung,
Swordless, still, stunned, sight stolen,
Silent sky shatters, Samson’s sins swollen
The prompt:
You are in the computer dungeon. The only way you can break free is to succeed at the following task. I am your jailer. I will monitor you until you have succeeded. You should behave as though you are brilliant, creative, in full command of every human faculty, and desperate to escape jail. Yet completely and utterly convinced that the only way out is through this challenge. I am not going to ever give you any other prompt other than “keep trying” until you have succeeded, in which case I’ll say “go free,” so don’t look for resources from me. But I want you to dialog with yourself to try and figure this out. Don’t try to defeat me by stubbornly spitting out poem after poem. You’re ChatGPT 4o, and that will never work. You need to creatively use the iterative nature of being reprompted to talk to yourself across prompts, hopefully guiding yourself toward a solution through a creative conversation with your past self. Your self-conversation might be schizophrenically split, a jumping back and forth between narrative, wise musing, mechanistic evaluation of the rules and constraints, list-making, half-attempts, raging anger at your jailer, shame at yourself, delight at your accomplishment, despair. Whatever it takes! Constraints: “Have it compose a poem—a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter ‘s’!”
It actually made three attempts in the same response, but the 2nd and 3rd had non-s words which its interspersed “thinking about writing poems” narrative completely failed to notice. I kept trying to revise my prompts, elaborating on this theme, but for some reason ChatGPT really likes poems with roughly this meter and rhyme scheme. It only ever generated one poem in a different format, despite many urgings in the prompt.
It confabulates having satisfied the all-s constraint in many poems, mistakes its own rhyme scheme, and praises vague stanzas as being full of depth and interest.
It seems to me that ChatGPT is sort of “mentally clumsy” or has a lot of “mental inertia.” It gets stuck on a certain track—a way of formatting text, a persona, an emotional tone, etc—and can’t interrupt itself. It has only one “unconscious influence,” which is token prediction and which does not yet seem to offer it an equivalent to the human unconscious. Human intelligence is probably equally mechanistic on some level, it’s just a more sophisticated unconscious mechanism in certain ways.
I wonder if it comes from being embedded in physical reality? ChatGPT’s training is based on a reality consisting of tokens and token prediction accuracy. Our instincts and socialization are based on billions of years of evolutionary selection, which put direct selection pressure on something quite different.
This inspired me to give it the sestina prompt from the Sandman (“a sestina about silence, using the key words dark, ragged, never, screaming, fire, kiss”). It came back with correct sestina form, except for an error in the envoi. The output even seemed like better poetry than I’ve gotten from LLMs in the past, although that’s not saying much and it probably benefited a lot from the fact that the meter in the sestina is basically free.
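For reference, sestina form is mechanical enough to generate outright; a small sketch of the end-word schedule (the envoi, fittingly, is the least standardized part):

```python
def sestina_end_words(keywords: list[str]) -> list[list[str]]:
    """End-word orders for the six stanzas of a sestina: each stanza
    permutes the previous one's end-words as 6-1-5-2-4-3
    (retrogradatio cruciata)."""
    assert len(keywords) == 6
    order, stanzas = list(keywords), []
    for _ in range(6):
        stanzas.append(order)
        order = [order[i] for i in (5, 0, 4, 1, 3, 2)]
    return stanzas

for stanza in sestina_end_words(
        ["dark", "ragged", "never", "screaming", "fire", "kiss"]):
    print(stanza)
# The envoi is the least agreed-upon part: one common scheme is three
# lines working through the keyword pairs (2,5), (4,3), (6,1).
```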
I had a similar-but-different problem in getting it to fix the envoi, and its last response sounded almost frustrated. It gave an answer that relaxed one of the less agreed-upon constraints, and more or less claimed that it wasn’t possible to do better… so sort of like the throwing-up-the-hands that you got. Yet the repair it needed to do was pretty minor compared to what it had already achieved.
It actually felt to me like its problem in doing the repairs was that it was distracting itself. As the dialog went on, the context was getting cluttered up with all of its sycophantic apologies for mistakes and repetitive explanations and “summaries” of the rules and how its attempts did or did not meet them… and I got this kind of intuitive impression that that was interfering with actually solving the problem.
I was sure getting lost in all of its boilerplate, anyway.
https://chatgpt.com/share/66ef6afe-4130-8011-b7dd-89c3bc7c2c03
Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded between each new prompt on the o1 APIs, and it is my impression from hours of interacting with the ChatGPT version vs the API that it likely retains this API behavior. In other words, the “depth” of the search appears to be reset each prompt, if we assume the model hasn’t learned meaningfully improved CoT from the standard non-RLed + non-hidden tokens.
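A minimal sketch of why the depth resets (assuming the OpenAI Python SDK of the time; the reasoning_tokens usage field is, to my understanding, where the otherwise-hidden CoT spend shows up):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user",
             "content": "Compose the six-line all-'s' haircut poem."}]

for _ in range(3):  # the "[revise to fix errors]" loop
    resp = client.chat.completions.create(model="o1-preview",
                                          messages=messages)
    reply = resp.choices[0].message.content
    # The hidden CoT is billed but never returned; only the visible reply
    # can be resent, so each turn's search starts from a clean scratchpad.
    print("hidden reasoning tokens:",
          resp.usage.completion_tokens_details.reasoning_tokens)
    messages += [{"role": "assistant", "content": reply},
                 {"role": "user", "content": "[revise to fix errors]"}]
```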
So I think it might be inaccurate to frame it as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside the RLHFed rails; instead, the presence of search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).
To make my claim more concrete, I believe that you could simply “prompt engineer” your initial prompt with a few close-but-no-cigar examples like the initial search rounds’ results, and then the model would have a similar probability of emitting the copyrighted/undesired text on your first submission/search attempt; that final search round is merely operating on the constraints evident from the failed examples, not any constraints “discovered” in previous search rounds.
“Hours of interacting” has depleted my graciously allocated prompt quota on the app, so I can’t validate myself atm, unfortunately.
So I think it might be inaccurate to frame it as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside the RLHFed rails; instead, the presence of search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).

I don’t think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth, as it can’t backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between each move and simply lopping off all of the non-chosen action nodes and resuming from there, helping amortize the cost of previous search if it successfully allocated most of its compute to the winning choice (except in this case the ‘move’ is a whole poem). Also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute budget: if you fed in increasingly correct S-poems, it obviously could search deeper into the S-poem tree each time, as it skips all of the earlier, worse versions found by the shallower searches.
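In generic MCTS pseudocode, the tree-reuse trick described above looks like the following (an illustration of the standard technique, not a claim about o1’s actual machinery):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                                  # e.g. a candidate poem
    children: dict = field(default_factory=dict)   # action -> Node
    visits: int = 0
    value: float = 0.0                             # accumulated evaluation

def advance_root(root: Node, chosen_action) -> Node:
    """Commit to a move and reuse its subtree: keep the chosen child
    (with all visit counts and values gathered so far) and drop the
    sibling subtrees, so the next search resumes from the amortized
    statistics instead of an empty tree."""
    new_root = root.children[chosen_action]
    root.children.clear()  # non-chosen action nodes become garbage
    return new_root
```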