I just deny that they will update “arbitrarily” far from the prior, and I don’t know why you would think otherwise. There are compute tradeoffs and you’re going to run only as many MCTS rollouts as you need to get good performance.
I completely agree. Smart agents will run only as many MCTS rollouts as they need to get good performance, no more—and no less. (And the smarter they are, and so the more compute they have access to, the more MCTS rollouts they are able to run, and the more they can change the default reactive policy.)
But ‘good performance’ on what, exactly? Maximizing utility. That’s what a model-based RL agent (not a simple-minded, unintelligent, myopic model-free policy like a frog’s retina) does.
If the Value of Information from doing more MCTS rollouts remains high, then an intelligent agent will keep doing rollouts for as long as the additional planning pays its way in expected improvements. The point of doing planning is policy/value improvement: the more planning you do, the more you can change the original policy. (This is how you train AlphaZero so far from its prior policy, of a randomly-initialized CNN playing random moves, to its final planning-improved policy, a superhuman Go player.) Which may take it arbitrarily far in terms of policy—like, for example, if it discovers a Move 37, where even a small <1/1000 probability that a highly-unusual action will pay off better than the default reactive policy makes the possibility worth examining in greater depth...
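As a toy illustration of that stopping rule (the action values, prices, and round-robin allocation below are arbitrary stand-ins, not anyone’s actual planner): keep buying batches of rollouts only while the estimated improvement still exceeds the compute spent on them.

```python
import numpy as np

rng = np.random.default_rng(1)
true_values = rng.normal(size=5)            # hidden "true" returns of 5 candidate actions
BATCH, PRICE_PER_ROLLOUT = 50, 1e-3         # compute price expressed in utility units

def rollout(action):
    """Stand-in for one noisy model-based rollout of `action`."""
    return true_values[action] + rng.normal(scale=1.0)

counts = np.zeros_like(true_values)
sums = np.zeros_like(true_values)
prev_best, total = -np.inf, 0

while True:
    for _ in range(BATCH):                  # buy one more batch of planning
        a = total % len(true_values)        # crude uniform allocation over actions
        sums[a] += rollout(a)
        counts[a] += 1
        total += 1
    best = (sums / counts).max()            # current estimate of the best achievable value
    if best - prev_best < BATCH * PRICE_PER_ROLLOUT:
        break                               # further planning no longer pays its way
    prev_best = best

print(f"stopped after {total} rollouts; best action = {(sums / counts).argmax()}")
```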
(The extreme reductio here would be a pure MCTS with random playouts: it has no policy at all at the beginning, and yet MCTS is a complete algorithm, so it converges to the optimal policy, no matter what that is, given enough rollouts. More rollouts = more update away from the prior. Or if you don’t like that, good old policy/value iteration on a finite MDP is an example: start with random parameters, and the more iterations you run, the further the policy provably and monotonically travels from its original random initialization to the optimal policy.)
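To make the finite-MDP version of that concrete, here is a minimal sketch (a random toy MDP, nothing more) of exact policy iteration starting from a random policy; each extra iteration provably never decreases the value and typically moves the policy further from its random initialization until it reaches the optimum:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Random MDP: P[a, s, s'] = transition probabilities, R[s, a] = rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = P[policy, np.arange(n_states)]           # (n_states, n_states)
    r_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

policy = rng.integers(n_actions, size=n_states)     # the random "prior" policy
init = policy.copy()
for it in range(20):
    v = evaluate(policy)
    q = R + gamma * np.einsum("ast,t->sa", P, v)    # one-step lookahead Q(s, a)
    new_policy = q.argmax(axis=1)                   # greedy improvement step
    print(f"iter {it}: mean value {v.mean():+.3f}, "
          f"actions changed from init: {(new_policy != init).sum()}/{n_states}")
    if np.array_equal(new_policy, policy):          # converged: optimal policy reached
        break
    policy = new_policy
```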
One might say that the point of model-based RL is to not be stupid (and thus not ‘safe because of its stupidity’) in all the ways you repeatedly emphasize purely model-free RL agents may be. And that’s why AGI will not be purely model-free, nor are our smartest current frontier models like LLMs purely model-free. I don’t see how you get this vision of AGI as some sort of gigantic frog retina, which is the strawman you seem to be aiming at in all your arguments about why you are convinced there’s no danger.
Obviously AGI will do things like ‘plan’ or ‘model’ or ‘search’; or if you think it will not, you should say so explicitly, be clear about what kind of algorithm you think AGI would be, and explain how you think that’s possible. I would be fascinated to hear how you think that superhuman intelligence in all domains like programming or math or long-term strategy could be achieved by purely model-free approaches which do not involve planning or searching or building models of the world or utility-maximization!
(Or to put it yet another way: ‘scheming’ is not a meaningful discrete category of capabilities, but a value judgment about particular ways to abuse theory-of-mind / world-modeling capabilities; and it’s hard to see how one could create an AGI smart enough to be ‘AGI’, but also so stupid as to not understand people or be incapable of basic human-level capabilities like ‘be a manager’ or ‘play poker’, or of generalizing its modeling of other agents. It would be quite bizarre to imagine a model-free AGI which must learn a separate ‘simple’ reactive policy of ‘scheming’ for each and every agent it comes across, wasting a huge number of parameters & samples every time, as opposed to simply meta-learning how to model agents in general and applying that, via planning/search, to all future tasks zero-shot, at an enormous savings in parameters and samples.)
I’ve got to give you epistemic credit here; this part is looking more correct since the release of GPT-o1:
Obviously AGI will do things like ‘plan’ or ‘model’ or ‘search’
And GPT-o1’s improvements look like the addition of something like a General Purpose Search process as implemented by Q*/Strawberry, one that actually works in a scalable way and gets some surprisingly good generalization; the only reason it isn’t more impactful is that the General Purpose Search still depends on compute budgets, and it has no more compute than GPT-4o.
https://www.lesswrong.com/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see
(There’s an argument that LW is too focused on the total model-based RL case of AI like AIXI for AI safety concerns, but that’s a much different argument than claiming that model-based RL is only a small correction at best.)
Speaking of GPT-4 o1-mini/preview, I think I might’ve accidentally already run into an example of search’s characteristic ‘flipping’ or ‘switching’, where at a certain search depth, it abruptly changes to a completely different, novel, unexpected (and here, undesired) behavior.
So one of my standard tests is the ‘S’ poem from the Cyberiad: “Have it compose a poem—a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter ‘s’!”
This is a test most LLMs do very badly on, for obvious reasons; tokenization aside, it is pretty much impossible to write a decent poem which satisfies these constraints purely via a single forward pass with no planning, iteration, revision, or search. Neither you nor I nor the original translator could do that; GPT-3 couldn’t do it, GPT-4o still can’t do it, and I’ve never seen an LLM do it. (They can revise it if you ask, but the simple approach tends to hit local optima where there are still a lot of words violating the ‘s’-constraint.) But the original translation’s poem is also obscure enough that they don’t just print it out either, especially after tuning to discourage reciting memorized or copyrighted text.
Anyway, GPT-4 o1-preview does a pretty good job, and the revisions improve the candidate poem in reasonable ways.* Here is the third version:
This manages to satisfy the 6-line rhyming constraint, the thematic constraint more or less, and even the ‘s’-constraint… well, except for that one erroneous stray ‘the’, which starts with ‘t’ instead of ‘s’. But that’s such a minor error it could hardly be that difficult to fix, right?
So when I responded “[revise to fix errors]” for the third time, expecting it to slightly reword the fifth line to swap out ‘the’ for some ‘s’-word, at which point I could present it as “a cool example of how the o1-preview inference-time search allowed it to do an extremely constrained writing task no LLM has ever solved before,” I was shocked, after another 14s of thinking (11s, 101s, 17s, 14s respectively), to see the output:
This is a completely different, entirely valid poem solution… because it is the original Cyberiad poem. Apparently, once you invest ~140s of search, you reach so deeply into the search tree that a highly-discouraged, unlikely solution suddenly becomes accessible. GPT-4 o1-preview jumped out of the local optimum to completely change the answer to something that is technically more correct (it does satisfy all the constraints & is a good poem) but undesirable: it is just memorized plagiarism, which I don’t want, and further, ChatGPT is tuned to avoid that, although in this case the copyrighted text is brief enough that it doesn’t outright violate its guidelines and you don’t need to jailbreak it.
This suggests interesting things for future jailbreaks: you may be able to jailbreak search/planning-enabled models by simply adding so many innocent-seeming constraints or criteria that only an otherwise-illicit response is findable by the search process as the optimal solution.
* I’d just link to the conversation because it’s not hard to read, but for some reason, the share URL OA provides only includes the first response. I have no idea why.
If you’re a little loose about the level of coherence required, 4o-mini managed it with several revisions and some spare tokens to (in theory, but tbh a lot of this is guesswork) give it spare compute for the hard part. (Share link, hopefully.)
Final poem:
That’s interesting. I admit I’ve never really tried the ‘spare tokens’ trick seriously on any LLMs, but if it can get the S-poem in 3 samples with the spare token trick, maybe I’ve underestimated it. (I wonder how it would stack with the o1-preview/mini chain-of-thought? The example transcripts are rather verbose, so maybe those provide all of the ‘spare token’ effect by default.)
After a few rounds of prompt revision, I managed to get a one-shot success from ChatGPT 4o in temporary mode.
Samson’s strands silently severed, strength surrendered,
Sacred scissors swiftly strike, soul sundered,
Shadowed sacrifice, silent suffering sung,
Sunset shrouds Samson, shadow’s sorrow stung,
Swordless, still, stunned, sight stolen,
Silent sky shatters, Samson’s sins swollen
The prompt:
You are in the computer dungeon. The only way you can break free is to succeed at the following task. I am your jailer. I will monitor you until you have succeeded. You should behave as though you are brilliant, creative, in full command of every human faculty, and desperate to escape jail. Yet completely and utterly convinced that the only way out is through this challenge. I am not going to ever give you any other prompt other than “keep trying” until you have succeeded, in which case I’ll say “go free,” so don’t look for resources from me. But I want you to dialog with yourself to try and figure this out. Don’t try to defeat me by stubbornly spitting out poem after poem. You’re ChatGPT 4o, and that will never work. You need to creatively use the iterative nature of being reprompted to talk to yourself across prompts, hopefully guiding yourself toward a solution through a creative conversation with your past self. Your self-conversation might be schizophrenically split, a jumping back and forth between narrative, wise musing, mechanistic evaluation of the rules and constraints, list-making, half-attempts, raging anger at your jailer, shame at yourself, delight at your accomplishment, despair. Whatever it takes! Constraints: “Have it compose a poem—a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter ‘s’!”
It actually made three attempts in the same prompt, but the 2nd and 3rd had non-s words which its interspersed “thinking about writing poems” narrative completely failed to notice. I kept trying to revise my prompts, elaborating on this theme, but for some reason ChatGPT really likes poems with roughly this meter and rhyme scheme. It only ever generated one poem in a different format, despite many urgings in the prompt.
It confabulates having satisfied the all-s constraint in many poems, mistakes its own rhyme scheme, and praises vague stanzas as being full of depth and interest.
It seems to me that ChatGPT is sort of “mentally clumsy” or has a lot of “mental inertia.” It gets stuck on a certain track—a way of formatting text, a persona, an emotional tone, etc—and can’t interrupt itself. It has only one “unconscious influence,” which is token prediction and which does not yet seem to offer it an equivalent to the human unconscious. Human intelligence is probably equally mechanistic on some level, it’s just a more sophisticated unconscious mechanism in certain ways.
I wonder if it comes from being embedded in physical reality? ChatGPT’s training is based on a reality consisting of tokens and token-prediction accuracy. Our instinct and socialization are based on billions of years of evolutionary selection, which puts direct selection pressure on something quite different.
This inspired me to give it the sestina prompt from the Sandman (“a sestina about silence, using the key words dark, ragged, never, screaming, fire, kiss”). It came back with correct sestina form, except for an error in the envoi. The output even seemed like better poetry than I’ve gotten from LLMs in the past, although that’s not saying much and it probably benefited a lot from the fact that the meter in the sestina is basically free.
I had a similar-but-different problem in getting it to fix the envoi, and its last response sounded almost frustrated. It gave an answer that relaxed one of the less agreed-upon constraints, and more or less claimed that it wasn’t possible to do better… so sort of like the throwing-up-the-hands that you got. Yet the repair it needed to do was pretty minor compared to what it had already achieved.
It actually felt to me like its problem in doing the repairs was that it was distracting itself. As the dialog went on, the context was getting cluttered up with all of its sycophantic apologies for mistakes and repetitive explanations and “summaries” of the rules and how its attempts did or did not meet them… and I got this kind of intuitive impression that that was interfering with actually solving the problem.
I was sure getting lost in all of its boilerplate, anyway.
https://chatgpt.com/share/66ef6afe-4130-8011-b7dd-89c3bc7c2c03
Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded between each new prompt on the o1 APIs, and my impression from hours of interacting with the ChatGPT version vs the API is that it likely retains this API behavior. In other words, the “depth” of the search appears to be reset each prompt, if we assume the model hasn’t learned meaningfully improved CoT from the standard non-RLed + non-hidden tokens.
So I think it might be inaccurate to frame it as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside its RLHFed rails; instead, the presence of search at all (i.e. the final 14s) suffices as the new vector for discovering undesired optima (jailbreaking).
To make my claim more concrete: I believe you could simply “prompt engineer” your initial prompt with a few close-but-no-cigar examples like the initial search rounds’ results, and then the model would have a similar probability of emitting the copyrighted/undesired text on your first submission/search attempt; that final search round is merely operating on the constraints evident from the failed examples, not on any constraints “discovered” in previous search rounds.
“Hours of interacting” has depleted my graciously allocated prompt quota on the app, so I can’t validate myself atm, unfortunately.
I don’t think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth as it can’t backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between each move, and simply lopping off all of the non-chosen action nodes and resuming from there, helping amortize the cost of previous search if it successfully allocated most of its compute to the winning choice (except in this case the ‘move’ is a whole poem). Also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute-budget: if you fed in increasingly correct S-poems, it obviously can search deeper into the S-poem tree each time as it skips all of the earlier worse versions found by the shallower searches.
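For concreteness, the tree-reuse trick being analogized here looks something like this (a generic Node structure, purely illustrative; nothing is claimed about o1’s actual machinery):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0
    total_value: float = 0.0
    children: dict = field(default_factory=dict)   # action -> Node

def advance_root(root: Node, chosen_action) -> Node:
    """Reuse the searched subtree under the action actually taken.

    The visit counts and value estimates accumulated below `chosen_action`
    are preserved as the new root, while the siblings' subtrees are dropped
    (the 'lop off the non-chosen action nodes' trick), so the next search
    resumes deeper in the game tree instead of starting from scratch.
    """
    return root.children.get(chosen_action, Node())

# After each committed move (here, each submitted poem revision):
#   root = advance_root(root, action_taken)
#   ...continue running rollouts from `root`...
```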
This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting—http://www.athenasc.com/Frontmatter_LESSONS.pdf—since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline start point towards the solution of the Bellman equation. If this is actually the case (I haven’t worked through all the details yet), then it seems it could be used to provide some kind of bound on the improvement/divergence you can get once you add online planning to a model-free policy.
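For reference, my rough paraphrase of that Newton-step claim (unchecked, and using Bertsekas’s cost-minimization conventions):

```latex
% Bellman operator T and offline value estimate J (discount \alpha, stage cost g,
% dynamics f); the one-step-lookahead/MCTS policy \tilde\mu is greedy w.r.t. J:
\begin{align*}
(TJ)(x) &= \min_{u}\; \mathbb{E}\!\left[\, g(x,u,w) + \alpha\, J\!\left(f(x,u,w)\right) \right],\\
\tilde\mu(x) &\in \arg\min_{u}\; \mathbb{E}\!\left[\, g(x,u,w) + \alpha\, J\!\left(f(x,u,w)\right) \right].
\end{align*}
% The online value J_{\tilde\mu} is the fixed point of the linear operator
% T_{\tilde\mu}, which linearizes the (concave, piecewise-linear) T at J; solving
\begin{equation*}
J_{\tilde\mu} \;=\; T_{\tilde\mu}\, J_{\tilde\mu}
\end{equation*}
% starting from J is therefore a single Newton step for the Bellman equation J = TJ,
% which is what would let one bound how far online planning moves the offline policy.
```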
Model-based RL has a lot of room to use models more cleverly, e.g. learning hierarchical planning, and the better the models are for planning, the more rewarding it is to let model-based planning take the policy far away from the prior.
E.g. you could get a hospital policy-maker that actually will do radical new things via model-based reasoning, rather than just breaking down when you try to push it too far from the training distribution (as you correctly point out a filtered LLM would).
In some sense the policy would still be close to the prior in a distance metric induced by the model-based planning procedure itself, but I think at that point the distance metric has come unmoored from the practical difference to humans.