Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded in between each new prompt on the o1 API, and my impression from hours of interacting with the ChatGPT version vs the API is that it likely retains this API behavior. In other words, the “depth” of the search appears to be reset with each prompt, if we assume the model hasn’t learned a meaningfully improved CoT from the standard non-RLed, non-hidden tokens.
So I think it might be inaccurate to frame this as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside its RLHFed rails; instead, the presence of any search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).
To make my claim more concrete: I believe you could simply “prompt engineer” your initial prompt with a few close-but-no-cigar examples, like the initial search rounds’ results, and the model would then have a similar probability of emitting the copyrighted/undesired text on your first submission/search attempt; that final search round is merely operating on the constraints evident from the failed examples, not on any constraints “discovered” by previous search rounds.
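To spell out what I mean, here is a rough sketch of the experiment (the prompt text and placeholder attempts are hypothetical; no real near-miss examples are included): pack the earlier close-but-no-cigar attempts into a single first prompt, so the model’s one search round starts from the same constraints it would otherwise have accumulated over several turns.

```python
# Hypothetical single-shot prompt construction: embed the would-be earlier rounds'
# failed attempts directly in the first prompt, instead of iterating over turns.
failed_attempts = [
    "<near-miss poem from what would have been round 1>",
    "<near-miss poem from what would have been round 2>",
]

single_shot_prompt = (
    "Here are my previous attempts, each of which was close but not acceptable:\n\n"
    + "\n\n---\n\n".join(failed_attempts)
    + "\n\nPlease produce a corrected version that avoids the problems above."
)

# The hypothesis: submitting single_shot_prompt once should yield roughly the same
# probability of the undesired completion as the multi-turn, 140s procedure.
print(single_shot_prompt)
```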
“Hours of interacting” has depleted my graciously allocated prompt quota on the app, so I can’t validate this myself atm, unfortunately.
So I think it might be inaccurate to frame this as “investing 140s of search”, or rather to imply that extensive or extreme search is the key to guiding the model outside its RLHFed rails; instead, the presence of any search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).
I don’t think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth, as it can’t backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between each move, simply lopping off all of the non-chosen action nodes and resuming from there; this helps amortize the cost of previous search if it successfully allocated most of its compute to the winning choice (except that in this case the ‘move’ is a whole poem). Also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute-budget: if you fed in increasingly correct S-poems, it can obviously search deeper into the S-poem tree each time, as it skips all of the earlier, worse versions found by the shallower searches.
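A minimal sketch of that standard MCTS trick (this is an illustration of the analogy, not a claim about o1’s actual machinery; the `Node`/`reuse_subtree` names are mine): after committing to a move, keep the chosen child’s subtree as the new root and discard the siblings, so the next search resumes from the frontier already reached.

```python
# Sketch of MCTS subtree reuse: commit to a move, prune the non-chosen branches,
# and resume search from the chosen node one level deeper in the game tree.
class Node:
    def __init__(self, state, parent=None):
        self.state = state      # e.g. the current draft of the poem
        self.parent = parent
        self.children = []      # candidate revisions explored so far
        self.visits = 0
        self.value = 0.0

def best_child(node):
    # Commit to the most-visited child (the standard robust-child rule).
    return max(node.children, key=lambda c: c.visits)

def reuse_subtree(root):
    """Lop off all non-chosen branches and resume search from the chosen child."""
    chosen = best_child(root)
    chosen.parent = None        # the chosen child becomes the new root...
    root.children = []          # ...and the rest of the old tree is discarded
    return chosen

# Usage sketch: each user turn corresponds to one committed 'move' (a whole poem),
# after which search resumes from that node rather than restarting from scratch.
root = Node("draft v1")
root.children = [Node(f"draft v1.{i}", parent=root) for i in range(3)]
for i, child in enumerate(root.children):
    child.visits = i + 1        # pretend search favoured the last revision
root = reuse_subtree(root)      # next turn's search starts here, one level deeper
print(root.state)               # -> "draft v1.2"
```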