gwern comments on Counting arguments provide no evidence for AI doom

gwern 21 Sep 2024 2:31 UTC
5 points

So I think it might be inaccurate to consider it as “investing 140s of search”, or rather the implication that extensive or extreme search is the key to guiding the model outside RLHFed rails, but instead that the presence of search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).

I don’t think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth, as it can’t backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between moves: simply lop off all of the non-chosen action nodes and resume from there, which helps amortize the cost of the previous search if it successfully allocated most of its compute to the winning choice (except in this case the ‘move’ is a whole poem). It is also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute budget: if you fed in increasingly correct S-poems, it obviously could search deeper into the S-poem tree each time, as it skips all of the earlier, worse versions found by the shallower searches.
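
To make the warm-start point concrete, here is a minimal toy sketch, not a model of what the actual system does. It assumes a made-up stand-in for the S-poem task (a list of single letters, scored by how many are "s") and a plain hill-climber with a fixed proposal budget per turn. The names `score`, `search`, and `run_turns` are hypothetical, as are the default budget of 14 and the length of 40. With `warm_start=True`, each turn resumes from the previous turn's output, in the spirit of MCTS tree reuse or initializing an MCMC run from the last run's final values; with `warm_start=False`, every turn restarts from the original draft.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def score(words):
    """Toy objective (an assumption): count of 'words' equal to 's'."""
    return sum(1 for w in words if w == "s")


def search(start, budget, rng):
    """Hill-climb for exactly `budget` proposals, starting from `start`.

    A proposal rewrites one random position; it is kept only if the
    score does not drop, so the result never scores below `start`.
    """
    best = list(start)
    for _ in range(budget):
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] = rng.choice(ALPHABET)
        if score(candidate) >= score(best):
            best = candidate
    return best


def run_turns(n_turns, budget=14, length=40, seed=0, warm_start=True):
    """Run several fixed-budget searches and return the score after each.

    warm_start=True  -> each turn starts from the previous turn's result.
    warm_start=False -> each turn starts over from the initial draft.
    """
    rng = random.Random(seed)
    initial = [rng.choice(ALPHABET) for _ in range(length)]
    current = initial
    scores = []
    for _ in range(n_turns):
        start = current if warm_start else initial
        current = search(start, budget, rng)
        scores.append(score(current))
    return scores


if __name__ == "__main__":
    print("warm:", run_turns(10, warm_start=True))
    print("cold:", run_turns(10, warm_start=False))
```

In the warm-started run, the per-turn budget never changes, but the scores can only stay level or rise from turn to turn, because each search begins where the last one stopped. In the cold-started run, each turn spends the same budget re-deriving improvements from scratch, so one would not expect it to accumulate progress in the same way.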