With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to “RLHF” out?):
I think we have no reason to believe that these post-training methods (be they finetuning, RLHF, RLAIF, etc.) modify the “deep cognition” present in the network, rather than updating shallower things like “a higher prior on this text being friendly” or whatnot.
I think the important points are:
These techniques supervise only the text output. There is no direct contact with the thought process that produced that output (see the sketch below).
They make incremental, local tweaks to the weights in whatever direction makes the desired text more likely.
Gradient descent tends to find the smallest change to the weights that yields the desired output.
Evidence in favor of this is how hard it is to eliminate “jailbreaking” with these methods. Each successful jailbreak demonstrates that much of the supposedly removed algorithms/content is still in there, accessible to the network whenever it deems it useful to think that way.
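To make the first two points concrete, here is a minimal sketch of a single supervised fine-tuning step, assuming a PyTorch model with a HuggingFace-style causal-LM interface (the `model(...).logits` attribute and the `finetune_step` helper are illustrative assumptions, not any particular library’s API). The loss is a function of the output logits alone; nothing in it inspects the intermediate computation that produced them, and the optimizer step is just a small local nudge to the weights:

```python
import torch
import torch.nn.functional as F

def finetune_step(model, optimizer, input_ids, target_ids):
    """One gradient step toward a desired completion.

    input_ids:  prompt tokens,      shape (batch, prompt_len)
    target_ids: desired completion, shape (batch, completion_len)
    """
    # The model only ever reports next-token logits over the concatenated text.
    full = torch.cat([input_ids, target_ids], dim=1)
    logits = model(full).logits            # (batch, seq_len, vocab_size)

    # The loss is computed purely from the logits at the completion positions.
    # No term here touches or constrains the intermediate activations
    # ("thought process") that produced those logits.
    completion_logits = logits[:, input_ids.size(1) - 1 : -1, :]
    loss = F.cross_entropy(
        completion_logits.reshape(-1, completion_logits.size(-1)),
        target_ids.reshape(-1),
    )

    # A small, local step on the weights in whatever direction lowers the loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

RLHF-style objectives differ in where the target text and its weighting come from, but the supervision signal still enters only through the sampled output tokens.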
See also: LLMs Sometimes Generate Purely Negatively-Reinforced Text