This is very close to what RLHF is already doing. Also maybe see RLHF with KL penalties is Bayesian Inference.
The basic point is that an LLM finetuned with RLHF acts like an agent trained to spend an "improbability budget" (relative to the base-LLM distribution) at each step to steer the text into higher-reward trajectories.
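The "improbability budget" framing falls out of the standard KL-penalized RLHF objective (the one analyzed in the linked post); a sketch, writing $\pi_0$ for the base LLM, $\pi$ for the finetuned policy, and $\beta$ for the KL coefficient:

```latex
% KL-penalized RLHF objective over full sequences x
J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big)

% The KL term decomposes per token: the expected total "improbability spend"
\mathrm{KL}\big(\pi \,\|\, \pi_0\big)
  \;=\; \mathbb{E}_{x \sim \pi}\!\left[\sum_t \log \frac{\pi(x_t \mid x_{<t})}{\pi_0(x_t \mid x_{<t})}\right]

% Closed-form optimum: the base distribution reweighted by exponentiated reward
\pi^*(x) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big)
```

Each per-step log-ratio $\log \pi/\pi_0$ is the improbability spent at that step, and $\beta$ sets the exchange rate between reward gained and budget spent.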