Happy to see this. There’s a floating intuition in the aether that RL policy optimization methods (like RLHF) inherently lead to “agentic” cognition in a way that non-RL policy optimization methods (like supervised finetuning) do not. I don’t think that intuition is well-founded. This line of work may start giving us bits on the matter.