Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.
LLMs aren’t trained to convergence because that’s not compute-efficient, so early stopping seems like the relevant baseline. No?
everyone who reads those seems to be even more confused after reading them
I want to defend “Reward is not the optimization target” a bit, while also mourning its apparent lack of clarity. The above is an understandable impression, but I don’t think it’s accurate. For some reason, some people really get a lot out of the post; others think it’s trivial; others think it’s obviously wrong, and so on. See Rohin’s comment:
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
You write:
In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not ‘optimize the reward’?
These algorithms do optimize the reward. My post addresses the model-free policy gradient setting… [goes to check post] Oh no. I can see why my post was unclear: it never states this restriction explicitly. The original post does state that AIXI optimizes its reward, and also that:
For point 2 (reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.
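To make the distinction concrete, here is a minimal sketch (mine, not from the original post; it assumes PyTorch and all names are illustrative) of a REINFORCE-style update. Reward enters only as a scalar weight on local log-probability gradients during credit assignment; the trained policy carries no explicit representation of the reward function.

```python
# Minimal sketch, assuming PyTorch; names are illustrative, not from the post.
import torch


def reinforce_update(optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a list of (log_prob, reward) pairs.

    `log_prob` is the log-probability (a tensor with grad) of the action
    actually taken; `reward` is the scalar reward received afterwards.
    """
    # Compute the discounted return-to-go for each timestep.
    returns, g = [], 0.0
    for _, reward in reversed(trajectory):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()

    # Credit assignment: each action's log-probability is weighted by the
    # return that followed it. Reward only shapes this local update; the
    # updated policy network contains no explicit representation of the
    # reward function and never "consults" it when acting.
    loss = -sum(lp * g for (lp, _), g in zip(trajectory, returns))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Contrast this with MCTS or backwards induction, which query the reward (or a learned value) inside the search loop to select actions; those planners do optimize the reward in the direct sense you describe.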
I don’t know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you, I’m happy to answer more specific questions.