Regarding this quote: “we see that the model trained to be good at Othello seems to have a much worse world model”
What if, for LLMs trained to play games like Othello, chess, Go, etc., instead of directly training models to play the best moves, we first train them to play legal moves (as in this paper) so that they construct a good world model?
Then, once the model has a world model, we “freeze” those weights, add additional layers on top, and train just those layers to play the game well.
Wouldn’t this force the play-well model to include the good world model (a model we can probe/understand)?
Wouldn’t that also force the play-well layers of the model to learn something much easier to probe and understand?
From there, we could potentially probe the play-well layers to learn something about what the optimal strategy of the game actually is.
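Concretely, the kind of setup I have in mind would look something like this (a minimal PyTorch-style sketch of the freeze-then-train idea; the architecture, sizes, and names are hypothetical placeholders, not anything from the paper):

```python
import torch
import torch.nn as nn

D_MODEL, N_SQUARES = 256, 64  # hypothetical sizes, not taken from the paper

# Stand-in for a transformer pretrained only to predict legal moves
# (in practice you would load the trained Othello-GPT weights here instead).
world_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True),
    num_layers=8,
)

# Freeze the world-model weights so the play-well training cannot overwrite them.
for p in world_model.parameters():
    p.requires_grad = False

# New "play-well" layers stacked on top of the frozen representation;
# only these parameters get optimized to play the game well.
play_well_head = nn.Sequential(
    nn.Linear(D_MODEL, 512),
    nn.ReLU(),
    nn.Linear(512, N_SQUARES),  # one logit per board square
)
optimizer = torch.optim.Adam(play_well_head.parameters(), lr=1e-4)

def next_move_logits(move_embeddings):   # (batch, seq, D_MODEL) embedded move history
    with torch.no_grad():                # no gradients flow into the frozen backbone
        h = world_model(move_embeddings)
    return play_well_head(h[:, -1, :])   # score moves from the final position's activations
```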
That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves: if there are more efficient (but still correct) heuristics, there’s no guarantee it’ll use the expensive world model rather than just forgetting about it.
I would expect it to not work in the limit. All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing. (You don’t need to model edge-cases or weird scenarios which don’t ever come up while pursuing the optimal policy, and the optimal ‘world-model’ can be arbitrarily tinier and unfaithful to the full true world dynamics.*) Simply hardwiring a world model doesn’t change this, any more than feeding in the exact board state as an input would lead to it caring about or paying attention to the irrelevant parts of the board state. As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.
* I’m sure Nanda knows this, but for those for whom this isn’t obvious or who haven’t seen other discussions on this point (some related to the ‘simulators’ debate): a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward. For a complicated world or incomplete maximization, this may induce a very rich world-model inside the agent, but the final converged optimal agent may have an arbitrarily impoverished world model. In this case, imagine a version of Othello where, on the first turn, the agent may press a button labeled ‘win’. Obviously, the optimal agent will learn nothing at all beyond ‘push the button on the first move’ and won’t learn any world-model of Othello at all! No matter how rich and fascinating the rest of the game may be, the optimal agent neither knows nor cares.
All the models must converge on the same optimal solution for a deterministic perfect-information game like Othello and become value-equivalent, ignoring the full board state which is irrelevant to reward-maximizing.
Strong claim! I’m skeptical (EDIT: if you mean “in the limit” to apply to practically relevant systems we build in the future; if so,) do you have a citation for DRL convergence results at this level of expressivity, and reasoning for why realistic early stopping in practice doesn’t matter? (Also, of course, even a single optimal policy can be represented by multiple different network parameterizations which induce the same semantics, with e.g. some using the world model and some using heuristics.)
I think the more relevant question is “given a frozen initial network, what are the circuit-level inductive biases of the training process?”. I doubt one can answer this via appeals to RL convergence results.
(I skimmed through the value equivalence paper, but LMK if my points are addressed therein.)
a DRL agent only wants to maximize reward, and only wants to model the world to the extent that maximizes reward.
As a side note, I think this “agent only wants to maximize reward” language is unproductive (see “Reward is not the optimization target”, and “Think carefully before calling RL policies ‘agents’”). In this case, I suspect that your language implicitly equivocates between “agent” denoting “the RL learning process” and “the trained policy network”:
As far as the RL agent is concerned, knowledge of irrelevant board state is a wasteful bug to be worked around or eliminated, no matter where this knowledge comes from or is injected.
if you mean “in the limit” to apply to practically relevant systems we build in the future.
Outside of simple problems like Othello, I expect most DRL agents will not converge fully to the peak of the ‘spinning top’, and so will retain traces of their informative priors like world-models.
For example, if you plug GPT-5 into a robot, I doubt it would ever be trained to the point of discarding most of its non-value-relevant world-model—the model is too high-capacity for major forgetting, and past meta-learning incentivizes keeping capabilities around just in case.
But that’s not ‘every system we build in the future’, just a lot of them. It’s not hard to imagine realistic practical scenarios where that doesn’t hold: I would expect that any specialized model distilled from it (for cheaper, faster robotic control) would not learn, or would discard, much more of its non-value-relevant world-model than its parent, and that would have potential safety & interpretability implications. The System II distills and compiles down to a fast, efficient System I. (For example, if you were trying to do safety by dissecting its internal understanding of the world, or if you were trying to hack together a superior reward model by exploiting an internal world model, adding in safety criteria not present in the original environment/model, you might fail because the optimized distilled model doesn’t have those parts of the world model, even if the parent model did, as they were irrelevant.) Chess endgame databases are provably optimal & very superhuman, and yet there is no ‘world-model’ or human-interpretable concepts of chess anywhere to be found in them; the ‘world-model’ used to compute them, whatever that was, was discarded as unnecessary once the optimal policy was reached.
I think the more relevant question is “given a frozen initial network, what are the circuit-level inductive biases of the training process?”. I doubt one can answer this via appeals to RL convergence results.
Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping. (It’s not like stopping forgetting is hard. Of course you can stop forgetting by changing the problem to be solved, and simply making a representation of the world-state part of the reward, like including a reconstruction loss.) In this case, however, Othello is simple enough that the superior agent has already apparently discarded much of the world-model and provides a useful example of what end-to-end reward maximization really means—while reward is sufficient to learn world-models as needed, full complete world-models are neither necessary nor sufficient for rewards.
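(For concreteness, a minimal sketch of the reconstruction-loss idea mentioned above, assuming a generic PyTorch-style network; the auxiliary board-decoder head, the sizes, and the 0.1 weighting are illustrative assumptions, not from any particular existing setup:)

```python
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, N_SQUARES, N_STATES = 256, 64, 3    # empty / black / white; hypothetical sizes

policy_head = nn.Linear(D_MODEL, N_SQUARES)                # the usual "play well" output
board_decoder = nn.Linear(D_MODEL, N_SQUARES * N_STATES)   # auxiliary board-state reconstruction

def total_loss(hidden, move_targets, board_targets, aux_weight=0.1):
    # Usual policy objective (here, a simple cross-entropy on target moves).
    policy_loss = F.cross_entropy(policy_head(hidden), move_targets)
    # Auxiliary term: the network is also graded on reconstructing the full board,
    # so the reward-irrelevant parts of the state can no longer be discarded for free.
    board_logits = board_decoder(hidden).view(-1, N_STATES)
    recon_loss = F.cross_entropy(board_logits, board_targets.view(-1))
    return policy_loss + aux_weight * recon_loss
```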
As a side note, I think this “agent only wants to maximize reward” language is unproductive (see “Reward is not the optimization target”, and “Think carefully before calling RL policies ‘agents’”).
I’ve tried to read those before, and came away very confused about what you meant, and everyone who reads those seems to be even more confused after reading them. At best, you seem to be making a bizarre mishmash of confusing model-free algorithms, policies, and other things best not confused, and being awestruck by a triviality on the level of ‘organisms are adaptation-executers and not fitness-maximizers’; at worst, you are obviously wrong: reward is the optimization target, both for the outer loop and for the inner loop of things like model-based algorithms. (In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not ‘optimize the reward’?)
Probably not, but mostly because you phrased it as inductive biases to be washed away in the limit, or using gimmicks like early stopping.
LLMs aren’t trained to convergence because that’s not compute-efficient, so early stopping seems like the relevant baseline. No?
everyone who reads those seems to be even more confused after reading them
I want to defend “Reward is not the optimization target” a bit, while also mourning its apparent lack of clarity. The above is a valid impression, but I don’t think it’s true. For some reason, some people really get a lot out of the post; others think it’s trivial; others think it’s obviously wrong, and so on. See Rohin’s comment:
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
You write:
In what sense does, say, a tree search algorithm like MCTS or full-blown backwards induction not ‘optimize the reward’?
These algorithms do optimize the reward. My post addresses the model-free policy gradient setting… [goes to check post] Oh no. I can see why my post was unclear—it didn’t state this clearly. The original post does state that AIXI optimizes its reward, and also that:
For point 2 (reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences), the choice of RL algorithm should not matter, as long as it uses reward to compute local updates.
However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.
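For concreteness, here is a minimal REINFORCE-style sketch (a generic PyTorch illustration, not code from the post): the reward enters only as a scalar weight on the log-probabilities in the update, i.e. as a credit-assignment signal used to nudge the weights, not as a quantity the trained network itself takes as input, computes, or represents.

```python
import torch
import torch.nn as nn

# Hypothetical small policy network: 64-dim board encoding in, 60 move logits out.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 60))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, episode_return):
    """One REINFORCE update for a single episode.

    states: (T, 64) float tensor, actions: (T,) long tensor, episode_return: float.
    The return only scales the gradient of the chosen actions' log-probabilities;
    it is never an input to, or an output of, the policy network.
    """
    log_probs = torch.log_softmax(policy(states), dim=-1)           # (T, n_actions)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # (T,)
    loss = -(episode_return * chosen).sum()                         # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```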
I don’t know what other disagreements or confusions you have. In the interest of not spilling bytes by talking past you, I’m happy to answer more specific questions.