There is a model/episodes duality, and an aligned model (in whatever sense) corresponds to an aligned distribution of episodes (within its scope). Episodes are related to each other by time evolution (which corresponds to preference/values/utility when considered across all episodes in scope), induced by the model, the rules of episode construction/generation, and ways of restricting episodes to smaller/earlier/partial episodes.
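To make the pieces a bit more concrete, here is a minimal toy sketch in Python, under the simplifying assumption that an episode is just a finite token sequence and a model is anything that extends partial episodes; `Episode`, `restrict`, and `develop` are hypothetical names for illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Toy types for the model/episodes duality; all names are hypothetical
# illustrations, not an existing library.

Token = str

@dataclass(frozen=True)
class Episode:
    """A (possibly partial) episode: a finite trajectory of content."""
    tokens: Tuple[Token, ...]

    def restrict(self, length: int) -> "Episode":
        """Restrict to a smaller/earlier partial episode (here, a prefix)."""
        return Episode(self.tokens[:length])

# A model is treated as something that extends partial episodes: one step of
# the time evolution it induces on its distribution of episodes in scope.
Model = Callable[[Episode], Episode]

def develop(model: Model, episode: Episode) -> Episode:
    """Let an episode develop a tiny bit under model-induced time evolution."""
    return model(episode)
```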
The mystery of this framing is in how to relate different models (or prompt-conditioned aspects of behavior of the same model) to each other through shared episodes/features, aligning them with each other, and in what kinds of equilibria this settles into after running many IDA loops. Each loop samples episodes induced by the models within their scopes (where they would generalize, because they’ve previously learned on similar training data), lets the episodes develop a tiny bit under time evolution (the models generate more details/features), and retrains the models to anticipate the result immediately (reflecting on time evolution). The equilibrium models induce coherent time evolution (preference/decisions) within scope, and real-world observations could be the unchanging boundary conditions that ground the rest (provide the alignment target).
So in this sketch, most of the activity is in the (hypothetical) episodes that express the content/behavior of models, including specialized models that talk about specialized technical situations (features). Some of these models might be agents with recognizable preference, but a lot of them could be more like inference rules or laws of physics, giving tractable time evolution where interesting processes can live, be observed through developing episodes, and be reified in specialized models that pay attention to them. It’s only the overall equilibrium, and the choice of boundary conditions, that get to express alignment, not episodes or even individual models.
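As a way of pinning down the shape of that loop (and only its shape), here is a rough Python sketch of repeated passes around it, reusing the toy types above; `sample_in_scope` and `retrain` are hypothetical stand-ins for the genuinely hard parts, as is treating the boundary conditions as an explicit argument.

```python
# A rough sketch of the sample/develop/retrain loop, under strong simplifying
# assumptions; sample_in_scope and retrain are hypothetical placeholders.

def ida_loop(model, sample_in_scope, develop, retrain, boundary_conditions,
             rounds=10, batch_size=64):
    for _ in range(rounds):
        # 1. Sample episodes induced by the current model within its scope,
        #    consistent with the fixed real-world observations.
        episodes = [sample_in_scope(model, boundary_conditions)
                    for _ in range(batch_size)]
        # 2. Let each episode develop a tiny bit under the model-induced
        #    time evolution (the model fills in more details/features).
        developed = [develop(model, e) for e in episodes]
        # 3. Retrain the model to anticipate the developed episodes
        #    immediately (reflecting on time evolution).
        model = retrain(model, list(zip(episodes, developed)))
    # If this settles into an equilibrium, the resulting model induces
    # coherent time evolution within scope, grounded by the unchanging
    # boundary conditions.
    return model
```

Nothing here says how scopes are delimited or what `retrain` would have to preserve; the only structural point is that the observations enter as data the loop reads but never modifies.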