Ah you’re right, the paper never directly says the architecture is trained end-to-end—updated the post, thanks for the catch.
He might still mean something closer to end-to-end learning, for two reasons.
First, the cost is differentiable with respect to the world model's predictions (Figure 2), so gradients of the cost can flow back through the world model, suggesting the world model isn't trained purely with self-supervised learning.
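A minimal PyTorch sketch of what that gradient path implies, nothing here is from the paper and all shapes and module names are illustrative assumptions: a scalar cost evaluated on the predicted state can update the world model's own parameters, alongside any self-supervised prediction loss.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

world_model = nn.Sequential(  # predicts next state from (state, action)
    nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
    nn.Linear(64, state_dim),
)
cost = nn.Sequential(         # scalar "energy" of a state
    nn.Linear(state_dim, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

s = torch.randn(16, state_dim)            # batch of current states
a = torch.randn(16, action_dim)           # batch of actions
s_true_next = torch.randn(16, state_dim)  # observed next states

s_pred = world_model(torch.cat([s, a], dim=-1))

# Self-supervised piece: predict the next observation.
ssl_loss = nn.functional.mse_loss(s_pred, s_true_next)

# Cost piece: because s_pred is differentiable in the world model's
# parameters, cost(s_pred) sends gradients into world_model as well.
total = ssl_loss + cost(s_pred).mean()
total.backward()

# Nonzero gradients confirm the cost reaches the world model's weights.
print(world_model[0].weight.grad.abs().sum())
```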
Second, the configurator needs to learn to modulate the world model, the cost, and the actor; it seems unlikely that it can do this well if those modules are all swappable black boxes. So there is likely some phase of co-adaptation among the configurator, actor, cost, and world model.
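As a hypothetical illustration of one way such modulation could work (FiLM-style conditioning, which is my stand-in, not anything the paper specifies), the configurator below emits per-task scale/shift vectors consumed inside the actor. The point is that the configurator's output must line up with each module's internals, which is why the parts would co-adapt rather than remain swappable black boxes.

```python
import torch
import torch.nn as nn

state_dim, action_dim, task_dim, hidden = 8, 2, 4, 32

configurator = nn.Linear(task_dim, 2 * hidden)  # -> (scale, shift)

actor_in = nn.Linear(state_dim, hidden)
actor_out = nn.Linear(hidden, action_dim)

def actor(state, task):
    scale, shift = configurator(task).chunk(2, dim=-1)
    h = torch.tanh(actor_in(state))
    h = scale * h + shift  # configurator reshapes the actor's features
    return actor_out(h)

state = torch.randn(16, state_dim)
task = torch.randn(16, task_dim)
action = actor(state, task)

# Any loss on the action flows back into the configurator AND the actor
# jointly, which is the co-adaptation the comment points at.
action.pow(2).mean().backward()
print(configurator.weight.grad.abs().sum(), actor_in.weight.grad.abs().sum())
```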