What, in your view, is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?
One can easily construct a model with a free parameter X, together with training data such that many values of X fit the training data equally well but yield divergent predictions in situations the training data doesn't cover. For example, the model might be a physical simulation in which X tracks the state of some region that will affect the learner's environment later, but hasn't done so during training. The simplest value consistent with the data, call it x_s, could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other value x', the model doesn't care. It will nonetheless be ineffective in the future, once the value of X matters.
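To make the construction concrete, here is a minimal toy sketch (the `simulate` function, the `threshold` parameter, and the specific values are all illustrative, not drawn from any real system): a "simulation" whose output depends on the free parameter only after some time step, so training data gathered before that step cannot pin the parameter down.

```python
def simulate(t: int, x: float, threshold: int = 10) -> float:
    """Toy world model: x is the state of a region that only
    influences the observable output after `threshold` steps."""
    if t < threshold:
        return float(t)          # x has no effect during training
    return float(t) + x          # x starts to matter later on

# Training data covers only t < threshold, so it cannot constrain x:
# every choice of x matches the training observations exactly.
train_ts = range(10)
x_simple, x_true = 0.0, 5.0     # simplest value vs. the world's actual value
assert all(simulate(t, x_simple) == simulate(t, x_true) for t in train_ts)

# Off-distribution, the two hypotheses diverge: a learner that picked
# the simplest x is confidently wrong exactly when x begins to matter.
print(simulate(12, x_simple))   # 12.0
print(simulate(12, x_true))     # 17.0
```

The point of the sketch is just that nothing in the training signal distinguishes x_simple from x_true, so a simplicity prior settles the choice, and the resulting model fails precisely in the regime where the parameter has consequences.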
Nora and/or Quentin: you talk a lot about the inductive biases of neural nets ruling out scheming, but I have a vague sense that scheming ought to happen in some circumstances, perhaps rather contrived ones, though not so contrived as to amount to deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?