The question of how strongly training pressures models to minimize loss is one that I isolate and discuss explicitly in the report, in Section 1.5, “On ‘slack’ in training”—and at various points the report references how differing levels of “slack” might affect the arguments it considers. Here I was influenced in part by discussions with various people, yourself included, who seemed to disagree about how much weight to put on arguments in the vein of: “policy A would get lower loss than policy B, so we should think it more likely that SGD selects policy A than policy B.”
(And for clarity, I don’t think that arguments of this form always support expecting models to do tons of reasoning about the training set-up. For example, as the report discusses in, e.g., Section 4.4, on “speed arguments,” the amount of world-modeling/instrumental-reasoning that the model does can affect the loss it gets via, e.g., using up cognitive resources. So schemers—and also reward-on-the-episode seekers—can be at some disadvantage, in this respect, relative to models that don’t think about the training process at all.)
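As one way of making the “slack” framing concrete, here is a toy sketch in Python (my own construction, not something from the report; all names and numbers are hypothetical): with noisy gradients and a finite step budget, SGD settles some distance above the minimum achievable loss, and if the loss difference between policy A and policy B is smaller than that gap, the “A gets lower loss than B” argument exerts correspondingly little selection pressure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple quadratic loss with a known minimum (zero loss) at w = w_star.
w_star = np.array([2.0, -1.0])

def loss(w):
    return 0.5 * np.sum((w - w_star) ** 2)

def noisy_grad(w):
    # True gradient plus "minibatch" noise.
    return (w - w_star) + rng.normal(scale=0.5, size=w.shape)

# Finite training budget: SGD stops before fully converging.
w = np.zeros(2)
lr, steps = 0.05, 200
for _ in range(steps):
    w -= lr * noisy_grad(w)

slack = loss(w)  # gap between achieved loss and the attainable minimum (0 here)
print(f"achieved loss / slack: {slack:.4f}")

# Two hypothetical policies whose losses differ by less than that slack:
loss_A, loss_B = 0.010, 0.012  # hypothetical numbers, for illustration only
print(f"loss gap A vs. B: {loss_B - loss_A:.4f} "
      "(smaller than the slack, so training only weakly favors A)")
```

On this toy picture, the speed costs mentioned above are just one contribution to where each policy’s loss sits relative to the slack.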