Also, this doesn’t actually depend on the specific training procedure of autoregressive LLMs, namely backpropagation with token-by-token cross-entropy loss.
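For concreteness, the procedure being referenced is roughly the following (a minimal sketch, assuming PyTorch; `model` here is a hypothetical toy network mapping token ids to logits, not anyone's actual codebase):

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, optimizer):
    """One autoregressive update. tokens: LongTensor of shape (batch, seq_len)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position predicts the next token
    logits = model(inputs)                           # (batch, seq_len - 1, vocab)
    # Token-by-token cross-entropy: every position gets its own loss term,
    # so the gradient signal is dense and purely local to next-token accuracy.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()  # backpropagation through the whole network
    optimizer.step()
    return loss.item()
```

The point is just that the supervision is dense and local: every position is pushed toward next-token accuracy, with no term that rewards long-horizon outcomes.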
Agreed.
I do have some concerns about how far you can push the wider class of “predictors” in some directions before the process starts selecting for generalizing instrumental behaviors, but there isn’t anything fundamentally unique about autoregressive NLL-backpropped prediction.
Possibly? I can’t tell if I endorse all the possible interpretations. When I say high instrumentality, I tend to focus on two things:

1. The model is strongly incentivized to learn internally-motivated instrumental behavior (e.g. because the reward it was trained on is extremely sparse, so the model must have learned some internal structure encouraging intermediate instrumental behavior useful during training; see the sketch below).
2. Those internal motivations are less constrained and may occupy a wider space of weird options.

#2 may overlap with the kind of alienness you mean, but I’m not sure I would focus on the alienness of the world model rather than of the learned values (in the context of how I think about “high instrumentality” models, anyway).
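To make the sparsity in #1 concrete, here is a minimal REINFORCE-style sketch where the only reward arrives at the end of a long episode (`policy`, `env`, and the three-value `env.step` are hypothetical stand-ins, not any particular library’s API):

```python
import torch

def reinforce_episode(policy, env, optimizer):
    """One policy-gradient update on an episode whose only reward is terminal."""
    log_probs = []
    obs, done = env.reset(), False
    while not done:
        dist = policy(obs)                    # Categorical distribution over actions
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action)  # reward stays 0 until the final step
    # A single terminal scalar is credited to every action in the episode,
    # saying nothing about which intermediate behaviors actually mattered.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Under that kind of signal, whatever intermediate structure makes episodes succeed has to be invented internally by the policy, which is the incentive toward instrumental behavior I have in mind.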