Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function.
To me, this sounds like saying that simulator AIs are trained so as to optimise some quality of the predictions/simulacra/plans/inferences that they produce. This seems to be more or less a tautology (otherwise, these models wouldn’t be called “simulators”). Also, this actually doesn’t depend on the specific training procedure of auto-regressive LLMs, namely, backpropagation with token-by-token cross-entropy loss. For example, GFlowNets are also squarely “simulators” (or “inference machines”, as Bengio calls them), but their performance is typically evaluated (i.e., the loss is computed) on complete trajectories (which is admittedly inefficient, though there is some work on training them with “local credit” on incomplete trajectories).
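To make the contrast concrete, here is a minimal sketch (PyTorch-style; the function names and tensor shapes are illustrative, not taken from any particular codebase). Token-by-token cross-entropy yields a training signal at every position of every sequence, the “densely defined reward” of the quoted passage, whereas the standard GFlowNet trajectory-balance objective (Malkin et al., 2022) only produces a signal once a complete trajectory and its terminal reward R(x) are available.

```python
import torch
import torch.nn.functional as F

# Dense, per-token supervision (autoregressive LM): every position in
# every sequence gets an immediate evaluation against the ground truth.
def per_token_nll(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )  # a loss term exists for every token; no episode completion needed

# Trajectory-level supervision (GFlowNet trajectory balance): the loss is
# only defined once a full trajectory and its terminal reward R(x) are
# known, so credit is assigned to the trajectory as a whole.
def trajectory_balance_loss(log_Z: torch.Tensor,
                            log_pf_steps: torch.Tensor,
                            log_pb_steps: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    # log_pf_steps / log_pb_steps: (T,) log-probabilities of the forward /
    # backward transitions along one complete trajectory; log_reward: log R(x)
    return (log_Z + log_pf_steps.sum() - log_reward - log_pb_steps.sum()) ** 2
```

The “local credit” line of work mentioned above (e.g. detailed-balance and sub-trajectory-balance losses) reintroduces signals on partial trajectories, but the baseline objective sketched here still needs the terminal reward before any gradient can flow.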
I also note the conspicuous coincidence that some of the most capable architectures yet devised are low instrumentality or rely on world models which are. It would not surprise me if low instrumentality, and the training constraints/hints that it typically corresponds to, often implies relative ease of training at a particular level of capability, which seems like a pretty happy accident.
It seems to me that “high instrumentality” is an instance of what I call an agent having an alien world model. However, although “bare” LLMs (especially those with recurrence, such as RWKV-LM) seem to be able to develop powerful alien world models that could “harbour” a misaligned mesaoptimiser, even rather stupid LMCAs on top of such LLMs, such as the “exemplary actor”, seem to be immunised from this risk.
It’s important to note, however, that viewing a cognitive system as having a probabilistic world model and performing inferences with that model (i.e., predictive processing) is just one way of modelling an agent’s dynamics, and it is not exhaustive. There could be unexpected surprises/failures that evade this modelling paradigm altogether.
It seems critical to understand the degree to which outer constraints apply to inner learning, how different forms of training/architecture affect the development of instrumentality, and how much “space” is required for different levels of instrumentality to develop.
Although GFlowNets (which are a training algorithm, essentially a loss design, rather than a network architecture) apparently have less propensity for “high instrumentality” than auto-regressive LLMs under the currently dominant training paradigm, this is probably unimportant, given what I wrote above: it seems easy to “weed out” alienness at the level of the agent architecture on top of an LLM, anyway.
Also, this actually doesn’t depend on the specific training procedure of auto-regressive LLMs, namely, backpropagation with token-by-token cross-entropy loss.
Agreed.
I do have some concerns about how far you can push the wider class of “predictors” in some directions before the process starts selecting for generalizing instrumental behaviors, but there isn’t a fundamental uniqueness about autoregressive NLL-backpropped prediction.
Possibly? I can’t tell if I endorse all the possible interpretations. When I say high instrumentality, I tend to focus on two things:
1. The model is strongly incentivized to learn internally-motivated instrumental behavior (e.g. because the reward it was trained on is extremely sparse, and so the model must have learned some internal structure encouraging intermediate instrumental behavior useful during training).
2. Those internal motivations are less constrained and may occupy a wider space of weird options.
#2 may overlap with the kind of alienness you mean, but I’m not sure I would focus on the alienness of the world model instead of learned values (in the context of how I think about “high instrumentality” models, anyway).