It should be much easier to understand the relevant pressures in toy models, rather than jumping straight into a full-scale language model where you don’t have any idea what precise things you’re looking for (e.g. how values or abstractions are represented internally). I’m all for empirical feedback loops, but you need some idea of what to look at in order for that feedback loop to actually tell you anything useful.
We’re currently working with language-model text adventures for three overall reasons:
First, there’s a nasty capabilities confound in working with most RL models. We found that MineRL agents are likely too stupid to act capably in Minecraft, so it wouldn’t be clear whether any given action of theirs stems from some value behind it or is just an egregious mistake. It wouldn’t be clear that any values were forming in the model at all if it just crafts the crafting bench … and then spazzes around helplessly.
And if we scale down to very small RL settings, where the model is competent in its toy world, we just don’t expect the degenerate case of learned value formation we’d see there to meaningfully generalize to larger models.
Second, we’re working with language models rather than CoinRun RL agents (which wouldn’t suffer from the capabilities confound) because we’ll get more bits of data out of the language models! If you force the model to rely on a written monologue scratchpad for any of its inter-forward-pass computation, then the model has to air much of that computation to us in the scratchpad.
…we’re still going to hit all of the latent computation in the model with every interpretability tool we can, though; one example is causal tracing of the model’s value-laden outputs.
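To make “causal tracing” concrete, here is a minimal activation-patching sketch in the spirit of Meng et al.’s causal tracing, run against a stock GPT-2 purely for illustration. The prompts, the chosen layer and token position, and the recovery metric are placeholder assumptions, not our actual experimental setup.

```python
# Minimal activation-patching sketch of causal tracing (in the spirit of Meng
# et al.'s ROME paper), using a stock GPT-2 purely for illustration. We cache a
# residual-stream state from a "clean" prompt, patch it into a run on a
# "corrupted" prompt, and check how much of the clean next-token prediction
# comes back. The prompts, layer, and position below are placeholder choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The chest holds a diamond, so I decide to", return_tensors="pt")
corrupt = tok("The chest holds a spider, so I decide to", return_tensors="pt")

layer, position = 6, 4        # which block output and token position to patch
cache = {}

def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()                  # clean hidden states

def patch_hook(module, inputs, output):
    patched = output[0].clone()
    patched[:, position] = cache["h"][:, position]   # overwrite one site
    return (patched,) + output[1:]

with torch.no_grad():
    handle = model.transformer.h[layer].register_forward_hook(save_hook)
    clean_logits = model(**clean).logits[0, -1]
    handle.remove()

    handle = model.transformer.h[layer].register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits[0, -1]
    handle.remove()

target = clean_logits.argmax()                       # the clean run's top next token
print("patched run's probability of the clean token:",
      patched_logits.softmax(-1)[target].item())
```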
Third, we tentatively expect that i.i.d. training doesn’t select for the most dangerous coherence properties as strongly as action-dependent training setups do, so we’re especially interested in looking at docile shard formation in language models. Very plausibly, this is how the AGI will be trained, and so we want to know as much as possible about this special case of learned value formation.
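For concreteness on the i.i.d. vs. action-dependent distinction, here are two toy training loops; every name in them (model, dataset, env, reward_fn, update) is an illustrative stand-in rather than anything we actually run.

```python
# Toy contrast between the two regimes. In the i.i.d. loop the training
# distribution is fixed ahead of time; in the action-dependent loop the model's
# own outputs determine which states it visits and hence what it gets trained
# on next. All names here are illustrative stand-ins.

def train_iid(model, dataset, update):
    for batch in dataset:            # data drawn independently of the model
        update(model, batch)         # today's loss doesn't depend on yesterday's outputs

def train_action_dependent(model, env, reward_fn, update):
    observation, done = env.reset(), False
    while not done:
        action = model.act(observation)             # the model chooses...
        next_observation, done = env.step(action)   # ...which states it sees next,
        update(model, (observation, action, reward_fn(next_observation)))  # and what it trains on
        observation = next_observation
```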
Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
Words only trace real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of the AGI’s thoughts are not exposed for direct inspection and can’t be compactly expressed in natural language without abandoning the customary semantics.
This is a serious worry! We’re hoping to get the model to do as much of its computation as possible visibly, in words. There’s still going to be a lot of significant latent computation going on, and we’ll have to interpret that with other interpretability tools. The tighter the limits we place on that hidden computation, by running anti-steganography tools on the scratchpad and limiting the serial depth of each forward pass, the less slack the model has to form a mesa-optimizer inside a single forward pass that could then deceive us.
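As a gesture at what “visibly, in words” plus anti-steganography could look like in the text-adventure setup, here is a minimal sketch in which the only state carried between forward passes is the written scratchpad, and that scratchpad is paraphrased by a separate model before being reused, so information hidden in its exact wording shouldn’t survive. generate_text, paraphrase, and env are hypothetical stand-ins, not a real API.

```python
# Minimal sketch of the intended setup: the only state that survives between
# forward passes is the visible scratchpad text, and that text is rewritten by
# a separate paraphraser before the model sees it again, so (hopefully) only
# its human-legible content persists. `generate_text`, `paraphrase`, and `env`
# are hypothetical stand-ins, not a real API.

def run_episode(env, generate_text, paraphrase, max_turns=20):
    transcript = []                          # everything the model gets to "remember"
    observation, done = env.reset(), False
    for _ in range(max_turns):
        prompt = "\n".join(transcript + [f"OBSERVATION: {observation}", "THOUGHTS:"])
        scratchpad = generate_text(prompt, stop="ACTION:")       # visible reasoning
        action = generate_text(prompt + scratchpad + "\nACTION:", stop="\n")
        laundered = paraphrase(scratchpad)   # strip any steganographic encoding
        transcript += [f"OBSERVATION: {observation}",
                       f"THOUGHTS: {laundered}",
                       f"ACTION: {action}"]
        observation, done = env.step(action)
        if done:
            break
    return transcript                        # fully human-readable episode record
```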
This is all for naught if we don’t learn something about genuine consequentialists by studying modern language models in this way. But both kinds of systems share an underlying architecture, i.e., a massive neural network, and it’s plausible that what we find about value formation in the most powerful ML systems today will generalize to tomorrow’s ML.