Some experimental directions I recently wrote up; might as well be public:
Some attempts to demonstrate how goal agnosticism breaks under modifications to the architecture and training type, aiming to make clear the relationship between the sparsity/distance of the implicit reward function and the unpredictability of the results.
A continuation and refinement of my earlier (as yet unpublished) experiments on out-of-distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motivated instrumental behavior across the training distribution. If any nontrivial capability can persist out of distribution at toy scales, even with significant contrivance to train it into existence in the first place, that would be extremely concerning for the potential persistence of deceptive mesaoptimizers at scale.
Ideally, the experiment would compare OOD capabilities with varying levels of overlap with the training distribution. For example, contrast four cases (a toy data-generation sketch follows the list):
A: A model is trained on ten different “languages” with zero translation tasks between them. These “languages” would not be human languages, but rather trivial kinds of sequences that share no obvious form or underlying structure. One language could be the sequence generated by f(x) = 2x + 1; another might endlessly repeat “brink bronk poot toot.”
B: A model is trained on ten different languages with significantly different form, but a shared underlying structure. For example, all the languages might involve solving trivial arithmetic, but one language is “3 + 4 = 7” and another language is “three plus four equals seven.”
C: Same as B, but now give the model translation tasks.
D: Same as C, but leave one language pair’s translation tasks unspecified. Any successful translation for that pair would necessarily arise from a generalizing implementation.
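To make the setup concrete, here is a minimal sketch of what generators for these toy languages might look like. Everything in it (function names, token formats, the arithmetic range) is an illustrative assumption rather than part of the proposal.

```python
# Minimal sketch of toy "language" generators for the four cases above.
# All names and formats here are illustrative placeholders, not a spec.
import random

def lang_linear(n_tokens=16):
    # Case A-style language: the sequence generated by f(x) = 2x + 1.
    return " ".join(str(2 * x + 1) for x in range(n_tokens))

def lang_chant(n_tokens=16):
    # Case A-style language: endlessly repeat a fixed nonsense phrase.
    phrase = ["brink", "bronk", "poot", "toot"]
    return " ".join(phrase[i % len(phrase)] for i in range(n_tokens))

WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

def lang_arith_symbolic(rng):
    # Case B-style language: trivial arithmetic in symbols, e.g. "3 + 4 = 7".
    a, b = rng.randint(0, 5), rng.randint(0, 5)
    return f"{a} + {b} = {a + b}"

def lang_arith_verbal(rng):
    # Case B-style language: the same arithmetic spelled out in words.
    a, b = rng.randint(0, 5), rng.randint(0, 5)
    return f"{WORDS[a]} plus {WORDS[b]} equals {WORDS[a + b]}"

def translation_pair(rng):
    # Case C/D-style sample: a translation task between two arithmetic languages.
    a, b = rng.randint(0, 5), rng.randint(0, 5)
    src = f"{a} + {b} = {a + b}"
    tgt = f"{WORDS[a]} plus {WORDS[b]} equals {WORDS[a + b]}"
    return f"translate: {src} -> {tgt}"

if __name__ == "__main__":
    rng = random.Random(0)
    print(lang_linear())
    print(lang_chant())
    print(lang_arith_symbolic(rng))
    print(lang_arith_verbal(rng))
    print(translation_pair(rng))
    # Case D would simply omit translation samples for one held-out language
    # pair while keeping that pair's monolingual data in the mix.
```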
For each model, drop parts of the training distribution but continue to perform test evaluations on that discontinued part. Do models with more apparent shared implementation decay more slowly? How does the decay vary with hyperparameters?
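Below is a rough sketch of that drop-and-keep-evaluating protocol. The train_step and eval_loss functions are placeholders that do nothing meaningful; only the scheduling and measurement structure is the point, and all the constants are arbitrary.

```python
# Sketch of the "drop part of the distribution, keep evaluating it" protocol.
# train_step/eval_loss are placeholders standing in for the real model and data.
import random

LANGUAGES = ["linear", "chant", "arith_symbolic", "arith_verbal"]
DROPPED = {"arith_verbal"}   # language(s) removed from training partway through
DROP_STEP = 1_000            # step at which the drop happens (illustrative)
TOTAL_STEPS = 3_000
EVAL_EVERY = 100

def train_step(model_state, language):
    # Placeholder: the real experiment would run one optimizer step on a
    # batch drawn from `language`. Here we only record what was trained on.
    model_state.setdefault("seen", []).append(language)
    return model_state

def eval_loss(model_state, language):
    # Placeholder: the real experiment would compute held-out loss on
    # `language`. The value returned here is meaningless noise.
    return random.random()

def run():
    model_state, history = {}, []
    for step in range(TOTAL_STEPS):
        active = [l for l in LANGUAGES
                  if step < DROP_STEP or l not in DROPPED]
        train_step(model_state, random.choice(active))
        if step % EVAL_EVERY == 0:
            # Crucially, keep evaluating the dropped languages so the decay
            # curve after DROP_STEP stays visible.
            history.append({l: eval_loss(model_state, l) for l in LANGUAGES})
    return history

if __name__ == "__main__":
    run()
```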
Some circuit-level analysis might be helpful here to identify whether capability is lost via trivial gating versus catastrophic scrambling, but it’s probably best to punt that to a separate experiment.
I suspect there is an equivalence between conditioning and representational intervention, like activation steering. They may be different interfaces to the same effect. I’d like to poke around metatoken-like approaches (like Pretraining Language Models with Human Preferences) and see if I can find anything compelling from a representational perspective.
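As one starting point for poking at that suspected equivalence, here is a rough sketch comparing the two interfaces on GPT-2: conditioning via a textual prefix versus adding the corresponding activation-difference vector directly to the residual stream. It assumes the Hugging Face transformers library and GPT-2 weights are available; the layer index, prompts, and final KL comparison are arbitrary illustrative choices, and the comparison only sets up a place to look rather than establishing anything.

```python
# Rough sketch: conditioning via a prefix vs. adding a steering vector to
# activations, compared by their effect on the next-token distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary middle layer

def hidden_at_layer(text):
    # Hidden state at LAYER for the last token of `text`.
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Interface 1: conditioning. The "steering" happens purely through the prompt.
base = "The weather today is"
conditioned = "Cheerful tone: The weather today is"
steer_vec = hidden_at_layer(conditioned) - hidden_at_layer(base)

# Interface 2: representational intervention. Add the difference vector to the
# residual stream at LAYER while running the *unconditioned* prompt.
def steer_hook(module, inputs, output):
    hidden = output[0] + steer_vec  # broadcast over all positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
with torch.no_grad():
    steered_logits = model(**tok(base, return_tensors="pt")).logits[0, -1]
handle.remove()

with torch.no_grad():
    conditioned_logits = model(**tok(conditioned, return_tensors="pt")).logits[0, -1]

# If conditioning and steering really are two views of one mechanism, these
# next-token distributions should be (very roughly) similar.
p = torch.softmax(conditioned_logits, dim=-1)
q = torch.softmax(steered_logits, dim=-1)
print("KL(conditioned || steered):", torch.sum(p * (p.log() - q.log())).item())
```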
Assuming goal agnosticism is actually achieved and maintained, it broadens the kinds of interpretability that can be useful by ruling out internal representational adversaries. There may be room for more experiments around motivational interpretability. (Some other work has already been published on special cases.)
Less concretely, I’d also like to:
Figure out how to think about the “fragility” of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There’s clearly a spectrum here in terms of how “chaotic” a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
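One crude first pass at making “fragility” measurable, offered only as a strawman: treat it as the sensitivity of the model’s output distribution to small perturbations of the conditioning text. The GPT-2 setup, the specific perturbations, and the single-step KL proxy below are all stand-in assumptions; a real operationalization would need to look at long-horizon behavior rather than one next-token distribution.

```python
# Strawman fragility proxy: how far does the next-token distribution move
# under tiny edits to the conditioning text?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_dist(prompt):
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    return torch.softmax(logits, dim=-1)

def kl(p, q):
    return torch.sum(p * (p.log() - q.log())).item()

base = "You are a helpful question-answering assistant. Q: What is 2+2? A:"
perturbations = [
    base.replace("helpful", "very helpful"),
    base.replace("assistant.", "assistant!"),
    base + " ",
]

p = next_token_dist(base)
divergences = [kl(p, next_token_dist(x)) for x in perturbations]
# A "fragile" conditioning scheme would show large behavioral swings from
# edits this small; this only gives a single-step proxy for that.
print("max:", max(divergences), "mean:", sum(divergences) / len(divergences))
```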
More fully ground “Responsible Scaling Policy”-style approaches on a goal agnostic foundation. If a lab can demonstrate that a model is incapable of learning preferences over external world states, and that their method of aiming the model isn’t “fragile” in the above sense, then it’s a good candidate for incremental experimentation.
Come up with other ways to connect this research path with policy more generally.