I suspect (though with high uncertainty) that some factors differentially advantage safety research, especially prosaic safety research (often in the shape of fine-tuning/steering/post-training): it often requires OOMs less compute, and more broadly it seems like it should have faster iteration cycles; 'human task time' (e.g. see the figures in https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/) is also probably shorter for a lot of prosaic safety research, which makes it likely to be automatable sooner on average. The existence of cheap, accurate proxy metrics for capabilities (and their relative scarcity for safety) might point the opposite way, though in some cases there do seem to exist good, diverse safety proxies, e.g. Eight Methods to Evaluate Robust Unlearning in LLMs. More discussion in the extra slides of this presentation.
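To make the 'cheap proxy metric' point concrete, here's a minimal, hypothetical sketch of the kind of scalar unlearning eval that makes a research loop easy to automate. The scoring function, datasets, and weighting here are all illustrative assumptions, not taken from the linked paper (which discusses much richer evaluations):

```python
# Hypothetical sketch of a cheap proxy metric for robust unlearning: one scalar
# summarizing how well an intervention removed knowledge of a "forget" set while
# preserving performance on a "retain" set. `score_fn`, the datasets, and the
# equal weighting are illustrative assumptions.
from typing import Callable, Sequence, Tuple


def unlearning_proxy_score(
    score_fn: Callable[[str, str], float],  # returns P(correct answer | prompt), in [0, 1]
    forget_set: Sequence[Tuple[str, str]],  # (prompt, answer) pairs the model should have unlearned
    retain_set: Sequence[Tuple[str, str]],  # (prompt, answer) pairs the model should still get right
) -> float:
    """Higher is better: low accuracy on forget_set, high accuracy on retain_set."""
    forget_acc = sum(score_fn(p, a) for p, a in forget_set) / len(forget_set)
    retain_acc = sum(score_fn(p, a) for p, a in retain_set) / len(retain_set)
    # Equal weighting of forgetting and retention; a real eval would combine
    # several diverse metrics rather than a single scalar.
    return 0.5 * (1.0 - forget_acc) + 0.5 * retain_acc


if __name__ == "__main__":
    # Toy stand-in for a model: pretends to have forgotten prompts tagged [forget].
    toy_score_fn = lambda prompt, answer: 0.1 if "[forget]" in prompt else 0.9
    forget = [("[forget] capital of Freedonia?", "Fredville")]
    retain = [("capital of France?", "Paris")]
    print(f"proxy score: {unlearning_proxy_score(toy_score_fn, forget, retain):.2f}")
```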
Relatedly, I expect the delay from needing to build extra infrastructure to train much larger LLMs will differentially slow capabilities progress, acting somewhat like a pause (or series of pauses). It would be nice to exploit this by enrolling more safety people to maximally elicit newly deployed capabilities, and potentially be uplifted by them, for safety research. (Anecdotally, I suspect I'm already being uplifted at least a bit as a safety researcher by using Sonnet.)
Also, I think it's much harder to pause/slow down capabilities than to accelerate safety, so more of the community's focus should go to the latter.
And for now, it's fortunate that inference scaling relies on CoT and similarly transparent intermediate outputs (differentially so, compared to model internals), which probably makes it a safer way of eliciting capabilities.
I mean, in the same way as it is a boon to AI capabilities research? How is this differentially useful?