You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, in cases where experiment and human understanding (plus the assistance of available AI tools) are enough to check the advances in question.
Interpretability research seems like a particularly strong candidate here, since many aspects of it involve relatively tight, experimentally checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in checking how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.) A toy sketch of that kind of check follows.
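The point about checkability is that, because we constructed the ground truth ourselves (we know which outputs come from a model trained to lie), the detector’s performance is an ordinary number we can measure. The sketch below is only illustrative: it uses synthetic stand-ins for model activations (the `fake_activations` helper and the “deception direction” are made up for the example), whereas a real version would extract hidden states from models actually fine-tuned to be honest or deceptive.

```python
# Toy version of the "checkable feedback loop" for a proposed lie detector:
# we control the ground truth (which runs come from a model trained to lie),
# so the detector's held-out accuracy is directly measurable.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_activations(n: int, lying: bool) -> np.ndarray:
    """Placeholder for hidden states extracted from a model's forward passes.
    Lying runs get a small mean shift along one direction, standing in for
    the hypothesis that deception leaves a decodable trace."""
    base = rng.normal(size=(n, 64))
    if lying:
        base[:, 0] += 1.5  # assumed "deception direction", purely illustrative
    return base

# Ground truth we set up deliberately: honest runs vs. runs of a model
# (here, simulated) that we trained to lie.
X = np.vstack([fake_activations(500, lying=False),
               fake_activations(500, lying=True)])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The "lie detector" here is just a linear probe; the substantive point is
# that its quality reduces to an experimentally checkable score.
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out lie-detection accuracy: {detector.score(X_test, y_test):.2f}")
```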
The apparent early success of language model agents for interpretability (e.g. MAIA, FIND) seems like a related data point.