You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, in cases where experiment and human understanding (plus the assistance of available AI tools) are enough to check the advances in question.
Interpretability research seems like a particularly strong candidate here, since many aspects of it involve relatively tight, experimentally checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in checking how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.) A toy sketch of that kind of check follows.
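The point about checkability is that, because we constructed the ground truth ourselves (we know which outputs come from a model trained to lie), the detector’s performance is an ordinary number we can measure. The sketch below is only illustrative: it uses synthetic stand-ins for model activations (the `fake_activations` helper and the “deception direction” are made up for the example), whereas a real version would extract hidden states from models actually fine-tuned to be honest or deceptive.

```python
# Toy version of the "checkable feedback loop" for a proposed lie detector:
# we control the ground truth (which runs come from a model trained to lie),
# so the detector's held-out accuracy is directly measurable.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def fake_activations(n: int, lying: bool) -> np.ndarray:
    """Placeholder for hidden states extracted from a model's forward passes.
    Lying runs get a small mean shift along one direction, standing in for
    the hypothesis that deception leaves a decodable trace."""
    base = rng.normal(size=(n, 64))
    if lying:
        base[:, 0] += 1.5  # assumed "deception direction", purely illustrative
    return base

# Ground truth we set up deliberately: honest runs vs. runs of a model
# (here, simulated) that we trained to lie.
X = np.vstack([fake_activations(500, lying=False),
               fake_activations(500, lying=True)])
y = np.array([0] * 500 + [1] * 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The "lie detector" here is just a linear probe; the substantive point is
# that its quality reduces to an experimentally checkable score.
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out lie-detection accuracy: {detector.score(X_test, y_test):.2f}")
```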
The apparent early success of language model agents for interpretability (e.g. MAIA, FIND) seems like a related data point.