Anyhow, I totally agree on the urgency, tractability, and importance of faithful CoT research. I think that if we can do enough of that research fast enough, we’ll be able to ‘hold the line’ at stage 2 for some time, possibly long enough to reach AGI.
Do you have thoughts on how much it helps that autointerp now seems to be roughly human-level on some metrics and can be applied cheaply (e.g. https://transluce.org/neuron-descriptions)? Perhaps that gives us another ‘defensive line’ even past stage 2. In the Transluce case, that line would sit at the ‘granularity’ of autointerp applied to the activations of all the MLP neurons inside an LLM.
Later edit: Depending on research progress (especially w.r.t. cost-effectiveness), other levels of ‘granularity’ might also soon become available, fully automated, for monitoring. One example is sparse (SAE) feature circuits for various dangerous/undesirable capabilities, as demonstrated in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.
I don’t have much to say about the human-level-on-some-metrics thing. evhub’s mechinterp tech tree was good for laying out the big picture, if I recall correctly. And yeah, I think interpretability is another sort of ‘saving throw’ or ‘line of defense’ that will hopefully be ready in time.