I wrote some agenda-setting and brainstorming docs (#10 and #11 in this list), which people are welcome to read and comment on if interested.
Thanks a lot for posting these!
Without knowing about these docs, I happen to have worked on some very related topics during the Astra Fellowship winter ’24 with @evhub (and also later). Most of it is still unpublished, but this is the doc (draft) of the short presentation I gave at the end; and I mention some other parts in this comment.
(I’m probably biased and partial, but) I think the rough plan laid out in #10 and #11 in the list is among the best and most tractable I’ve ever seen. I really like the ‘core system—amplified system’ framework and have had some related thoughts during Astra (comment; draft). I also think there’s been really encouraging recent progress on using trusted systems (in Redwood’s control framework terminology), often by differentially ‘turning up’ the capabilities of the amplification part of the system (vs. those of the core system), to safely push forward automated safety work on the core system; e.g. A Multimodal Automated Interpretability Agent.

And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to the (increasingly ‘interior’ parts of the) core system. [Generalized] inference scaling laws also seem pretty favorable for this kind of plan, though worrying in other ways.
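To make the ‘turn up the amplification, not the core’ dial a bit more concrete, here is a toy sketch of the separation I have in mind: the core system’s capabilities stay fixed, and the only knob we raise is inference-time compute on a weaker, trusted model that studies it. The model names, the `call_model` helper, and the best-of-n scheme are placeholders I made up for illustration; this is not how any of the linked work is actually implemented.

```python
# Toy illustration: keep the core system fixed and "turn up" only the
# amplification (inference-time compute spent by a trusted, weaker model).
# `call_model` and the model names are placeholders, not a real API.

from dataclasses import dataclass
from typing import List

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for an API call; returns a canned response."""
    return f"[{model}] response to: {prompt[:40]}..."

@dataclass(frozen=True)
class CoreSystem:
    """The (possibly untrusted) model being studied; its capability dial is never touched."""
    model: str = "core-model"  # placeholder name

    def run(self, prompt: str) -> str:
        return call_model(self.model, prompt)

@dataclass
class AmplifiedSystem:
    """A weaker, trusted model whose effective capability is scaled via best-of-n sampling."""
    model: str = "trusted-weak-model"  # placeholder name
    n_samples: int = 4                 # the only dial we turn up

    def audit(self, core: CoreSystem, question: str) -> str:
        transcript = core.run(question)
        candidates: List[str] = [
            call_model(self.model, f"Flag anything safety-relevant in:\n{transcript}")
            for _ in range(self.n_samples)
        ]
        # The trusted model also does the aggregation step.
        return call_model(
            self.model,
            "Select the most careful analysis:\n" + "\n---\n".join(candidates),
        )

# Scaling the safety work = raising n_samples, not swapping in a stronger core model.
auditor = AmplifiedSystem(n_samples=16)
print(auditor.audit(CoreSystem(), "Summarize your goals."))
```

The point of the sketch is just that ‘more capable safety work’ here means raising `n_samples` on the trusted side, never upgrading the core system itself.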
E.g. I would interpret the results from https://transluce.org/neuron-descriptions as showing that we can now get 3-minute-human-level automated interpretability on all the MLP neurons of an LLM (the ‘core system’) for about 5 cents/neuron, using sub-ASL-3 models that are very unlikely to be scheming because they’re bad at the prerequisite capabilities.
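As a purely illustrative sketch of what ‘automated interpretability on all the MLP neurons’ can look like (not the linked work’s actual pipeline): for each neuron, collect its top-activating dataset examples and ask a trusted explainer model for a one-sentence description. The helper functions, the explainer, and the Llama-3.1-8B-like shapes (32 layers × 14,336 MLP neurons per layer) are assumptions for illustration; the ~5 cents/neuron figure is just the one quoted above, used for a back-of-envelope total.

```python
# Illustrative sketch of an automated neuron-description loop, in the spirit of
# the linked work (not their actual code). The helpers below are stubs.

from typing import List

def get_top_activating_examples(layer: int, neuron: int, k: int = 20) -> List[str]:
    """Stub: would return the k dataset snippets on which this MLP neuron fires hardest."""
    return [f"<snippet {i} for L{layer}/N{neuron}>" for i in range(k)]

def query_explainer(prompt: str) -> str:
    """Stub: would call a trusted, sub-ASL-3 'explainer' model (the amplified system)."""
    return "fires on <hypothetical pattern>"

def describe_neuron(layer: int, neuron: int) -> str:
    examples = get_top_activating_examples(layer, neuron)
    prompt = (
        "Here are text snippets on which one MLP neuron activates strongly:\n"
        + "\n".join(examples)
        + "\nIn one sentence, describe what this neuron responds to."
    )
    return query_explainer(prompt)

# Back-of-envelope cost, using the ~5 cents/neuron figure quoted above and
# Llama-3.1-8B-like shapes (32 layers x 14,336 MLP neurons per layer) as an example.
n_layers, neurons_per_layer = 32, 14_336
total_neurons = n_layers * neurons_per_layer  # ~459k neurons
print(f"~${0.05 * total_neurons:,.0f} to describe all {total_neurons:,} MLP neurons")

# One neuron, for illustration:
print(describe_neuron(layer=3, neuron=1234))
```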