And I could see some kind of safety case framework where, as we gain confidence in the control/alignment of the amplified system and as the capabilities of the systems increase, we move towards increasingly automating the safety research applied to increasingly 'interior' parts of the core system.
E.g. I would interpret the results from https://transluce.org/neuron-descriptions as showing that we can now get 3-minute-human-level automated interpretability for all the MLP neurons of an LLM (the 'core system'), at about 5 cents/neuron (using sub-ASL-3 models that are very unlikely to be scheming, since they're bad at the prerequisite capabilities).
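For concreteness, here's a minimal sketch of what that style of automated neuron-description loop looks like: an explainer LLM is shown a neuron's top-activating dataset snippets and asked for a one-sentence description. All of the names here (`NeuronRecord`, `describe_neuron`, the prompt format) are hypothetical placeholders of my own, not the actual Transluce pipeline:

```python
# Hypothetical sketch of an automated neuron-description loop, in the
# general style of the Transluce result. Names, prompt format, and the
# explainer interface are all illustrative assumptions.

from dataclasses import dataclass


@dataclass
class NeuronRecord:
    layer: int
    index: int
    top_examples: list[str]  # dataset snippets where this neuron fires hardest


def build_prompt(record: NeuronRecord) -> str:
    """Ask the explainer model for a short natural-language description."""
    examples = "\n".join(f"- {ex}" for ex in record.top_examples)
    return (
        "The following text snippets strongly activate one MLP neuron "
        f"(layer {record.layer}, index {record.index}):\n{examples}\n"
        "In one sentence, describe the pattern this neuron responds to."
    )


def describe_neuron(record: NeuronRecord, explainer) -> str:
    # `explainer` is any callable mapping a prompt string to a completion;
    # with a cheap sub-ASL-3 model, each call costs on the order of cents.
    return explainer(build_prompt(record))


# Usage: loop over every MLP neuron in the core system.
# descriptions = {(r.layer, r.index): describe_neuron(r, explainer)
#                 for r in records}
```

For scale: an 8B-class model has on the order of 400k MLP neurons, so at ~5 cents/neuron full coverage is on the order of $20k, which is what makes 'describe every neuron' plausible as a routine, automated step in the safety case above.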