Thanks, this analysis makes a lot of sense to me. Some random thoughts:
The lack of modularity in ML models definitely makes it harder to make them safe. I wonder if mechanistic interpretability research could find some ways around this. For example, if you could identify modules within transformers (kinda like circuits) along with their purpose(s), you could maybe test each module as an isolated component.
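To gesture at what I mean, here's a rough sketch in Python/PyTorch. It assumes (hypothetically) that interpretability work has flagged the MLP sub-block of some layer as a "module" with a known purpose, and then exercises that module in isolation with a few property-style checks. All the names and thresholds are illustrative, not a real test suite:

```python
# Minimal sketch: treat a hypothesised sub-module of a transformer as an
# isolated component and run simple checks against it. Assumes the MLP
# sub-block of one encoder layer is the "module" of interest.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a trained transformer layer.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=256,
                                   batch_first=True)

def isolated_mlp(x: torch.Tensor) -> torch.Tensor:
    # The hypothesised "module": the layer's MLP sub-block, run on its own.
    return layer.linear2(layer.activation(layer.linear1(x)))

# Component-level checks one might run if the module's purpose were known.
x = torch.randn(8, 16, 64)            # (batch, seq, d_model) probe inputs
y = isolated_mlp(x)

assert y.shape == x.shape, "module must preserve the residual-stream shape"
assert torch.isfinite(y).all(), "module must not produce NaNs/Infs"

# Crude robustness check: small input perturbations shouldn't blow up outputs.
eps = 1e-3
y_pert = isolated_mlp(x + eps * torch.randn_like(x))
assert (y_pert - y).norm() / y.norm() < 1.0, "module is overly input-sensitive"

print("isolated-module checks passed")
```

Obviously the hard part is the step I've assumed away, i.e. finding modules that actually correspond to something meaningful.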
More generally, if we get to a point where we can (even approximately?) “modularize” an ML model, and edit these modules, we could maybe use something like Nancy Leveson’s framework to make it safe (i.e. treat safety as a control problem: safety is an emergent property of the system, and you maintain it by imposing constraints on the behavior of the lower-level components and their interactions).
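A very loose sketch of that framing, under the same assumption that we can point at an editable module: a system-level safety requirement gets pushed down onto the module as a constraint that's enforced at runtime. The "envelope" bound and the wrapped module below are placeholders, not a real safety case derived from a hazard analysis:

```python
# Loose sketch of a Leveson-style constraint pushed down onto a hypothesised
# module: the module's output is kept inside a trusted envelope.
import torch
import torch.nn as nn


class ConstrainedModule(nn.Module):
    """Wraps a hypothesised module and enforces a constraint on its output."""

    def __init__(self, module: nn.Module, max_update_norm: float):
        super().__init__()
        self.module = module
        # Bound assumed to come from a system-level hazard analysis (illustrative).
        self.max_update_norm = max_update_norm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.module(x)
        norm = y.norm(dim=-1, keepdim=True)
        # Enforce the constraint by scaling down any update that exceeds the envelope.
        scale = torch.clamp(self.max_update_norm / (norm + 1e-8), max=1.0)
        return y * scale


if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
    safe_mlp = ConstrainedModule(mlp, max_update_norm=5.0)
    out = safe_mlp(torch.randn(8, 16, 64))
    assert out.norm(dim=-1).max() <= 5.0 + 1e-4
    print("constraint enforced on module output")
```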
Of course these things (1) depend on mechanistic interpretability research being far ahead of where it is now, and (2) wouldn’t alone be enough to make safe AI, but would maybe help quite a bit.