Would it solve the problem if we could simply delete the parts of the model that humans don't understand?
I suspect something along these lines is a correct approach not just to alignment around the training distribution, but to robust reflection/extrapolation very far off-distribution. Interpretable, trained/verified-as-aligned coarse-grained descriptions of behavior should determine the new training that takes place off the initial distribution, screening off the opaque details of the old model.
Then even if something like a mesa-optimizer is always lurking below the surface of model behavior, with the potential to enact a phase change in behavior far off-distribution, it doesn't get good channels to get its bearings or to influence what comes after, because it never gets far off-distribution (the old model's behavior is only elicited within the scope where it's still aligned). Instead, behavior far off-distribution is implemented by a new model, determined by the descriptions of the old models' behavior.