Would it solve the problem if we could simply delete the parts of the model that humans don't understand?
I suspect something along these lines is a correct approach not just to alignment around the training distribution, but to robust reflection/extrapolation very far off-distribution. Interpretable, trained/verified-as-aligned coarse-grained descriptions of behavior should determine the new training that takes place off the initial distribution, screening off the opaque details of the old model.
Then even if something like a mesa-optimizer is always lurking below the surface of model behavior, with the potential to enact a phase change in behavior far off-distribution, it doesn't get good channels to get its bearings or to influence what comes after, because it never gets far off-distribution (the old model's behavior is only elicited within the scope where it's still aligned). Instead, behavior far off-distribution is implemented by a new model, determined by the descriptions of the old models' behavior.