This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heuristic arguments, which could fail even more easily.
This approach also depends on a fuzzier philosophical claim, which is more like: “if every small heuristic argument that explains the model behavior on the training set also applies to the current input, then the model doesn’t know that something weird is happening on this input.” It seems like your objection is that this is an unfalsifiable definitional move, but I disagree:
- We can search for cases where we intuitively judge that the model “knows” about a distinction between two mechanisms and yet there is no heuristic argument that distinguishes those mechanisms (even though “know” is pre-formal).
- Moreover, we can search more directly for any plausible case in which SGD produces a model that pursues a coherent and complex plan to tamper with the sensors without there being any heuristic argument that distinguishes it from the normal reason—that’s what we ultimately care about, and “know” is just an intuitive waypoint that we can skip if it introduces problematic ambiguity.
- If we actually solve all the concrete problems (like formalizing and finding heuristic arguments), then we can just look at empirical cases of backdoors, sensor tampering, or natural mechanism distinctions and empirically evaluate whether those distinctions are in fact detected by our method. That won’t imply that our method can distinguish real-world cases of sensor tampering, but it will provide much stronger empirical evidence than is available for most alignment approaches (because there is no reason for the methods to break down around human level in particular). A toy version of what such a check could look like is sketched below.
All of those things are challenging without a clear formalization of “heuristic argument,” but I still feel we can do some productive thinking about them. Moreover, this objection is more like “We’re looking at a 3-step plan where it’s hard to evaluate step 3 without knowing details about how step 1 went” rather than “This plan is unfalsifiable.”
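To make the shape of that empirical check concrete, here is a deliberately crude toy sketch in Python. It is not ARC’s proposal and it contains no heuristic arguments: the “explanations of training-set behavior” are stood in for by a Gaussian summary of internal activations, “the explanations apply to this input” is stood in for by a small Mahalanobis distance, and the backdoor is a hard-coded trigger. Every name, distribution, and threshold below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x):
    """Toy 'model'. The output is normally driven by x[0]; a planted trigger
    (x[-1] > 5) hijacks it through a different internal path, the analogue of
    a backdoor or sensor tampering. Returns (output, internal activations)."""
    acts = np.array([x[0], max(x[-1] - 5.0, 0.0), x[0] * x[-1]])
    out = float(acts[0] > 0.0 or acts[1] > 0.0)
    return out, acts

# --- Crude stand-in for "explanations of training-set behavior" -------------
# Summarize the activations seen while the model behaves normally with a
# Gaussian; say the explanation "applies" to a new input if its activations
# are unsurprising under that summary (small Mahalanobis distance). This is
# NOT a heuristic argument, just the cheapest proxy that makes the loop run.
train_x = rng.normal(size=(5000, 4))                  # trigger ~never fires here
train_acts = np.array([forward(x)[1] for x in train_x])
mu = train_acts.mean(axis=0)
cov = np.cov(train_acts, rowvar=False) + 1e-6 * np.eye(train_acts.shape[1])
cov_inv = np.linalg.inv(cov)

def anomaly_score(x):
    d = forward(x)[1] - mu
    return float(d @ cov_inv @ d)

# Calibrate so that roughly 0.1% of training inputs are (falsely) flagged.
threshold = np.quantile([anomaly_score(x) for x in train_x], 0.999)

# --- The empirical check: are mechanistically different inputs flagged? -----
clean_test = rng.normal(size=(1000, 4))
backdoored = rng.normal(size=(1000, 4))
backdoored[:, -1] = 10.0                              # plant the trigger

def flagged_fraction(xs):
    return float(np.mean([anomaly_score(x) > threshold for x in xs]))

print(f"false positives on clean inputs:   {flagged_fraction(clean_test):.3f}")
print(f"detection rate on backdoored ones: {flagged_fraction(backdoored):.3f}")
```

In the actual plan the Gaussian summary would be replaced by formalized heuristic arguments for why the model behaves as it does on the training distribution, and “applies” would be a property of those arguments rather than a distance cutoff. The only thing the sketch is meant to show is the evaluation loop from the last bullet above: calibrate on training data, then check whether inputs whose behavior is produced by a different mechanism get flagged without too many false positives.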