(Bold direct claims, not super confident—criticism welcome.)
The approach to ELK in this post is unfalsifiable.
A counterexample to the approach would need to be a test-time situation in which:
1. The predictor correctly predicts a safe-looking diamond.
2. The predictor “knows” that the diamond is unsafe.
3. The usual “explanation” (e.g., heuristic argument) for safe-looking-diamond predictions on the training data applies.
Points 2 and 3 are in direct conflict: the predictor knowing that the diamond is unsafe rules out the usual explanation for the safe-looking predictions.
So now I’m unclear what progress has been made. This looks like simply defining “the predictor knows P” as “there is a mechanistic explanation of the outputs starting from an assumption of P in the predictor’s world model”, then declaring ELK solved by noting we can search over and compare mechanistic explanations.
This approach requires solving a bunch of problems that may or may not be solvable—finding a notion of mechanistic explanation with the desired properties, evaluating whether that explanation “applies” to particular inputs, bounding the number of sub-explanations so that we can use them for anomaly detection without false positives, efficiently finding explanations for key model behaviors, and so on. Each of those steps could fail. And in practice we are pursuing a much more specific approach to formalizing mechanistic explanations as probabilistic heuristic arguments, which could fail even more easily.
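To make the shape of those steps concrete, here is a minimal toy sketch of the anomaly-detection pipeline being described: find a small set of “explanations” of the model’s behavior on the training data, check whether each one “applies” to a new input, and flag the input if none do. Everything here is a stand-in: the function names (`find_explanations`, `explanation_applies`, `is_anomalous`) are hypothetical, and cluster prototypes over activations are only a crude proxy for heuristic arguments, which have no agreed formalization yet.

```python
# Toy sketch only: "explanations" are stood in for by cluster prototypes of
# hidden activations on the training set, and an explanation "applies" to an
# input if the input's activation lies near that prototype. This is not a
# formalization of heuristic arguments, just the overall control flow.
import numpy as np
from sklearn.cluster import KMeans

def find_explanations(train_activations: np.ndarray, k: int = 8) -> np.ndarray:
    """Stand-in for 'find a small set of explanations of training behavior'."""
    return KMeans(n_clusters=k, n_init=10).fit(train_activations).cluster_centers_

def explanation_applies(prototype: np.ndarray, activation: np.ndarray,
                        radius: float) -> bool:
    """Stand-in for 'does this explanation apply to this input?'."""
    return float(np.linalg.norm(prototype - activation)) <= radius

def is_anomalous(prototypes: np.ndarray, activation: np.ndarray,
                 radius: float) -> bool:
    """Flag the input if no training-time explanation accounts for it."""
    return not any(explanation_applies(p, activation, radius) for p in prototypes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, size=(500, 16))   # activations on training data
    prototypes = find_explanations(train)
    normal_input = rng.normal(0.0, 1.0, size=16)   # looks like training
    weird_input = rng.normal(6.0, 1.0, size=16)    # far off-distribution
    print(is_anomalous(prototypes, normal_input, radius=8.0))  # likely False
    print(is_anomalous(prototypes, weird_input, radius=8.0))   # likely True
```

The real difficulty lives in the parts this sketch trivializes: what counts as an explanation, how to decide that one “applies,” and how to keep the number of sub-explanations small enough that flagging “no explanation applies” doesn’t fire constantly.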
This approach also depends on a fuzzier philosophical claim, which is more like: “if every small heuristic argument that explains the model behavior on the training set also applies to the current input, then the model doesn’t know that something weird is happening on this input.” It seems like your objection is that this is an unfalsifiable definitional move, but I disagree:
We can search for cases where we intuitively judge that the model “knows” about a distinction between two mechanisms and yet there is no heuristic argument that distinguishes those mechanisms (even though “know” is pre-formal).
Moreover, we can search more directly for any plausible case in which SGD produces a model that pursues a coherent and complex plan to tamper with the sensors without there being any heuristic argument that distinguishes it from the normal reason—that’s what we ultimately care about and “know” is just an intuitive waypoint that we can skip if it introduces problematic ambiguity.
If we actually solve all the concrete problems (like formalizing and finding heuristic arguments) then we can just look at empirical cases of backdoors, sensor tampering, or natural mechanism distinctions and empirically evaluate whether in fact those distinctions are detected by our method. That won’t imply that our method can distinguish real-world cases of sensor tampering, but it will provide much stronger empirical evidence than is available for most alignment approaches (because there is no reason for the methods to break down around human level in particular).
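If those concrete problems were solved, the empirical test described above would be straightforward to run. Here is a hedged sketch, under the assumption that we have some anomaly detector (such as the toy one earlier in this comment) and labeled clean versus tampered/backdoored inputs; the `evaluate_detector` name, the detector interface, and the synthetic data are all made up for illustration.

```python
# Hypothetical evaluation harness: measure how often a detector flags
# known-tampered inputs (detection rate) versus clean ones (false positives).
from typing import Callable, Sequence
import numpy as np

def evaluate_detector(detector: Callable[[np.ndarray], bool],
                      clean: Sequence[np.ndarray],
                      tampered: Sequence[np.ndarray]) -> dict:
    """Return detection rate on tampered inputs and false-positive rate on clean ones."""
    detection_rate = float(np.mean([detector(x) for x in tampered]))
    false_positive_rate = float(np.mean([detector(x) for x in clean]))
    return {"detection_rate": detection_rate,
            "false_positive_rate": false_positive_rate}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = [rng.normal(0.0, 1.0, size=16) for _ in range(200)]
    tampered = [rng.normal(6.0, 1.0, size=16) for _ in range(200)]
    # Trivial placeholder detector: flag inputs far from the origin.
    detector = lambda x: float(np.linalg.norm(x)) > 12.0
    print(evaluate_detector(detector, clean, tampered))
```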
All of those things are challenging without a clear formalization of “heuristic argument,” but I still feel we can do some productive thinking about them. Moreover, this objection is more like “We’re looking at a 3-step plan where it’s hard to evaluate step 3 without knowing details about how step 1 went” rather than “This plan is unfalsifiable.”