Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments
weak to strong generalization is a class of approaches to ELK which relies on generalizing a “weak” supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the “weak” supervision consists of multiple measurements which are sufficient for supervision in the absence of “tampering” (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for “different reasons” than on a trusted dataset, where “different reasons” are defined w.r.t. internal model cognition and structure (a minimal sketch of one such detector follows this list)
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we want to compare MAD methods against existing methods from standard machine learning security problems (adversarial examples and trojans) that share structure with scalable oversight problems, against other ELK approaches (e.g. CCS), and against other scalable oversight approaches (e.g. debate)
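To make the MAD framing concrete, here is a minimal sketch of one standard style of baseline, added purely for illustration (not a method from the post): fit a Gaussian to the model's hidden activations on the trusted dataset, then flag untrusted examples whose activations are unusually far from it. The shapes and names are assumptions.

```python
import numpy as np

def fit_trusted_stats(trusted_acts: np.ndarray):
    """trusted_acts: (n_trusted, d_hidden) hidden activations on the trusted dataset."""
    mean = trusted_acts.mean(axis=0)
    # regularize the covariance so it stays invertible for small trusted sets
    cov = np.cov(trusted_acts, rowvar=False) + 1e-4 * np.eye(trusted_acts.shape[1])
    return mean, np.linalg.inv(cov)

def anomaly_scores(acts: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance; higher = the internals look less like the
    trusted set, i.e. the model may be doing the task for "different reasons"."""
    centered = acts - mean
    return np.einsum("nd,de,ne->n", centered, cov_inv, centered)

# Usage: score untrusted examples and route the highest-scoring ones to
# expensive supervision / review rather than trusting the cheap signal.
```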
Nice overview, agree with most of it!
You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I’d expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are “easy” vs “hard” though.
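As a toy illustration of the two regimes and the budget framing (my own construction, not from the comment; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, budget = 10_000, 2_000
ground_truth = rng.integers(0, 2, n_examples)    # hidden from us at training time
est_difficulty = rng.random(n_examples)          # our imperfect sense of easy vs. hard

# Weak-to-strong generalization: a label on every example, but sometimes wrong.
weak_error_rate = 0.2
flip = rng.random(n_examples) < weak_error_rate
weak_labels = np.where(flip, 1 - ground_truth, ground_truth)

# Easy-to-hard generalization: correct labels, but only on the examples our
# difficulty estimate ranks easiest, up to the supervision budget.
easy_idx = np.argsort(est_difficulty)[:budget]
easy_labels = dict(zip(easy_idx.tolist(), ground_truth[easy_idx].tolist()))
```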
I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to do MTD and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don’t think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.
I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn’t make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.
Here’s a revised sketch
A few notes:
I use Scalable Oversight to refer to both Alignment and Control
I’m confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it
I don’t explicitly include easy-to-hard, I think OOD basically covers it
taxonomies and abstractions are brittle and can be counterproductive
Scalable Oversight Taxonomy
Scalable Oversight
Scalable Alignment
Benchmarks / Tasks
Sandwiching Experiments (human amateurs + model, ground truth from human experts)
Weak models supervising Strong models
Approaches
Debate
Recursive reward modeling
(Solution to Eliciting Latent Knowledge) + Narrow Elicitation
(Note: I think this assumes, more than prior scalable oversight ideas do, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
Eliciting Latent Knowledge
Approaches
Contrast Consistent Search
Confidence
Intermediate Probing
“Speed Prior”
“Simplicity Prior”
Concept Extrapolation—learn all salient generalizations, use expensive supervision to select correct one
IID Mechanistic Anomaly Detection + expensive supervision on anomalies
Subclasses
Measurement Tampering Detection
Approaches
OOD Mechanistic Anomaly Detection
In distribution
Out of Distribution (likely? requires multiple measurement structure)
Concept Extrapolation
train diverse probes on untrusted data, select the probe that predicts positive measurements least frequently (see the sketch below the taxonomy)
Narrow Elicitation
...
Scalable Control
Weak Review
Untrusted Rephrase or whatever
Coup probes
MAD (Review all anomalies)
Trojans
...
MAD (maybe?)
Adversarial Examples
...
MAD (maybe?)
Natural Mechanism Distinction
MAD
Spurious Correlate Detection / Resolution
Concept Extrapolation
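The probe-selection bullet under Measurement Tampering Detection ("train diverse probes on untrusted data, select the probe that predicts positive measurements least frequently") could be sketched roughly as below. This is my reading of that one line, not the Redwood method verbatim; the diversity penalty, shapes, and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def train_diverse_probes(acts, measurements, k=8, epochs=300, lr=1e-2, div_weight=0.1):
    """acts: (n, d) float activations on untrusted data.
    measurements: (n,) 0/1 aggregated measurement outcomes (the weak signal)."""
    n, d = acts.shape
    probes = torch.nn.Parameter(0.01 * torch.randn(k, d))
    opt = torch.optim.Adam([probes], lr=lr)
    targets = measurements.float().unsqueeze(1).expand(-1, k)
    for _ in range(epochs):
        logits = acts @ probes.T                        # (n, k), one column per probe
        fit = F.binary_cross_entropy_with_logits(logits, targets)
        # diversity penalty: discourage probes from making correlated predictions
        preds = torch.sigmoid(logits)
        centered = preds - preds.mean(dim=0)
        cov = centered.T @ centered / n                 # (k, k) prediction covariance
        div = (cov * (1 - torch.eye(k))).abs().mean()   # penalize off-diagonal terms only
        loss = fit + div_weight * div
        opt.zero_grad(); loss.backward(); opt.step()
    return probes.detach()

def select_probe(probes, untrusted_acts):
    """Pick the probe that predicts positive least often on untrusted data, on the
    theory that tampering inflates positive measurements relative to the real outcome."""
    positive_rate = (torch.sigmoid(untrusted_acts @ probes.T) > 0.5).float().mean(dim=0)
    return probes[positive_rate.argmin()]
```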