Here’s a revised sketch
A few notes:
I use Scalable Oversight to refer to both Alignment and Control
I’m unsure whether weak-to-strong learning is a restatement of scalable oversight, a restatement of ELK, or its own thing, so I leave it out
I don’t explicitly include easy-to-hard; I think OOD basically covers it
Taxonomies and abstractions are brittle and can be counterproductive
Scalable Oversight Taxonomy
Scalable Oversight
Scalable Alignment
Benchmarks / Tasks
Sandwiching Experiments (human amateurs + model, ground truth from human experts)
Weak models supervising Strong models
Approaches
Debate
Recursive reward modeling
(Solution to Eliciting Latent Knowledge) + Narrow Elicitation
(Note: I think this assumes, more than prior scalable oversight ideas do, that there will be a base model with adequate knowledge, so that the hard part is extracting the knowledge rather than teaching the model)
Eliciting Latent Knowledge
Approaches
Contrast Consistent Search
Confidence
Intermediate Probing
“Speed Prior”
“Simplicity Prior”
Concept Extrapolation: learn all salient generalizations, then use expensive supervision to select the correct one
IID Mechanistic Anomaly Detection + expensive supervision on anomalies
Subclasses
Measurement Tampering Detection
Approaches
OOD Mechanistic Anomaly Detection
In distribution
Out of Distribution (likely? requires a multiple-measurement structure)
Concept Extrapolation
Train diverse probes on untrusted data, then select the probe that predicts positive measurements least frequently
Narrow Elicitation
...
Scalable Control
Weak Review
Untrusted Rephrase or whatever
Coup probes
MAD (Review all anomalies)
Trojans
...
MAD (maybe?)
Adversarial Examples
...
MAD (maybe?)
Natural Mechanism Distinction
MAD
Spurious Correlate Detection / Resolution
Concept Extrapolation
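
To make one entry concrete: the Concept Extrapolation item under Measurement Tampering Detection says to train diverse probes on untrusted data and pick the one that predicts positive measurements least often. Below is a minimal sketch of only the selection step. It assumes the probes are linear and already trained by some diversity-encouraging procedure that I don't specify here. The function names, the toy activations, and the three hand-written probes are all hypothetical stand-ins, not anyone's actual implementation.

```python
import numpy as np


def positive_rate(probe_w, probe_b, activations):
    """Fraction of examples a linear probe labels as 'measurement positive'."""
    logits = activations @ probe_w + probe_b
    return float(np.mean(logits > 0.0))


def select_most_conservative_probe(probes, untrusted_activations):
    """Return (index, rates) for the probe with the lowest positive rate."""
    rates = [positive_rate(w, b, untrusted_activations) for w, b in probes]
    return int(np.argmin(rates)), rates


# Toy, made-up data: 4 untrusted examples with 2-dimensional activations.
acts = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, -1.0]])

# Hand-written probes standing in for a trained, diverse probe set (assumption).
probes = [
    (np.array([1.0, 0.0]), 0.0),   # fires when the first feature is positive
    (np.array([1.0, 1.0]), -1.5),  # fires only when both features are high
    (np.array([0.0, 1.0]), 0.5),   # fires when the second feature is above -0.5
]

best, rates = select_most_conservative_probe(probes, acts)
print(best, rates)
```

On this toy data the second probe (index 1) is selected, since it predicts positive on only one of the four examples. In a real setup the hard part is getting a genuinely diverse probe set, which this sketch takes as given.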