Slightly Aspirational AGI Safety research landscape
This is an overview of current subfields in empirical AI safety, combined with research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is.

Interpretability / understanding model internals
- Circuit interpretability
- Superposition study
- Activation engineering (see the sketch below this list)
- Developmental interpretability
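
To make the "Activation engineering" item a bit more concrete, here is a minimal steering-vector sketch. Everything specific in it is an assumption for illustration: the choice of GPT-2, the layer index, the scale factor, and the toy "sentiment" prompts.

```python
# A minimal steering-vector sketch (model, layer, and scale are arbitrary choices).
# The steering vector is the difference of residual-stream activations between two
# contrasting prompts; we add it back in at one layer while generating.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # which block to steer and how hard (knobs to tune)

def resid_at(prompt: str) -> torch.Tensor:
    # Residual-stream activation at the last token, after block LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1]  # hidden_states[0] is the embedding output

steer = resid_at("I love this") - resid_at("I hate this")  # toy "sentiment" direction

def hook(module, inputs, output):
    # Add the steering vector to the block's output hidden states.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer,) + output[1:]
    return output + SCALE * steer

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("The movie was", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```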

Understanding deep learning
- Scaling laws / forecasting (see the sketch below this list)
- Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
- Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
- Understanding normal but poorly understood things, like in-context learning
- Understanding weird phenomena in deep learning, like this paper
- Understanding how various HHH fine-tuning techniques work
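
For the "Scaling laws / forecasting" item, here is a minimal sketch of the core move: fit a simple power law to loss versus compute in log-log space and extrapolate. All of the numbers below are synthetic, purely for illustration.

```python
# Fit loss ~ a * C^b to (synthetic) training runs and extrapolate.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # training FLOPs (synthetic)
loss = np.array([4.29, 3.60, 3.02, 2.53, 2.13])     # eval loss (synthetic)

# Linear fit in log-log space: log(loss) = b * log(C) + log(a)
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"fit: loss ~ {a:.1f} * C^({b:.3f})")

# Extrapolate to a larger compute budget (the "forecasting" part).
print("forecast at 1e24 FLOPs:", a * 1e24 ** b)
```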

AI Control
- General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints); see the sketch below this list
- Unlearning
- Steganography prevention / CoT faithfulness
- Censorship study (how censoring AI models affects performance, and similar questions)
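
To illustrate the kind of scheme studied under "General supervision of untrusted models", here is a minimal sketch of a trusted-monitoring protocol: an untrusted model does the work, a weaker trusted model scores how suspicious each output is, and the most suspicious fraction goes to a limited human audit budget. Every function below is a made-up stand-in; only the shape of the pipeline is the point.

```python
# Toy trusted-monitoring control protocol. All three "models" are stand-ins.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str

def untrusted_solve(task: Task) -> str:
    # Stand-in for a strong but untrusted model doing the real work.
    return f"solution to: {task.prompt}"

def trusted_suspicion(task: Task, solution: str) -> float:
    # Stand-in for a weaker, trusted monitor scoring suspiciousness in [0, 1].
    return (len(solution) % 10) / 10

def human_audit(task: Task, solution: str) -> bool:
    # Stand-in for an expensive human check; returns True if the solution looks safe.
    return True

def run_protocol(tasks, audit_budget=0.02):
    solutions = [(t, untrusted_solve(t)) for t in tasks]
    # Audit only the most suspicious outputs, up to the budget.
    scored = sorted(solutions, key=lambda ts: trusted_suspicion(*ts), reverse=True)
    n_audit = max(1, int(audit_budget * len(tasks)))
    accepted = []
    for i, (task, sol) in enumerate(scored):
        if i < n_audit and not human_audit(task, sol):
            continue  # caught by the audit: discard (or escalate)
        accepted.append((task, sol))
    return accepted

tasks = [Task(f"task {i}") for i in range(100)]
print(len(run_protocol(tasks)), "solutions accepted")
```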

Model organisms of misalignment
- Demonstrations of deceptive alignment and sycophancy / reward hacking
- Trojans
- Alignment evaluations
- Capability elicitation

Scaling / scalable oversight
- RLHF / RLAIF
- Debate, market making, imitative generalization, etc.
- Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
- Weak-to-strong generalization
- General ideas: factored cognition, iterated distillation and amplification, recursive reward modeling

Robustness
- Anomaly detection
- Understanding distribution shifts and generalization
- User jailbreaking
- Adversarial attacks / training (generally), including latent adversarial training; see the sketch below this list
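
To show what the "latent" in latent adversarial training refers to, here is a toy sketch: find a worst-case perturbation of an intermediate activation with a single gradient step, then train the model to classify correctly under that perturbation. The architecture, data, and perturbation size are all invented for illustration.

```python
# Toy latent adversarial training loop on a synthetic classification task.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
head = nn.Linear(32, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
EPS = 0.5  # size of the latent perturbation (a knob)

x = torch.randn(256, 10)
y = (x.sum(dim=1) > 0).long()  # synthetic labels

for step in range(100):
    # Inner step: one gradient-sign step to find an adversarial latent perturbation.
    z = encoder(x)
    delta = torch.zeros_like(z, requires_grad=True)
    loss_fn(head(z.detach() + delta), y).backward()
    with torch.no_grad():
        delta = EPS * delta.grad.sign()
    # Outer step: train the model to still get the answer right under that perturbation.
    opt.zero_grad()
    loss = loss_fn(head(encoder(x) + delta), y)
    loss.backward()
    opt.step()

print("final robust loss:", loss.item())
```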

AI Security
- Extracting info about models or their training data
- Attacking LLM applications, self-replicating worms

Multi-agent safety
- Understanding AI in conflict situations
- Cascading failures
- Understanding optimization in multi-agent situations
- Attacks vs. defenses for various problems

Unsorted / grab bag
- Watermarking and AI generation detection
- Honesty (model says what it believes)
- Truthfulness (only say true things, aka accuracy improvement)
- Uncertainty quantification / calibration (see the sketch below this list)
- Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self-play)
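
As a small concrete example for the "Uncertainty quantification / calibration" item, here is a minimal sketch of one standard calibration metric, expected calibration error (ECE), computed on synthetic predictions from a deliberately overconfident predictor.

```python
# Expected calibration error: bin predictions by confidence, then compare the
# average confidence to the empirical accuracy in each bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Synthetic example: a predictor whose accuracy runs ~15 points below its confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf - 0.15
print("ECE:", expected_calibration_error(conf, correct))
```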

Don’t quite make the list:
- Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority.
- Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality.

I think mechanistic anomaly detection (mostly ARC, but also Redwood and some forthcoming work) is importantly different from robustness (though clearly related).