People fail to realise that leading AGI labs, such as OpenAI and Conjecture (and I would bet DeepMind and Anthropic too, even though they haven’t publicly said so), do not plan to align LLMs “once and forever”, but rather to use LLMs to produce novel alignment research, which will almost certainly come packaged with novel AI architectures (or variations on existing proposals, such as LeCun’s H-JEPA).
Among many other theories of change, MATS currently supports:
Evaluating models for dangerous capabilities, to aid safety standards, moratoriums, and alignment MVPs:
Owain Evans’ evaluations of situational awareness
Ethan Perez’s and Evan Hubinger’s demonstrations of deception and other undesirable traits
Daniel Kokotajlo’s dangerous capabilities evaluations
Dan Hendrycks’ deceptive capabilities benchmarking
Building non-agentic AI systems that can be used to extract useful alignment research:
Evan Hubinger’s conditioning predictive models
Janus’ simulators and Nicholas Kees Dupois’ cyborgism
Interpretability and model editing research that might build into better model editing, better auditing, better regularizers, better DL science, and many other things:
Neel Nanda’s and Conjecture’s mechanistic interpretability research
Alex Turner’s RL interpretability and model editing research
Collin Burns’ “gray-box ELK”
Research that aims to predict or steer architecture-agnostic or emergent systems:
Vivek Hebbar’s research on “sharp left turns,” etc.
John Wentworth’s selection theorems research
Alex Turner’s and Quintin Pope’s shard theory research
Technical alignment-adjacent governance and strategy research:
Jesse Clifton’s and Daniel Kokotajlo’s research on averting AI conflict and AI arms races
Dan Hendrycks’ research on aligning complex systems
If AGI labs truly bet on AI-assisted (or fully AI-automated) science across scientific domains (the second group in your list), then research done in the latter three groups will largely be subsumed by that AI-assisted research.
It’s still important to do some research in these areas, for two reasons:
(1) Hedging against unexpected turns of events, such as AIs failing to improve the speed and depth of scientific insight in at least some areas (governance & strategy are perhaps iffier here than pure math or science, since it is harder to be sure that AI-suggested strategies are free of ‘deep deceptiveness’-style bias).
(2) Even when AIs presumably do generate all that awesome science, humanity will still need people capable of understanding, evaluating, and finding weaknesses in it.
This, however, suggests a different focus in the latter three groups: growing excellent science evaluators rather than generators (GAN style); more Yudkowskys able to poke holes in and shoot down various plans. Less focus on producing a sheer volume of research and more focus on the ability to criticise others’ and one’s own research. There is overlap, of course, but there are also differences in how researchers should develop if we keep this in mind. Credit-assignment systems and community mechanisms for inferring authority should also recognise this focus.
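To make the “GAN style” analogy concrete, here is a minimal, standard generator/discriminator training loop (an illustrative PyTorch sketch; the architectures, dimensions, and data are arbitrary assumptions, not anyone’s actual proposal). The discriminator’s only job is to tell trusted examples from generated ones, which is the role suggested above for human researchers relative to AI-generated science.

```python
# Minimal GAN sketch: the "generator" produces candidates, the "discriminator"
# (the evaluator role) learns to distinguish them from trusted examples.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 32, 128

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

real_data = torch.randn(batch, data_dim)  # stand-in for "trusted" examples

for step in range(1000):
    # Train the evaluator: learn to tell trusted examples from generated ones.
    fake_data = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = loss_fn(discriminator(real_data), torch.ones(batch, 1)) + \
             loss_fn(discriminator(fake_data), torch.zeros(batch, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator: learn to produce outputs the evaluator accepts.
    fake_data = generator(torch.randn(batch, latent_dim))
    g_loss = loss_fn(discriminator(fake_data), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The relevant point of the analogy is that the two roles are trained differently: the evaluator only gets good by being exercised against strong generators, which is a different development path than producing research oneself.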
MATS’ framing is that we are supporting a “diverse portfolio” of research agendas that might “pay off” in different worlds (i.e., your “hedging bets” analogy is accurate). We also think the listed research agendas have some synergy you might have missed. For example, interpretability research might build into better AI-assisted white-box auditing, white/gray-box steering (e.g., via ELK), or safe architecture design (e.g., “retargeting the search”).
The distinction between “evaluator” and “generator” seems fuzzier to me than you portray. For instance, two “generator” AIs might be able to red-team each other for the purposes of evaluating an alignment strategy.
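As a toy illustration of that mutual red-teaming idea, here is a minimal sketch assuming only two generic model-querying callables (the function names, prompts, and loop structure are hypothetical, not an actual MATS or lab protocol):

```python
# Hypothetical sketch: two "generator" models alternately critique an alignment
# proposal, so each acts as an evaluator of the other's blind spots.
from typing import Callable

def cross_red_team(
    proposal: str,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    rounds: int = 2,
) -> list[str]:
    """Collect alternating critiques of `proposal` from two models."""
    critiques = []
    current = proposal
    for i in range(rounds):
        critic = model_a if i % 2 == 0 else model_b
        critique = critic(
            "Identify the strongest failure modes, hidden assumptions, and "
            f"possible deceptive reasoning in the following plan:\n\n{current}"
        )
        critiques.append(critique)
        # Feed prior objections back in so the next round targets what remains.
        current = f"{current}\n\nKnown objections so far:\n{critique}"
    return critiques
```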