Appendices to the live agendas
Lists cut from our main post, in a token gesture toward readability.
We list past reviews of alignment work, ideas which seem to be dead, the cool but neglected neuroscience / biology approach, various orgs which don’t seem to have any agenda, and a bunch of things which don’t fit elsewhere.
Appendix: Prior enumerations
Everitt et al (2018)
Ji (2023)
Soares (2023)
Gleave and MacLean (2023)
Krakovna and Shah on Deepmind (2023)
AI Plans (2023), mostly irrelevant
Macdermott (2023)
Kirchner et al (2022): unsupervised analysis
Krakovna (2022), Paradigms of AI alignment
Hubinger (2020)
Perret (2020)
Nanda (2021)
Larsen (2022)
Zoellner (2022)
McDougall (2022)
Sharkey et al (2022) on interp
Koch (2020)
Critch (2020)
Hubinger on types of interpretability (2022)
TAI safety bibliography (2021)
Akash and Larsen (2022)
Karnofsky prosaic plan (2022)
Steinhardt (2019)
Russell (2016)
FLI (2017)
Shah (2020)
Things which claim to be agendas
Tekofsky listing indies (2023)
This thing called boundaries (2023)
Sharkey et al (2022), mech interp
FLI Governance scorecard
The money
Appendix: Graveyard
Ambitious value learning?
MIRI youngbloods (see Hebbar)
John Wentworth's selection theorems??
Provably Beneficial Artificial Intelligence (but see Open Agency and Omohundro)
HCH (see QACI)
IDA → Critiques and recursive reward modelling
Debate is now called Critiques and ERO
Market-making (Hubinger)
Logical inductors
Conditioning Predictive Models: Risks and Strategies?
Impact measures, conservative agency, side effects → “power aversion”
Acceptability Verification
Quantilizers
Redwood interp?
AI Safety Hub
Enabling Robots to Communicate their Objectives (early stage interp?)
Aligning narrowly superhuman models (Cotra idea; tiny followup; lives on as scalable oversight?)
Automation of semantic interpretability (i.e. automatically proposing hypotheses instead of just automatically verifying them)
Oracle AI is not dead so much as ~everything LLM falls under its purview
Tool AI, similarly, ~is LLMs plugged into a particular interface
EleutherAI: #accelerating-alignment—AI alignment assistants are live, but it doesn’t seem like EleutherAI is currently working on this
Algorithm Distillation Interpretability (Levoso)
Appendix: Biology for AI alignment
Lots of agendas but not clear if anyone besides Byrnes and Thiergart are actively turning the crank. Seems like it would need a billion dollars.
Human enhancement
One-sentence summary: maybe we can give people new sensory modalities, or much higher bandwidth for conceptual information, or much better idea generation, or direct interface with DL systems, or direct interface with sensors, or transfer learning, and maybe this would help. The old superbaby dream goes here I suppose.
Theory of change: maybe this makes us better at alignment research
Merging
One-sentence summary: maybe we can form networked societies of DL systems and brains
Theory of change: maybe this lets us preserve some human values through bargaining or voting or weird politics.
Some names: Cyborgism, Millidge, Dupuis
As alignment aid
One-sentence summary: maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively (a toy activation-steering sketch follows below), and maybe we can crack the true human reward function / social instincts and adapt some of them for AGI.
Theory of change: as you’d guess
Some names: Byrnes, Cvitkovic, Foresight's BCI; also (list from Byrnes): Eli Sennesh, Adam Safron, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin
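To make the activation-engineering idea concrete, here is a minimal sketch, not taken from any of the agendas above: it assumes a HuggingFace GPT-2, and it derives the steering direction from a contrast pair of prompts rather than from brain data, which is where the proposals above would differ.

```python
# Minimal activation-steering sketch. Assumptions: HuggingFace transformers + GPT-2;
# the layer index and steering coefficient are arbitrary illustrative choices, and the
# steering vector comes from a prompt contrast pair (a stand-in for a brain-derived signal).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # which transformer block's output to steer (hypothetical choice)

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token at the output of block LAYER."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1]

# Direction pointing from "negative" to "positive" activations.
steer = residual_at("I love this") - residual_at("I hate this")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state.
    hidden = output[0] + 4.0 * steer  # 4.0 is a hypothetical steering coefficient
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = model.generate(**tok("The movie was", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(ids[0]))
handle.remove()
```

The brain-data variants above would swap the contrast-pair direction for one decoded from neural recordings or fast human feedback; the mechanics of injecting it would be the same.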
Appendix: Research support orgs
One slightly confusing class of org is described by the sample {CAIF, FLI}. Often run by active researchers with serious alignment experience, but usually not following an obvious agenda, delegating a basket of strategies to grantees, doing field-building stuff like NeurIPS workshops and summer schools.
CAIF
One-sentence summary: support researchers making differential progress in cooperative AI (eg precommitment mechanisms that can’t be used to make threats)
Some names: Lewis Hammond
Estimated # FTEs: 3
Some outputs in 2023: NeurIPS contest, summer school
Funded by: Polaris Ventures
Critiques:
Trustworthy command, closure, opsec, common good, alignment mindset: ?
Resources: £2,423,943
AISC
One-sentence summary: entry point for new researchers to test fit and meet collaborators. More recently focussed on a capabilities pause. Still going!
Some names: Remmelt Ellen, Linda Linsefors
Estimated # FTEs: 2
Some outputs in 2023: tag
Funded by: ?
Critiques: ?
Trustworthy command, closure, opsec, common good, alignment mindset: ?
Resources: ~$200,000
See also:
FLI
AISS
PIBBSS
Gladstone AI
Apart
Catalyze
EffiSciences
Students (Oxford, Harvard, Groningen, Bristol, Cambridge, Stanford, Delft, MIT)
Appendix: Meta, mysteries, more
Metaphilosophy (2023) – Wei Dai, probably <1 FTE
Encultured—making a game
Bricman: https://compphil.github.io/truth/
Algorithmic Alignment
McIlraith: side effects and human-aware AI
how hard is alignment?
Intriguing
The UK government now has an evals/alignment lab of sorts (Xander Davies, Nitarshan Rajkumar, Krueger, Rumman Chowdhury)
The US government will soon have an evals/alignment lab of sorts, USAISI. (Elham Tabassi?)
A Roadmap for Robust End-to-End Alignment (2018)
Not theory but cool
Greene Lab's Cognitive Oversight? The closest they've got is AI governance, and they only got $13,800
GATO Framework seems to have a galaxy-brained global coordination and/or self-aligning AI angle
Wentworth 2020 - a theory of abstraction suitable for embedded agency.
https://www.aintelope.net/
ARAAC—Make RL agents that can justify themselves (or other agents with their code as input) in human language at various levels of abstraction (AI apology, explainable RL for broad XAI)
Modeling Cooperation
Sandwiching (ex)
Tripwires
Meeseeks AI
Shutdown-seeking AI
Decision theory (Levinstein)
CAIS: Machine ethics—modelling intrinsic goods and normative factors, as opposed to mere task preferences, since these will remain relevant even under extreme world changes; this helps us avoid proxy misspecification as well as value lock-in. Present state of research unknown (some relevant bits in RepEng).
CAIS: Power Aversion—giving models incentives to avoid gaining more power than necessary. Related to mild optimisation and power-seeking. Present state of research unknown. (A toy mild-optimisation sketch follows below.)
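On mild optimisation: the quantilizer (listed in the graveyard above) is the standard toy formalisation, so here is a minimal sketch, assuming a uniform base distribution over actions; the action space, proxy utility, and choice of q are hypothetical.

```python
# Toy quantilizer: instead of taking the argmax of a proxy utility, sample uniformly
# from the top-q fraction of actions. With small q this still scores well on the proxy,
# the hope being that it limits extreme optimisation of a misspecified proxy
# (including power-seeking behaviour the argmax might favour).
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top-q fraction of `actions` ranked by `utility`."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# Hypothetical usage: 1000 candidate actions scored by a noisy proxy utility.
actions = list(range(1000))
def proxy_utility(a):
    return -abs(a - 700) + random.gauss(0, 5)

print("argmax picks:", max(actions, key=proxy_utility))
print("quantilizer picks:", quantilize(actions, proxy_utility, q=0.05))
```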