Appendices to the live agendas
Lists cut from our main post, in a token gesture toward readability.
We list past reviews of alignment work, ideas which seem to be dead, the cool but neglected neuroscience / biology approach, various orgs which don’t seem to have any agenda, and a bunch of things which don’t fit elsewhere.
Appendix: Prior enumerations
AI Plans (2023), mostly irrelevant
Krakovna (2022), Paradigms of AI alignment
Sharkey et al (2022) on interp
Karnofsky prosaic plan (2022)
Appendix: Graveyard
Ambitious value learning?
MIRI youngbloods (see Hebbar)
Provably Beneficial Artificial Intelligence (but see Open Agency and Omohundro)
HCH (see QACI)
IDA → Critiques and recursive reward modelling
Debate is now called Critiques and ERO
Market-making (Hubinger)
Logical inductors
Impact measures, conservative agency, side effects → “power aversion”
Quantilizers
Redwood interp?
Enabling Robots to Communicate their Objectives (early-stage interp?)
Aligning narrowly superhuman models (Cotra idea; tiny follow-up; lives on as scalable oversight?)
automation of semantic interpretability
i.e. automatically proposing hypotheses instead of just automatically verifying them
Oracle AI is not so much dead as subsumed: ~everything LLM falls under its purview
Tool AI, similarly, ~is LLMs plugged into a particular interface
EleutherAI: #accelerating-alignment—AI alignment assistants are live, but it doesn’t seem like EleutherAI is currently working on this
Appendix: Biology for AI alignment
Lots of agendas, but not clear if anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.
Human enhancement
One-sentence summary: maybe we can give people new sensory modalities, or much higher bandwidth for conceptual information, or much better idea generation, or direct interfaces with DL systems or with sensors, or transfer learning, and maybe this would help. The old superbaby dream goes here, I suppose.
Theory of change: maybe this makes us better at alignment research
Merging
One-sentence summary: maybe we can form networked societies of DL systems and brains
Theory of change: maybe this lets us preserve some human values through bargaining or voting or weird politics.
As an alignment aid
One-sentence summary: maybe we can get really high-quality alignment labels from brain data; maybe we can steer models by training humans to do activation engineering fast and intuitively (a toy sketch of activation steering appears at the end of this appendix); maybe we can crack the true human reward function / social instincts and adapt some of them for AGI.
Theory of change: as you’d guess
Some names: Byrnes, Cvitkovic, Foresight's BCI. Also (a list from Byrnes): Eli Sennesh, Adam Safron, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin
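Since "activation engineering" may be unfamiliar, here is a minimal sketch of what activation steering looks like in practice: compute a direction from a contrast pair of prompts and add it to a transformer block's output at inference time. This is an illustrative toy using HuggingFace GPT-2; the layer index, contrast prompts, and scale factor are arbitrary assumptions, and it is not a description of any particular group's method.

```python
# Toy activation steering ("activation engineering") sketch on GPT-2.
# Layer, prompts, and scale are arbitrary choices for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which block's output to steer (assumption)

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1)  # shape (1, d_model)

# A contrast pair of prompts defines the steering direction.
steer = mean_resid("I love this") - mean_resid("I hate this")

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + 4.0 * steer,) + output[1:]
    return output + 4.0 * steer

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("The movie was", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(gen[0]))
finally:
    handle.remove()  # always detach the hook afterwards
```

The speculative part of the idea above is not the mechanics of adding a vector, but whether humans could learn to pick good steering directions quickly and intuitively.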
Appendix: Research support orgs
One slightly confusing class of org is described by the sample {CAIF, FLI}: often run by active researchers with serious alignment experience, but usually not following an obvious agenda; instead they delegate a basket of strategies to grantees and do field-building work like NeurIPS workshops and summer schools.
CAIF (Cooperative AI Foundation)
One-sentence summary: support researchers making differential progress in cooperative AI (e.g. precommitment mechanisms that can't be used to make threats)
Some names: Lewis Hammond
Estimated # FTEs: 3
Some outputs in 2023: NeurIPS contest, summer school
Funded by: Polaris Ventures
Critiques: ?
Trustworthy command, closure, opsec, common good, alignment mindset: ?
Resources: £2,423,943
AI Safety Camp (AISC)
One-sentence summary: entry point for new researchers to test fit and meet collaborators. More recently focussed on a capabilities pause. Still going!
Some names: Remmelt Ellen, Linda Linsefors
Estimated # FTEs: 2
Some outputs in 2023: tag
Funded by: ?
Critiques: ?
Trustworthy command, closure, opsec, common good, alignment mindset: ?
Resources: ~$200,000
See also:
Appendix: Meta, mysteries, more
Metaphilosophy (2023) – Wei Dai, probably <1 FTE
Encultured—making a game
McIlraith: side effects and human-aware AI
The UK government now has an evals/alignment lab of sorts (Xander Davies, Nitarshan Rajkumar, Krueger, Rumman Chowdhury)
The US government will soon have an evals/alignment lab of sorts, USAISI. (Elham Tabassi?)
Greene Lab Cognitive Oversight? Closest they’ve got is AI governance and they only got $13,800
GATO Framework seems to have a galaxy-brained global coordination and/or self-aligning AI angle
Wentworth (2020) – a theory of abstraction suitable for embedded agency.
ARAAC—make RL agents that can justify themselves (or justify other agents, given their code as input) in human language at various levels of abstraction (AI apology, explainable RL for broad XAI)
Sandwiching (ex)
Decision theory (Levinstein)
CAIS: Machine ethics—model intrinsic goods and normative factors, rather than mere task preferences, since these will remain relevant even under extreme world changes; this helps us avoid proxy misspecification as well as value lock-in. Present state of research unknown (some relevant bits in RepEng).
CAIS: Power Aversion—incentivise models to avoid gaining more power than necessary. Related to mild optimisation and power-seeking. Present state of research unknown.
Honestly this isn’t that long, I might say to re-merge it with the main post. Normally I’m a huge proponent of breaking posts up smaller, but yours is literally trying to be an index, so breaking a piece off makes it harder to use.
yeah you’re right
For what it’s worth, I am not doing (and have never done) any research remotely similar to your text “maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively”.
I have a concise and self-contained summary of my main research project here (Section 2).
I care a lot! Will probably make a section for this in the main post under “Getting the model to learn what we want”, thanks for the correction.