MATS AI Safety Strategy Curriculum
As part of the MATS Winter 2023-24 Program, scholars were invited to take part in a series of weekly discussion groups on AI safety strategy. Each strategy discussion focused on a specific crux we deemed relevant to prioritizing AI safety interventions and was accompanied by a reading list and suggested discussion questions. The discussion groups were facilitated by several MATS alumni and other AI safety community members and generally ran for 1-1.5 h.
As assessed by our alumni reviewers, scholars in our Summer 2023 Program were much better at writing concrete plans for their research than they were at explaining their research’s theory of change. We think it is generally important for researchers, even those early in their career, to critically evaluate the impact of their work, to:
Choose high-impact research directions and career pathways;
Conduct adequate risk analyses to mitigate unnecessary safety hazards and avoid research with a poor safety-capabilities advancement ratio;
Discover blindspots and biases in their research strategy.
We expect that the majority of improvements to the above areas occur through repeated practice, ideally with high-quality feedback from a mentor or research peers. However, we also think that engaging with some core literature and discussing with peers is beneficial. This is our attempt to create a list of core literature for AI safety strategy appropriate for the average MATS scholar, who should have completed the AISF Alignment Course.
We are not confident that the reading lists and discussion questions below are the best possible version of this project, but we thought they were worth publishing anyways. MATS welcomes feedback and suggestions for improvement.
Week 1: How will AGI arise?
What is AGI?
Karnofsky—Forecasting Transformative AI, Part 1: What Kind of AI? (13 min)
Metaculus—When will the first general AI system be devised, tested, and publicly announced? (read Resolution Criteria) (5 min)
How large will models need to be and when will they be that large?
Alexander—Biological Anchors: The Trick that Might or Might Not Work (read Parts I-II) (27 min)
Optional: Davidson—What a compute-centric framework says about AI takeoff speeds (20 min)
Optional: Habryka et al. - AI Timelines (dialogue between Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil) (61 min)
Optional: Halperin, Chow, Mazlish—AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years (31 min)
How far can current architectures scale?
Patel—Will Scaling Work? (16 min)
Epoch—AI Trends (5 min)
Optional: Nostalgebraist—Chinchilla’s Wild Implications (13 min)
Optional: Porby—Why I think strong general AI is coming soon (40 min)
What observations might make us update?
Optional: Berglund et al. - Taken out of context: On measuring situational awareness in LLMs (33 min)
Optional: Cremer, Whittlestone—Artificial Canaries: Early Warning Signs for Anticipatory and Democratic Governance of AI (34 min)
Suggested discussion questions
If you look at any of the outside view models linked in “Biological Anchors: The Trick that Might or Might Not Work” (e.g., Ajeya Cotra’s and Tom Davidson’s models), which of their quantitative estimates do you agree or disagree with? Do your disagreements make your timelines longer or shorter?
Do you disagree with the models used to forecast AGI? That is, rather than disagree with their estimates of particular variables, do you disagree with any more fundamental assumptions of the model? How does that change your timelines, if at all?
If you had to make a probabilistic model to forecast AGI, what quantitative variables would you use and what fundamental assumptions would your model rely on?
How should estimates of when AGI will happen change your research priorities if at all? How about the research priorities of AI safety researchers in general? How about the research priorities of AI safety funders?
Will scaling LLMs + other kinds of scaffolding be enough to get to AGI? What about other paradigms? How many breakthroughs around as difficult as the transformer architecture are left, if any?
How should the kinds of safety research we invest in change depending on whether scaling LLMs + scaffolding will lead to AGI?
How should your research priorities change depending on how uncertain we are about what paradigm will lead to AGI, if at all? How about the priorities of AI safety researchers in general?
How could you tell if we were getting closer to AGI? What concrete observations would make you think we don’t have more than 10 years left? How about 5 years? How about 6 months?
Week 2: Is the world vulnerable to AI?
Conceptual frameworks for risk: What kinds of technological advancements is the world vulnerable to in general?
Bostrom—The Unilateralist’s Curse and the Case for a Principle of Conformity (5-15 min)
This is a pretty simple statistical model. You should only read enough to understand the model, eg, why would a principle of conformity help; why might research get done even if 99% of experts think it is a terrible idea; why does adding more people capable of doing the research make it more likely that the research gets done?
Bostrom—The Vulnerable World Hypothesis (15 min)
You should read enough to understand the “urn model”. It is also worth looking over the typology of vulnerabilities.
Optional: Aschenbrenner—Securing posterity (14 min)
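For readers who find a toy calculation easier to reason about than prose, here is a minimal sketch of the kind of statistical model the unilateralist’s curse reading describes. It is an illustration under stated assumptions, not Bostrom’s own formulation: each agent’s estimate is assumed to be the true value plus independent Gaussian noise, and the function names and the majority rule standing in for a “principle of conformity” are hypothetical choices made for this example.

```python
from math import comb
from statistics import NormalDist


def p_agent_favors(true_value: float, noise_sd: float) -> float:
    """Chance that one agent's noisy estimate of the action's value is positive.

    Assumption: estimate = true_value + Normal(0, noise_sd) error.
    """
    return 1.0 - NormalDist(mu=true_value, sigma=noise_sd).cdf(0.0)


def p_done_unilaterally(p_favor: float, n_agents: int) -> float:
    """The action happens if at least one of n independent agents favors it."""
    return 1.0 - (1.0 - p_favor) ** n_agents


def p_done_by_majority(p_favor: float, n_agents: int) -> float:
    """The action happens only if a strict majority favors it.

    This is a crude, hypothetical stand-in for a principle of conformity.
    """
    needed = n_agents // 2 + 1
    return sum(
        comb(n_agents, k) * p_favor**k * (1.0 - p_favor) ** (n_agents - k)
        for k in range(needed, n_agents + 1)
    )


if __name__ == "__main__":
    # A genuinely bad idea: the true value is one noise standard deviation below zero.
    p = p_agent_favors(true_value=-1.0, noise_sd=1.0)
    for n in (1, 5, 25, 100):
        print(n, round(p_done_unilaterally(p, n), 3), round(p_done_by_majority(p, n), 3))
```

Under these assumptions, if each agent independently has only a 1% chance of favoring the action, then with 100 capable agents the chance that at least one of them acts is roughly 63%. Adding agents pushes the unilateral probability up, while the majority rule pushes it down when most agents correctly judge the action to be bad.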
Attack vectors: How might AI cause catastrophic harm to civilization?
Hilton—What could an AI-caused existential catastrophe actually look like? (11 min)
Seger et al. - Open Sourcing Highly Capable Foundation Models (section: “Risks of Open-Sourcing Foundation Models”) (16 min)
Longlist of possible “attack vectors”:
Cyberweapons
WormGPT enables hacking at scale
Palisade Research showed that ChatGPT can hack an unpatched Windows machine with self-prompted chain-of-thought
Spear phishing attacks will become extremely sophisticated and scalable
AI cyber-defenders might be too weak or disfavored by offence-defence balance to stop takeover, or dangerous for similar reasons (and thus might engage in trade with rogue AI)
Bioweapons
DL is great at predicting protein structure and new chemical weapons
LLMs could help novices build bioweapons
Labs might be increasingly automated due to financial incentives, allowing for hostile takeover and bioweapon experimentation by AIs
Mass persuasion/manipulation
Human persuasion by AI systems will likely be powerful
Blake Lemoine thought LaMDA was sentient
Russia’s interference in the US 2016 election using chatbots was effective
People are already in love with Replika (i.e., like the movie “Her”)
Cicero beats humans at Diplomacy
Autonomous weapons
AI beats human pilots at real aerial dogfights
AI beats human world-champions at real drone racing
Optional: Burtell, Woodside—Artificial Influence: An Analysis Of AI-Driven Persuasion (25 min)
Optional: 1a3orn—Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk (29 min)
Optional: Lawsen—AI x-risk, approximately ordered by embarrassment (23 min)
AI’s unique threat: What properties of AI systems make them more dangerous than malicious human actors?
Bostrom—Superintelligence (Chapter 6: Cognitive superpowers) (up to “Power over nature and agents”) (14 min)
Optional: Christiano—What failure looks like (10 min)
Optional: Barak, Edelman—AI will change the world, but won’t take it over by playing “3-dimensional chess” (28 min)
Suggested discussion questions
How do ML technologies interact with the unilateralist’s curse model? If you were going to use the unilateralist’s curse model to make predictions about what a world with more adoption of ML technologies would look like, what predictions would you make?
How do ML technologies interact with the vulnerable world hypothesis model? Which type in the typology of vulnerabilities section do ML technologies fall under? Are there any special considerations specific to ML technologies that should make us treat them as not just another draw from the urn?
What are the basic assumptions of the urn model of technological development? Are they plausible?
What are the basic assumptions of the unilateralist’s curse model? Are they plausible?
How is access to LLMs or other ML technologies different from access to the internet with regard to democratizing dual-use technologies, if it is at all?
Are there other non-obvious dual-use technologies that access to ML technologies might democratize?
In Karnofsky’s case for the claim that AI could defeat all of us combined, what are the basic premises? What sets of these premises would have to turn out false for the conclusion to no longer follow? How plausible is it that Karnofsky is making some sort of mistake? (Note that Karnofsky is explicitly arguing for a much weaker claim than “AI will defeat all of us combined”).
Suppose that we do end up with a world where we have ML systems that can get us a lot of anything we can measure, would this be bad? Is it plausible that the benefits of such a technology could outweigh the costs? What are the costs exactly?
Optional: In “What failure looks like”, Paul Christiano paints a particular picture of a world in which the development and adoption of ML technologies goes poorly. Is this picture plausible? What are the assumptions that it rests on? Are these assumptions plausible? What would a world with fast ML advancement and adoption look like if it turns out that some set of these assumptions is false?
Week 3: How hard is AI alignment?
What is alignment?
Christiano—Clarifying Alignment (5 min)
Arbital—AI Alignment (5 min)
Optional: Christiano—Corrigibility (8 min)
Optional: Shah—What is ambitious value learning? (3 min)
How likely is deceptive alignment?
Carlsmith—Scheming AIs: Will AIs fake alignment during training in order to get power? (Summary) (35 min)
Optional: read the full report
The full report is extremely good in that it takes a close look at a lot of the considerations that should inform our estimate of how likely deceptive alignment is. If you haven’t spent much time thinking about this, I recommend reading the full report if you have the time.
What is the distinction between inner and outer alignment? Is this a useful framing?
Hubinger—The Inner Alignment Problem (15 min)
Turner—Inner and outer alignment decompose one hard problem into two extremely hard problems (read extended summary, up to Section I) (7 min)
Optional: read the rest of the post (49 min)
How many tries do we get, and what’s the argument for the worst case?
Christiano—Where I agree and disagree with Eliezer (22 min)
In the above article, Christiano is responding to the article linked below. Both articles are good, but if you have to read only one, I think it’s better to read the Christiano one.
Optional: Yudkowsky—AGI Ruin: A List of Lethalities (36 min)
How much do alignment techniques for SOTA models generalize to AGI? What does that say about how valuable alignment research on present day SOTA models is?
Ruthenis—Current AIs Provide Nearly No Evidence Relevant to AGI Alignment (10 min)
Optional: Comment Thread on Previous Post (5-15 min)
Suggested discussion questions
What are the differences between Christiano’s concept of “intent alignment” and Arbital’s concept of “alignment for advanced agents”? What are the advantages and disadvantages of framing the problem in either way?
Is “gradient hacking” required for AI scheming?
What are the key considerations that make deceptive alignment more or less likely?
Is it likely that alignment techniques for current gen models will generalize to more capable models? Does it make sense to focus on alignment strategies that work for current gen models anyway? If so, why?
Suppose that we were able to get intent alignment in models that are just barely more intelligent than human AI safety researchers, would that be enough? Why or why not?
Why is learned optimization inherently more dangerous than other kinds of learned algorithms?
Under what sorts of situations should we expect to encounter learned optimizations?
Imagine that you have full access to a model’s weights and an arbitrarily large but finite amount of compute and time. How can you tell whether a given model contains a mesaoptimizer?
How is the concept of learned optimization related to the concepts of deceptive alignment or scheming? Can you have one without the other? If so, how?
Can you come up with stories where a model was trained and behaved in a way that was not intent aligned with its operators, but it’s not clear whether this counts as a case of inner misalignment or outer misalignment?
What are the most important points of disagreement between Eliezer Yudkowsky and Paul Christiano? How should we change how we prioritize different research programs depending on which side of such disagreements turns out correct?
Week 4: How should we prioritize AI safety research?
What is an “alignment tax” and how do we reduce it?
Christiano—Current Work In AI Alignment (31 min)
Here is a transcript of the talk if you prefer that over a video: Current Work In AI Alignment
This talk has a lot of important content, and for that reason it appears in two places in this curriculum. It appears here for its discussion of alignment taxes, and it appears in the next section for its discussion of “handoffs”. It’s also worth taking a close look at the directed acyclic graph (DAG) that Christiano uses to frame his talk.
What kinds of alignment research will we be able to delegate to models if any?
Christiano—Current Work In AI Alignment (31 min, already counted above)
How should we think about prioritizing work within the control paradigm in comparison to work with the alignment paradigm?
Shlegeris, et al. - AI Control: Improving Safety Despite Intentional Subversion (12 min)
This blogpost summarizes some of Shlegeris and collaborators’ recent work, but we are including it mostly because of how it highlights its relationship to more traditional safety work. I recommend paying particular attention to that section and sections after it.
Optional: Greenblatt, et al. - The case for ensuring that powerful AIs are controlled
How should we prioritize alignment research in light of the amount of time we have left until transformative AI?
Karnofsky—How might we align transformative AI if it’s developed very soon? (54 min)
This is very long, so it might be worth skimming the sections you find most interesting instead of reading the whole thing carefully. That said, I am including it because it does a good job of walking through the potential strategies and potential pitfalls of a concrete transformative AI scenario in the near future.
Hubinger—A transparency and interpretability tech tree (21 min)
Optional: Charbel-Raphaël—Against Almost Every Theory of Impact of Interpretability
How should you prioritize your research projects in light of the amount of time you have left until transformative AI?
Kidd—Aspiring AI safety researchers should ~argmax over AGI timelines (5 min)
I am including this not because I endorse the methodology exactly, but because it is a good example of taking a seemingly intractable personal prioritization problem and breaking it down into more concrete questions, rendering the problem much easier to think about.
Shlegeris—A freshman year during the AI midgame: my approach to the next year (5 min)
This is a fairly personal post, but I think it gives a good example of how to think thoughtfully about prioritizing your research projects while also being kind to yourself.
Suggested discussion questions
Look at the DAG from Paul Christiano’s talk (you can find an image version in the transcript of the talk). What nodes are missing from this DAG that seem important to you to highlight? Why are they important?
What nodes from Christiano’s DAG does your research feed into? Does it feed into several parts? The most obvious node for alignment research to feed into is the “reducing the alignment tax” node. Are there ways your research could also be upstream of other nodes? What about other research projects you are excited about?
It might be especially worth thinking about both of the above questions before you come to the discussion group.
How does research within the control paradigm fit into Christiano’s DAG?
What kinds of research make sense under the control paradigm which do not under the alignment paradigm?
It seems like there may be a sort of chicken and egg problem for alignment plans that involve creating an AI to do alignment research: you use an AI to align your AI, but the AI you use for this needs to already be aligned. Is this a real problem? What could go wrong if you used an unaligned AI to align your AI? Are things likely to go wrong in this way? What are some ways that you could get around the problem?
Looking at Evan Hubinger’s interpretability/transparency tech tree, do you think there are nodes that are missing?
It’s been six months since Hubinger published his tech tree. Have we unlocked any new nodes on the tech tree since then?
What would a tech tree for a different approach, eg control, look like?
Week 5: What are AI labs doing?
How are the big labs approaching AI alignment and AI risk in general?
Anthropic:
Responsible Scaling Policies (30 min)
I recommend specifically spending more time on the ASL-3 definition and commitments.
DeepMind:
OpenAI:
Our approach to frontier risk (10 min)
How are small non-profit research orgs approaching AI alignment and AI risk in general?
METR: Landing page
This is just the landing page of their website, but it’s a pretty good explanation of their high level strategy and priorities.
Redwood Research: Research Page
You all already got a bunch of context on what Redwood is up to thanks to their lectures, but here is a link to their “Our Research” page on their website anyway.
Conjecture: Research Page
General summaries:
Larsen, Lifland - (My understanding of) what everyone is doing and why
This post is sort of old by ML standards, but I think it is currently still SOTA as an overview of what all the different research groups are doing. Maybe you should write a newer and better one.
This post is also very long. I recommend skimming it and keeping it as a reference rather than trying to read the whole thing in one sitting.
Suggested discussion questions
Are there any general differences that you notice between Anthropic, DeepMind, and OpenAI’s approaches to alignment or other safety mechanisms? How could you summarize these differences? Where are their points of emphasis different? Are their primary threat models different? If so, how?
Are there any general differences that you notice between how the big labs (eg, Anthropic, OpenAI) and smaller non-profit orgs (eg, ARC, METR) approach alignment or other safety mechanisms? How could you summarize those differences? Where are their points of emphasis different?
Can you summarize the difference between Anthropic’s RSPs and OpenAI’s RDPs? Do the differences seem important or do they seem like a sort of narcissism of small differences kind of deal?
What is an ASL? How do Anthropic define ASL-2 and ASL-3? What commitments do Anthropic make regarding ASL-3 models?
Would the concept of ASL still make sense in OpenAI’s RDP framework? Would you have to adjust it in any way?
Do you think that Anthropic’s commitments on ASL-3 are too strict, not strict enough, or approximately right? Relatedly, do you expect that they will indeed follow through on these commitments before training/deploying an ASL-3 model?
How would you define ASL-4 if you had to? You as a group have 10 minutes and the world is going to use your definition. Go!
Ok cool, good job, now that you’ve done that, what commitments should labs make relating to training, deploying, selling fine tuning access to, etc, an ASL-4 model? You again have 10 minutes, and the world is depending on you. Good luck!
Week 6: What governance measures reduce AI risk?
Should we try to slow down or stop frontier AI research through regulation?
lc—What an actually pessimistic containment strategy looks like (10 min)
This is a very argumentative post, but I think it presents an interesting frame.
Karnofsky—We’re Not Ready: Thoughts on “pausing” and responsible scaling policies (10 min)
1a3orn—Ways I Expect AI Regulation To Increase Extinction Risk (10 min)
Optional: Grace—Let’s think about slowing down AI (20 min)
This post is now two years old, but it genuinely was surprisingly revolutionary in the space two years ago.
It might be worth coming back to read the second part of this post on AI race dynamics after you have read Carl Shulman’s piece in the third section.
What AI governance levers exist?
Regulating chips: Balwit—How We Can Regulate AI (10 min)
Safety standards/regulations: AISF—Primer on Safety Standards and Regulations for Industrial-Scale AI Development (5 min)
Making labs liable: Arnold—AI Insight Forum: Privacy and Liability (2 min)
What catastrophes uniquely occur in multipolar AGI scenarios?
Christiano—What failure looks like (20 min)
Many of you may have already read this, but if you haven’t already read it many times, it seems pretty fundamental to me, and it might be worth reading again.
Optional: Critch—What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) (30 min)
Suggested discussion questions
The posts from lc, Karnofsky, and 1a3orn are in descending order of optimism about regulation. How optimistic are you about the counterfactual impact of regulation?
What are some things you could observe or experiments you could run that would change your mind?
1a3orn’s post paints a particular picture of how regulation might go wrong. How plausible is this picture? What are the key factors that might make a world like this more or less likely given concerted efforts at regulation?
What are other ways that regulation might backfire, if there are any?
Regulation might be a good idea, but what about popular movement building? How might such efforts fail or make things worse?
If you got to decide when transformative AI or AGI will first be built, what would be the best time for that to happen? Imagine that you are changing little else about the world. Suppose that the delay is caused only by the difficulty of making AGI, or other contingent factors, like less investment.
Is your ideal date before or after you expect AGI to in fact be developed?
What key beliefs would you need to change your mind about for you to change your mind about when it would be best for AGI to be developed?
Under what assumptions does it make sense to model the strategic situation of labs doing frontier AI research as an arms race? Under what set of assumptions does it not make sense? What do the payoff matrices look like for the relevant actors under either set of assumptions?
Which assumptions do you think it makes the most sense to use? Which assumptions do you think the labs are most likely to use? If your answers to these questions are different, how do you explain those differences?
How do race dynamics contribute to the likelihood of ending up in a multipolar scenario like the ones described in Christiano’s and Critch’s posts, if they do at all?
Week 7: What do positive futures look like?
Note: attending discussion this week was highly optional.
What near-term positive advancements might occur if AI is well-directed?
Universal provision of basic needs
Medical advancements for healthspan and longevity
More informed and empowered democratic citizens
What values might we want to actualize with the aid of AI?
Universal basic rights and self-determination
Positive experience and non-suffering
Bostrom—Letter from Utopia (11 min)
Optional: Yudkowsky—31 Laws of Fun
Cosmopolitanism (non-racism, non-ageism, non-speciesism, etc.)
What (very speculative) long-term futures seem possible and promising?
Long reflection on value and meaning
Optional: Ord—The Precipice: Chapter 7, Safeguarding Humanity (40 min)
Digital societies and AI sentience
Growing life throughout the universe
Rational Animations—How to Take Over the Universe (in Three Easy Steps) (18 min)
Rational Animations—Humanity was born way ahead of its time. The reason is grabby aliens (13 min)
Optional: Rational Animations—Will we grab the universe? Grabby aliens predictions.
Highly recommended!
Optional: Hanson et al. - If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare
Suggested discussion questions
If everything goes well, how do we expect AI to change society in 10 years? What about 50 years?
What values would you like to actualize in the world with the aid of AI?
If we build sentient AIs, what rights should those AIs have? What about human minds that have been digitally uploaded?
Are positive futures harder to imagine than dystopias? If so, why would that be?
In the “ship of Theseus” thought experiment, the ship is replaced, plank-by-plank, until nothing remains of the original ship. If humanity’s descendants are radically different from current humans, do we consider their lives and values to be as meaningful as our own? How should we act if we can steer what kind of descendants emerge?
What current human values/practices could you imagine seeming morally repugnant to our distant descendants?
Would you hand control of the future over to a benevolent AI sovereign? Why/why not?
We might expect that especially over the long term, human values might change a lot. This is sometimes called “value drift”. Is there a reason to be more concerned about value drift caused by AIs or transhumans than from human civilization developing as it would otherwise?
Acknowledgements
Ronny Fernandez was chief author of the reading lists and discussion questions, Ryan Kidd planned, managed, and edited this project, and Juan Gil coordinated the discussion groups. Many thanks to the MATS alumni and other community members who helped as facilitators!
I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found it helpful.
https://www.lesswrong.com/posts/3pinFH3jerMzAvmza/on-how-various-plans-miss-the-hard-bits-of-the-alignment#comments
Seems like something important to be aware of, even if they may disagree.