Ways to buy time
In our last post, we claimed:
On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).
But what does “buying time” actually look like? In this post, we list some interventions that have the potential to buy time (some of which also have other benefits, like increasing coordination, accelerating community growth, and reducing the likelihood that labs deploy dangerous systems).
If you are interested in any of these, please reach out to us. Note also that Thomas has a list of specific technical projects (with details about how they would be implemented), and he is looking for collaborators.
Summary
Ideas to buy time:
Direct outreach to AGI researchers
Develop new resources that make AI x-risk arguments & problems more concrete
Demonstrate concerning capabilities & alignment failures
Break and redteam alignment proposals (especially those that will likely be used by major AI labs)
Organize coordination events
Support safety and governance teams at major AI labs
Develop and promote reasonable safety standards for AI labs
Other ideas
Summary table:

| Intervention | Why it buys time | Other possible benefits | Technical or non-technical |
| --- | --- | --- | --- |
| Direct outreach (through written resources and 1-1 conversations) | Some ML/AGI researchers haven’t heard the core arguments around AI x-risk. Engaging with written resources will cause some of them to be more concerned about AI x-risk. Some ML/AGI researchers have heard the core arguments. 1-1 conversations can allow safety researchers to better understand & address their cruxes (and vice-versa). | Generates new critiques of existing alignment ideas, arguments, and proposals. More people do alignment research. Increased trust and coordination between labs and non-lab safety researchers. | Technical + non-technical (non-technical people can organize and support these efforts, with technical people being the ones actually having conversations, giving presentations, choosing outreach resources, etc.) |
| New resources | Many ML/AGI researchers have heard the theoretical ideas but want to see resources that are more concrete & grounded in empirical research. | Formalizes problems; makes it easier for ML experts & engineers to contribute to alignment research. | Technical |
| Concerning demonstrations of alignment failure | Many ML/AGI researchers would take AI x-risk more seriously if there were clear demonstrations of alignment failures. | If an AI lab is about to deploy a system that could destroy the world, a compelling demo might convince them not to deploy. More people do alignment research. | Technical |
| Break proposals | Many ML/AGI researchers believe that one (or more) existing alignment proposals will work. | Helps alignment researchers understand cruxes & prioritize between research ideas/agendas. | Technical |
| Coordination events | Similar to direct 1-1 outreach. Also, could lead to collaborations between safety-conscious individuals at different organizations. | Increased trust and coordination between labs and non-lab safety researchers. New critiques of existing alignment ideas, arguments, and proposals. | Technical + non-technical (technical people should be at these events, though non-technical people could organize them) |
| Support lab teams | Safety teams and governance teams at top AI labs can promote a safety culture at these labs and push for policies that slow down AGI research, reduce race dynamics, etc. | Increased trust and coordination between labs and non-lab safety researchers. | Non-technical |
| Lab safety standards | There seem to be some policies that, if implemented in a reasonable way, could extend timelines & reduce race dynamics. | Increased trust and coordination between labs and non-lab safety researchers. Some standards could reduce the likelihood that labs deploy dangerous systems (e.g., a policy that a system must first pass an interpretability check or deception check). | Non-technical |

Disclaimers
Feel free to skip this section if you are mainly interested in our proposed “buying time” ideas.
Disclaimer #1: Some of these interventions assume that timelines are largely a function of the culture at major AI labs. More specifically, we expect that timelines are largely a function of (a) the extent to which leaders and researchers at AI labs are concerned about AI x-risk, (b) the extent to which they have concrete interventions they can implement to reduce AI x-risk, and (c) how costly it is to implement those interventions.
Disclaimer #2a: We don’t spend much time arguing which of these interventions are most impactful. This is partly because many of these need to be executed by people with specific skill sets, so personal fit considerations will be especially relevant.
Nonetheless, we currently think that the following three areas are the most important:
Direct outreach to AGI researchers (more here)
Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Organize coordination events (more here)
Disclaimer #2b: The most important interventions do not necessarily need the most people. As an example, 1-2 (highly competent) teams organizing coordination events is likely sufficient to saturate the space, whereas we could see 5+ teams working on demonstrating alignment failures. Additionally, projects with minimal downside risks are best-suited to absorb the most people.
We currently think that the following three projects could absorb lots of talented people:
Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Develop new resources that make AI x-risk arguments & problems more concrete (more here)
Break and redteam alignment proposals (more here)
Disclaimer #3: Many of these interventions have serious downside risks. We also think many of them are difficult, and they only have a shot at working if they are executed extremely well by people who have (a) strong models of downside risks, (b) the ability to notice when their work might be accelerating AGI timelines, and (c) the ability to notice when their work is reducing their ability to think well & see the world clearly. See also Habryka’s comment here (and note that although we’re still excited about more people considering this kind of work, we agree with many of the concerns he lists, and we think people should understand these concerns deeply before performing this kind of work. Please feel free to reach out to us before doing anything risky).
Disclaimer #4: Many of these interventions have large benefits other than buying time. For the most part, we think that the main benefit from most of these interventions is their effect on buying time, but we won’t be presenting those arguments here.
Disclaimer #5: We have several “background assumptions” that inform our thinking. Some examples include (a) somewhat short AI timelines (AGI likely developed in 5-15 years), (b) high alignment difficulty (alignment by default is unlikely, and current approaches seem unlikely to work), and (c) there has been some work done in each of these areas, but there are opportunities to do things that are much more targeted & ambitious than previous/existing projects.
Disclaimer #6: This was written before the FTX crisis. We think the points still stand.
Ideas to buy time
Direct outreach to AGI researchers
The case for AGI risk is relatively nuanced and non-obvious. There is value in raising awareness about basic arguments about AI x-risk and why alignment might fail by default. This makes it easier for people to quickly understand the concerns about AI x-risk, which means that more people will buy in to alignment being hard.
Examples of work that we’d be excited to see disseminated more widely:
Superintelligence
Many MIRI analyses, including this talk and the 2022 MIRI Discussion
AGI safety fundamentals
The case for taking AI seriously as a threat to humanity
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Resources that make AI x-risk arguments more concrete (see here)
Note these resources, while a lot better than nothing, are still pretty far from ideal. In particular, we wish that there was an accessible version of AGI Ruin.
Additionally, written resources are often not sufficient to address the cruxes and worldviews of people who are performing AGI research. Individualized conversations between AGI researchers and knowledgeable alignment researchers could help address cruxes around safety.
It is important for these conversations to be conducted by people who are deeply familiar with AI alignment arguments and also have a strong understanding of the ML/AGI community. However, we think that non-technical people could play an important role in organizing these efforts (e.g., by setting up conversations between safety researchers and AGI researchers, setting up talks for technical people to give at major AI labs, and doing much of the logistics/ops/non-technical work required to coordinate an ambitious series of outreach activities).
Disclaimer: there is also a lot of downside risk here. Doing this type of outreach without adequate preparation or respect may cause the community to lose the respect of AGI researchers or make people confused about AI x-risk concerns. We encourage people interested in this work to reach out to us. We also suggest this post by Vael Gates and this post by Marius Hobbhahn. Note also that these posts focus on outreach to ML academics, whereas we’re most excited about well-conducted outreach efforts to AGI researchers at leading AI labs.
Develop new resources that make AI x-risk arguments & problems more concrete
Many of the existing AI x-risk resources focus on theoretical/conceptual arguments. Additionally, many of them were written before we knew much about deep learning or large language models.
Some people find these philosophical arguments compelling, but others demand evidence that is more concrete, more grounded in empirical research, and more rooted in the “engineering mindset.”
We believe there is a clear gap in the AI x-risk space right now: many theoretical and conceptual arguments can be discussed in the context of present-day AI systems, concretizing and strengthening the case for AI x-risk.
By creating better AGI risk resources, we can (a) find new alignment researchers and (b) get people who are building AGI to be more cautious and more safety-focused.
Examples of this work include:
Goal Misgeneralization in Deep Reinforcement Learning and Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals found concrete examples of inner misalignment in reinforcement learning settings.
Optimal Policies Tend to Seek Power formally demonstrated how power is an instrumentally convergent goal.
Scaling laws for Reward Model Overoptimization explored how putting too much optimization pressure on imperfect reward proxies resulted in failure to generalize, as is predicted by Goodhart’s law.
Specification gaming: the flip side of AI ingenuity described how often the reward functions given to RL agents can be ‘gamed’: the RL agent can take actions that achieve high reward but do not achieve the intended outcome of the designer.
Why AI alignment could be hard with modern deep learning describes how modern deep learning methods are likely to favor unaligned AI.
The alignment problem from a deep learning perspective grounds the alignment problem in a deep learning perspective.
X-risk analysis for AI research presents a concrete checklist that ML researchers can use to evaluate their research from an x-risk perspective.
Demonstrate concerning capabilities & alignment failures
Another way to make AI x-risk ideas more concrete is to actually observe problems in existing models. As an example, we might detect power-seeking behavior or deceptive tendencies in large language models. We’re sympathetic to the idea that some alignment failures may not occur until we get to AGI (especially if we expect a sharp increase in capabilities). But it seems plausible that at least some alignment failures could be identified with sub-AGI models.
Examples of this kind of work:
The Evaluations Project (led by Beth Barnes)
Beth’s team is trying to develop evaluations that help us understand when AI models might be dangerous.
We consider this to be a sufficiently ambitious intervention. If a reasonable evaluation was developed and Magma[1] decided to implement it, it could improve Magma’s ability to identify dangerous models. As a toy example, you could imagine Magma is about to deploy a model. Before they do so, they contact ARC, ARC implements an eval tool, and the eval tool reveals the model (a) has the power to change the world, (b) actively deceives humans in certain contexts, or (c) learns incorrect goals when implemented out-of-distribution. This eval could then lead Magma to (a) delay deployment and (b) work with ARC [and others] to figure out how to improve the model.
Encultured (led by Andrew Critch)
Encultured is trying to develop a video game that could serve as “an amazing sandbox for play-testing AI alignment & cooperation.”
From a “buying time” perspective, the theory of change seems very similar to that of the evals project. Encultured’s video game essentially serves as an eval. Magma could deploy its AI system in the video game, allowing them to detect undesirable behavior (e.g., power-seeking, deception), and then causing them to delay the deployment of their powerful model as they try to improve the safety/alignment of the model.
Ideas for new projects in this vein include:
A thorough analysis of whether deception achieves higher reward from RLHF and is therefore selected for.
Empirical demonstrations of various instrumentally convergent goals like self-preservation, power-seeking, and self-improvement. This could be especially interesting in a chain of thought language model that is operating at a high capabilities level and for which you can see the model’s reasoning for selecting instrumentally convergent actions. (Tamera Lanham’s externalized reasoning oversight agenda is an example of good work in this direction.)
An empirical analysis of Goodharting. Find a domain for which humans can’t give an exact reward signal, and then demonstrate all the difficulties that arise when working with an imperfect reward signal. This is similar to specification gaming and would be built on top of this.
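As a rough illustration of the Goodharting failure mode mentioned above, here is a minimal toy sketch. It is not an empirical analysis of any real model, and it is not taken from the cited work: the functions `true_reward` and `proxy_reward` are made-up assumptions, chosen so that the proxy agrees with the intended objective at first and then diverges from it as optimization pressure increases.

```python
"""Toy Goodhart sketch: pushing on a proxy past the point where it tracks the true objective."""


def true_reward(x: float) -> float:
    # Hypothetical "intended" objective: improves until x = 5, then degrades.
    return x - x * x / 10.0


def proxy_reward(x: float) -> float:
    # Hypothetical imperfect proxy: points in the right direction only while x < 5.
    return x


def optimize_proxy(steps: int, step_size: float = 1.0) -> list[tuple[float, float, float]]:
    """Gradient ascent on the proxy; records (x, proxy, true) after each step."""
    x = 0.0
    history = []
    for _ in range(steps):
        x += step_size  # d(proxy)/dx = 1, so each step simply increases x
        history.append((x, proxy_reward(x), true_reward(x)))
    return history


if __name__ == "__main__":
    for x, proxy, true in optimize_proxy(10):
        print(f"x={x:4.1f}  proxy={proxy:5.2f}  true={true:5.2f}")
```

In this toy setup the proxy score rises at every step, while the true score peaks partway through and then falls. A real demonstration would need a learned reward model and a domain where humans cannot give an exact reward signal, which this sketch does not attempt.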
Break and redteam alignment proposals (especially those that will likely be used by major AI labs)
Many AGI researchers already know about the alignment problem, but they don’t expect it to be as difficult as we do. One reason for this is they often believe that current alignment proposals will be sufficient.
We think it’s useful for people to focus on (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems. (Often, critiques are already being made informally in office conversations or LessWrong posts, but they aren’t reaching key stakeholders at labs).
Examples of previous work:
Vivek Hebbar’s SERI MATS application questions break down how people can approach (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems
Nate Soares’s critiques of Eliciting Latent Knowledge, Shard Theory, and various other alignment proposals.
Critics of CIRL argue that it fails as an alignment solution due to the problem of fully updated deference. In a nutshell, the idea of CIRL is to induce corrigibility by maintaining uncertainty over the human’s values, and the failure mode is that once the model learns a sufficiently narrow distribution over the human’s values, it optimizes that in an unbounded fashion (see also the ACX post and Ryan Carey’s paper on this topic).
Critiques of RLHF argue that it selects for policies which involve deceiving the human giving feedback.
We would be especially excited for more breaking & redteaming projects that engage with proposals that AGI researchers think will work (e.g., RRM and RLHF). Ideally, these projects would present ideas that are legible to researchers[2] at AI labs & the ML community, and involve back-and-forth discussions between AGI researchers and safety teams at AI labs.
Organize coordination events
Events that get alignment researchers and AGI researchers together in the same room discussing AGI, core alignment difficulties, and alignment proposals.
On a small scale, such events could include safety talks at leading AGI labs. As an example of a more ambitious event, Anthropic’s interpretability retreat involved many alignment researchers and AGI researchers discussing interpretability proposals, their limitations, and some future directions.
Thinking even more ambitiously, there could be fellowships & residencies that bring together AGI researchers and the broader alignment community. Imagine a hypothetical training program run via a collaboration between an AI lab and an AI alignment organization. This program could help employees learn about the latest developments in large language models, learn about the importance of & latest developments in AI alignment research, and lead to collaborations/friendships between incoming AGI researchers and incoming alignment researchers.
One could also imagine programs for senior researchers that involve collaborations between experienced members of AI labs and experienced members of the AI alignment community.
Talks at OpenAI, DeepMind, etc. by various alignment researchers and thinkers
Note that there are downside risks of such programs, especially insofar as they could lead to new capabilities insights that accelerate AGI timelines. We think researchers should be extremely cautious about advancing capabilities and should generally keep such research private.
Support safety and governance teams at major AI labs
It will largely be the responsibility of safety and governance teams to push labs to not publish papers that differentially-advance-capabilities, maintain strong information security, invest in alignment research, use alignment strategies, and not deploy potentially dangerous models. As such, it’s really important that members of the AI alignment community support safety-conscious individuals at the labs (and consider joining the lab safety/governance teams).
Examples of teams that people could support:
OpenAI alignment or governance teams
DeepMind safety or governance teams
Anthropic alignment, interpretability, or governance teams
Teams at Google Brain, Meta AI research, Stability AI, etc.
Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here. People considering supporting lab teams are welcome to reach out to us to discuss this tradeoff.
Develop and promote reasonable safety standards for AI labs
Organizations that are concerned about alignment / x-risk can set robustly good standards that could propagate to other labs. Examples of policies that could work well here:
Infosecurity policies. For example, Conjecture’s Internal Infohazard policy is explicitly aimed at promoting cross-lab coordination and trust, and one of the hopes is that other organizations will publicly commit to similar policies.
Publication policies that reduce the spread of capabilities insights. For example, Anthropic has committed to not publish capabilities beyond the state of the art, and one of their hopes is to set an example that other labs could follow.
Cooperation agreements. For example, OpenAI has a merge clause in its charter: if someone else is close to building AGI, OpenAI will stop and assist with that project instead of racing. If more labs adopt this kind of agreement, this could counteract race dynamics and prevent a ‘race to the bottom’ in terms of alignment effort.
Other ideas
Benchmarks: Safety benchmarks can be a good way of incentivizing work on safety (and disincentivizing work on capabilities advances until certain safety benchmarks are met). For example, progress on OOD robustness does not generally come with capabilities externalities. (Note also Safe Bench is a contest by the Center for AI Safety trying to promote benchmark development.)
Competitions: Alignment competitions to get ML researchers thinking about safety problems. Contests that engage ML researchers could help them understand the difficulty of certain alignment subproblems (and could potentially help generate solutions to these problems). (Note that we’re about to launch AI Alignment Awards. We’re offering up to $100,000 for people who make progress on Goal Misgeneralization or Corrigibility.)
Overviews of open problems: In order to redirect work towards safety, it is useful to have regular papers outlining open problems for people to work on. Past examples of this include Concrete Problems in AI Safety and Unsolved Problems in ML Safety. Although it’s possible that progress on these problems directly contributes to alignment research, we think that the primary benefit of this kind of work will involve getting the mainstream ML research community more concerned about safety and AI x-risk, which ultimately influences major AI labs & slows down timelines.
X-risk analyses: We are excited about x-risk analyses described in this paper. We encourage more researchers (and AGI labs) to think explicitly about how their work could contribute to increasing or decreasing AGI x-risk.
Discussions about alignment between AGI leaders and members of the alignment community: We encourage more dialogues, discussions, and debates between AGI researchers/leaders and members of the alignment community.
Create new and better resources making the case for AGI x-risk. Current resources are decent, but we think that all of the existing ones have drawbacks. One of our favorite intro resources is Superintelligence, but it is from 2014, doesn’t have much about deep learning, and has nothing about transformers/LLMs/scaling.
We are grateful to Ashwin Acharya, Andrea Miotti, and Jakub Kraus for feedback on this post.
[2] Note that there are often trade-offs between legibility and other desiderata. We agree with concerns that Habryka brings up in this comment, and we think anyone performing ML outreach should be aware of these failure modes:
“I think one of the primary effects of trying to do more outreach to ML-researchers has been a lot of people distorting the arguments in AI Alignment into a format that can somehow fit into ML papers and the existing ontology of ML researchers. I think this has somewhat reliably produced terrible papers with terrible pedagogy and has then caused many people to become actively confused about what actually makes AIs safe (with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine). I am worried about seeing more of this, and I think this will overall make our job of actually getting humanity to sensibly relate to AI harder, not easier.”
Ways to buy time
In our last post, we claimed:
But what does “buying time” actually look like? In this post, we list some interventions that have the potential to buy time (some of which also have other benefits, like increasing coordination, accelerating community growth, and reducing the likelihood that labs deploy dangerous systems).
If you are interested in any of these, please reach out to us. Note also that Thomas has a list of specific technical projects (with details about how they would be implemented), and he is looking for collaborators.
Summary
Ideas to buy time:
Direct outreach to AGI researchers
Develop new resources that make AI x-risk arguments & problems more concrete
Demonstrate concerning capabilities & alignment failures
Break and redteam alignment proposals (especially those that will likely be used by major AI labs)
Organize coordination events
Support safety and governance teams at major AI labs
Develop and promote reasonable safety standards for AI Labs
Other ideas
Summary table:
Intervention
Why it buys time
Other possible benefits
Technical or non-technical
Direct outreach (through written resources and 1-1 conversations)
Some ML/AGI researchers haven’t heard the core arguments around AI x-risk. Engaging with written resources will cause some of them to be more concerned about AI x-risk.
Some ML/AGI researchers have heard the core arguments. 1-1 conversations can allow safety researchers to better understand & address their cruxes (and vice-versa)
Generates new critiques of existing alignment ideas, arguments, and proposals
More people do alignment research
Increased trust and coordination between labs and non-lab safety researchers.
If an AI lab is about to deploy a system that could destroy the world, a compelling demo might convince them not to deploy.
More people do alignment research
Increased trust and coordination between labs and non-lab safety researchers.
New critiques of existing alignment ideas, arguments, and proposals
Lab safety standards
[Ignore this part. We needed to put filler text here to format the table properly; for some reason the table looks better when there is a lot of text here.]
Increased trust and coordination between labs and non-lab safety researchers.
Some standards could reduce the likelihood that labs deploy dangerous systems (e.g., a policy that a system must first pass an interpretability check or deception check).
Disclaimers
Feel free to skip this section if you’re interested in learning more about our proposed “buying time” ideas.
Disclaimer #1: Some of these interventions assume that timelines are largely a function of the culture at major AI labs. More specifically, we expect that timelines are largely a function of (a) the extent to which leaders and researchers at AI labs are concerned about AI x-risk and (b) the extent to which they have concrete interventions they can implement to reduce AI x-risk, and (c) how costly it is to implement those interventions.
Disclaimer #2a: We don’t spend much time arguing which of these interventions are most impactful. This is partly because many of these need to be executed by people with specific skill sets, so personal fit considerations will be especially relevant.
Nonetheless, we currently think that the following three areas are the most important:
Direct outreach to AGI researchers (more here)
Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Organize coordination events (more here)
Disclaimer #2b: The most important interventions do not necessarily need the most people. As an example, 1-2 (highly competent) teams organizing coordination events is likely sufficient to saturate the space, whereas we could see 5+ teams working on demonstrating alignment failures. Additionally, projects with minimal downside risks are best-suited to absorb the most people.
We currently think that the following three projects could absorb lots of talented people:
Demonstrate concerning behavior & alignment failures in current (and future) models (more here)
Develop new resources that make AI x-risk arguments & problems more concrete (more here)
Break and redteam alignment proposals (more here)
Disclaimer #3: Many of these interventions have serious downside risks. We also think many of them are difficult, and they only have a shot at working if they are executed extremely well by people who have (a) strong models of downside risks, (b) the ability to notice when their work might be accelerating AGI timelines, and (c) the ability to notice when their work is reducing their ability to think well & see the world clearly. See also Habryka’s comment here (and note that although we’re still excited about more people considering this kind of work, we agree with many of the concerns he lists, and we think people should understand these concerns deeply before performing this kind of work. Please feel free to reach out to us before doing anything risky).
Disclaimer #4: Many of these interventions have large benefits other than buying time. For the most part, we think that the main benefit from most of these interventions is their effect on buying time, but we won’t be presenting those arguments here.
Disclaimer #5: We have several “background assumptions” that inform our thinking. Some examples include (a) somewhat short AI timelines (AGI likely developed in 5-15 years), (b) high alignment difficulty (alignment by default is unlikely, and current approaches seem unlikely to work), and (c) there has been some work done in each of these areas, but there are opportunities to do things that are much more targeted & ambitious than previous/existing projects.
Disclaimer #6: This was written before the FTX crisis. We think the points still stand.
Ideas to buy time
Direct outreach to AGI researchers
The case for AGI risk is relatively nuanced and non-obvious. There is value in raising awareness about basic arguments about AI x-risk and why alignment might fail by default. This makes it easier for people to quickly understand the concerns for AI x-risk, which means that more people will buy-in to alignment being hard.
Examples of work that we’d be excited to see disseminated more widely:
Superintelligence
Many MIRI analyses, including this talk and the 2022 MIRI Discussion
AGI safety fundamentals
The case for taking AI seriously as a threat to humanity
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Resources that make AI x-risk arguments more concrete (see here)
Note these resources, while a lot better than nothing, are still pretty far from ideal. In particular, we wish that there was an accessible version of AGI Ruin.
Additionally, written resources are often not sufficient to address the cruxes and worldviews of people who are performing AGI research. Individualized conversations between AGI researchers and knowledgeable alignment researchers could help address cruxes around safety.
It is important for these conversations to be conducted by people who are deeply familiar with AI alignment arguments and also have a strong understanding of the ML/AGI community. However, we think that non-technical people could play an important role in organizing these efforts (e.g., by setting up conversations between safety researchers and AGI researchers, setting up talks for technical people to give at major AI labs, and doing much of the logistics/ops/non-technical work required to coordinate an ambitious series of outreach activities).
Disclaimer: there is also a lot of downside risk here. Doing this type of outreach without adequate preparation or respect may cause the community to lose the respect of AGI researchers or make people confused about AI x-risk concerns. We encourage people interested in this work to reach out to us. We also suggest this post by Vael Gates and this post by Maris Hobbhahn. Note also that these posts focus on outreach to ML academics, whereas we’re most excited about well-conducted outreach efforts to AGI researchers at leading AI labs.
Develop new resources that make AI x-risk arguments & problems more concrete
Many of the existing AI x-risk resources focus on theoretical/conceptual arguments. Additionally, many of them were written before we knew much about deep learning or large language models.
Some people find these philosophical arguments compelling, but others demand evidence that is more concrete, more grounded in empirical research, and more rooted in the “engineering mindset.”
We believe there is a clear gap in the AI x-risk space right now: many theoretical and conceptual arguments can be discussed in the context of present-day AI systems, concretizing and strengthening the case for AI x-risk.
By creating better AGI risk resources, we can (a) find new alignment researchers and (b) get people who are building AGI to be more cautious and more safety-focused.
Examples of this work include:
Goal Misgeneralization in Deep Reinforcement Learning and Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals found concrete examples of inner misalignment in reinforcement learning settings.
Optimal Policies Tend to Seek Power formally demonstrated how power is an instrumentally convergent goal.
Scaling laws for Reward Model Overoptimization explored how putting too much optimization pressure on imperfect reward proxies resulted in failure to generalize, as is predicted by Goodhart’s law.
Specification gaming: the flip side of AI ingenuity described how often the reward functions given to RL agents can be ‘gamed’ — the RL agent can take actions that achieve high reward but do not achieve the intended outcome of the designer.
Why AI alignment could be hard with modern deep learning describes how modern deep learning methods are likely to favor unaligned AI
The alignment problem from a deep learning perspective grounds the alignment problem in a deep learning perspective
X-risk analysis for AI research presents a concrete checklist that ML researchers can use to evaluate their research from an X-risk perspective
Demonstrate concerning capabilities & alignment failures
Another way to make AI x-risk ideas more concrete is to actually observe problems in existing models. As an example, we might detect power-seeking behavior or deceptive tendencies in large language models. We’re sympathetic to the idea that some alignment failures may not occur until we get to AGI (especially if we expect a sharp increase in capabilities). But it seems plausible that at least some alignment failures could be identified with sub-AGI models.
Examples of this kind of work:
The Evaluations Project (led by Beth Barnes)
Beth’s team is trying to develop evaluations that help us understand when AI models might be dangerous.
We consider this to be a sufficiently ambitious intervention. If a reasonable evaluation were developed and Magma[1] decided to implement it, it could improve Magma’s ability to identify dangerous models. As a toy example, imagine Magma is about to deploy a model. Before doing so, they contact ARC, ARC runs an eval tool, and the tool reveals that the model (a) has the power to change the world, (b) actively deceives humans in certain contexts, or (c) learns incorrect goals when deployed out-of-distribution. This result could then lead Magma to delay deployment and work with ARC [and others] to figure out how to improve the model.
Encultured (led by Andrew Critch)
Encultured is trying to develop a video game that could serve as “an amazing sandbox for play-testing AI alignment & cooperation.”
From a “buying time” perspective, the theory of change seems very similar to that of the evals project. Encultured’s video game essentially serves as an eval: Magma could deploy its AI system in the game, detect undesirable behavior (e.g., power-seeking, deception), and then delay deployment of its powerful model while trying to improve the model’s safety/alignment.
Ideas for new projects in this vein include:
A thorough analysis of whether deception achieves higher reward from RLHF and is therefore selected for.
Empirical demonstrations of various instrumentally convergent goals like self-preservation, power-seeking, and self-improvement. This could be especially interesting in a chain of thought language model that is operating at a high capabilities level and for which you can see the model’s reasoning for selecting instrumentally convergent actions. (Tamera Lanham’s externalized reasoning oversight agenda is an example of good work in this direction)
An empirical analysis of Goodharting. Find a domain for which humans can’t give an exact reward signal, and then demonstrate the difficulties that arise when working with an imperfect reward signal. This is similar to specification gaming and would build on that work.
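To make the Goodharting idea above a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than a result from any of the work cited in this post: the names `proxy_reward`, `true_reward`, and `sweep`, the quadratic penalty, and the step grid are all made up. It only shows the qualitative pattern such a project would look for: the proxy keeps improving under more optimization pressure, while the intended objective peaks and then declines.

```python
# Toy illustration of Goodharting. All functional forms are hypothetical.


def proxy_reward(x: float) -> float:
    """Imperfect reward signal: it only measures the feature being pushed."""
    return x


def true_reward(x: float) -> float:
    """Intended objective: the feature helps at first, then over-optimization hurts."""
    return x - 0.1 * x * x


def sweep(max_steps: int = 10, step_size: float = 1.0):
    """Apply increasing optimization pressure on the proxy and record both rewards."""
    rows = []
    for k in range(max_steps + 1):
        x = k * step_size  # more steps means more pressure on the proxy
        rows.append((k, proxy_reward(x), true_reward(x)))
    return rows


if __name__ == "__main__":
    for k, proxy, true in sweep():
        print(f"steps={k:2d}  proxy={proxy:5.2f}  true={true:5.2f}")
```

A real empirical analysis would replace these hand-written functions with a learned reward model and a trained policy; the sketch is only meant to show what "the proxy goes up while the true objective goes down" looks like in the simplest possible setting.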
Break and red team alignment proposals (especially those that will likely be used by major AI labs)
Many AGI researchers already know about the alignment problem, but they don’t expect it to be as difficult as we do. One reason is that they often believe current alignment proposals will be sufficient.
We think it’s useful for people to focus on (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems. (Often, critiques are already being made informally in office conversations or LessWrong posts, but they aren’t reaching key stakeholders at labs).
Examples of previous work:
Vivek Hebbar’s SERI MATS application questions break down how people can approach (a) finding problems with existing alignment proposals and (b) making stronger arguments about already-known problems
Nate Soares’s critiques of Eliciting Latent Knowledge, Shard Theory, and various other alignment proposals.
Critics of CIRL argue that it fails as an alignment solution due to the problem of fully updated deference. In a nutshell, the idea of CIRL is to induce corrigibility by maintaining uncertainty over the human’s values, and the failure mode is that once the model learns a sufficiently narrow distribution over the human’s values, it optimizes that in an unbounded fashion (see also the ACX post and Ryan Carey’s paper on this topic).
Critiques of RLHF argue that it selects for policies which involve deceiving the human giving feedback.
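The fully-updated-deference concern mentioned above can be illustrated with a small numerical sketch. This is a simplified, off-switch-style toy under our own assumptions, not the critics' formal argument: the agent's belief about the utility U of an action is taken to be Gaussian, the human is assumed to permit the action only when U > 0, and the function names are hypothetical. Under those assumptions, the benefit of deferring to the human shrinks toward zero as the agent's uncertainty (`sigma`) narrows, so any small cost of deferring eventually outweighs it.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def value_of_deferring(mu: float, sigma: float) -> float:
    """E[max(U, 0)] for U ~ Normal(mu, sigma^2).

    Assumption: the human knows U and lets the action proceed only if U > 0.
    """
    z = mu / sigma
    return mu * normal_cdf(z) + sigma * normal_pdf(z)


def deference_incentive(mu: float, sigma: float) -> float:
    """Gain from deferring, relative to the agent's best unilateral choice.

    Acting alone is worth mu in expectation; doing nothing is worth 0.
    """
    return value_of_deferring(mu, sigma) - max(mu, 0.0)


if __name__ == "__main__":
    # As the belief narrows, the incentive to defer falls toward zero.
    for sigma in (2.0, 1.0, 0.5, 0.1):
        print(f"sigma={sigma:4.1f}  incentive={deference_incentive(1.0, sigma):.6f}")
```

In this toy, the agent stays deferential only while it is meaningfully uncertain about the human's values, which is the shape of the failure mode the critics describe.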
We would be especially excited for more breaking & redteaming projects that engage with proposals that AGI researchers think will work (e.g., RRM and RLHF). Ideally, these projects would present ideas that are legible to researchers[2] at AI labs & the ML community, and involve back-and-forth discussions between AGI researchers and safety teams at AI labs.
Organize coordination events
Events that get alignment researchers and AGI researchers together in the same room discussing AGI, core alignment difficulties, and alignment proposals.
On a small scale, such events could include safety talks at leading AGI labs. As an example of a more ambitious event, Anthropic’s interpretability retreat involved many alignment researchers and AGI researchers discussing interpretability proposals, their limitations, and some future directions.
Thinking even more ambitiously, there could be fellowships & residencies that bring AGI researchers and the broader alignment community together. Imagine a hypothetical training program run via a collaboration between an AI lab and an AI alignment organization. This program could help employees learn about the latest developments in large language models, learn about the importance of & latest developments in AI alignment research, and lead to collaborations/friendships between incoming AGI researchers and incoming alignment researchers.
One could also imagine programs for senior researchers that involve collaborations between experienced members of AI labs and experienced members of the AI alignment community.
Examples of previous work:
AI safety conferences in Puerto Rico by FLI
Singularity Summits by MIRI
Interpretability retreat by Anthropic
Talks at OpenAI, DeepMind, etc. by various alignment researchers and thinkers
Note that there are downside risks of such programs, especially insofar as they could lead to new capabilities insights that accelerate AGI timelines. We think researchers should be extremely cautious about advancing capabilities and should generally keep such research private.
Support safety and governance teams at major AI labs
It will largely be the responsibility of safety and governance teams to push labs to avoid publishing papers that differentially advance capabilities, maintain strong information security, invest in alignment research, use alignment strategies, and refrain from deploying potentially dangerous models. As such, it’s really important that members of the AI alignment community support safety-conscious individuals at the labs (and consider joining the lab safety/governance teams).
Examples of teams that people could support:
OpenAI alignment or governance teams
DeepMind safety or governance teams
Anthropic alignment, interpretability, or governance teams
Teams at Google Brain, Meta AI research, Stability AI, etc.
Note that people should be aware of the risk that alignment-concerned people joining labs can lead to differential increases in capabilities, as reported here. People considering supporting lab teams are welcome to reach out to us to discuss this tradeoff.
Develop and promote reasonable safety standards for AI labs
Organizations that are concerned about alignment / x-risk can set robustly good standards that could propagate to other labs. Examples of policies that could serve this purpose:
Infosecurity policies. For example, Conjecture’s Internal Infohazard policy is explicitly aimed at promoting cross-lab coordination and trust, and one of the hopes is that other organizations will publicly commit to similar policies.
Publication policies that reduce the spread of capabilities insights. For example, Anthropic has committed to not publish capabilities beyond the state of the art, and one of their hopes in this is to set an example that other labs could follow.
Cooperation agreements. For example, OpenAI has a merge clause in their charter: if someone else is close to building AGI, they will stop and assist with that project instead of racing. If more organizations adopt this kind of agreement, it can counteract race dynamics and prevent a ‘race to the bottom’ in terms of alignment effort.
Other ideas
Benchmarks: Safety benchmarks can be a good way of incentivizing work on safety (and disincentivizing work on capabilities advances until certain safety benchmarks are met). For example, progress on OOD robustness does not generally come with capabilities externalities. (Note also Safe Bench is a contest by the Center for AI Safety trying to promote benchmark development.)
Competitions: Alignment competitions to get ML researchers thinking about safety problems. Contests that engage ML researchers could help them understand the difficulty of certain alignment subproblems (and could potentially help generate solutions to these problems). (Note that we’re about to launch AI Alignment Awards. We’re offering up to $100,000 for people who make progress on Goal Misgeneralization or Corrigibility.)
Overviews of open problems: In order to redirect work towards safety, it is useful to have regular papers outlining open problems for people to work on. Past examples of this include Concrete Problems in AI Safety and Unsolved Problems in ML Safety. Although it’s possible that progress on these problems directly contributes to alignment research, we think that the primary benefit of this kind of work will involve getting the mainstream ML research community more concerned about safety and AI x-risk, which ultimately influences major AI labs & slows down timelines.
X-risk analyses: We are excited about x-risk analyses described in this paper. We encourage more researchers (and AGI labs) to think explicitly about how their work could contribute to increasing or decreasing AGI x-risk.
Discussions about alignment between AGI leaders and members of the alignment community: We encourage more dialogues, discussions, and debates between AGI researchers/leaders and members of the alignment community.
Create new and better resources making the case for AGI x-risk. Current resources are decent, but we think that all of the existing ones have drawbacks. One of our favorite intro resources is Superintelligence, but it is from 2014, has little about deep learning, and has nothing about transformers/LLMs/scaling.
We are grateful to Ashwin Acharya, Andrea Miotti, and Jakub Kraus for feedback on this post.
[1] We use “Magma” to refer to a (fictional) leading AI lab that is concerned about safety (see more here).
[2] Note that there are often trade-offs between legibility and other desiderata. We agree with concerns that Habryka brings up in this comment, and we think anyone performing ML outreach should be aware of these failure modes:
“I think one of the primary effects of trying to do more outreach to ML-researchers has been a lot of people distorting the arguments in AI Alignment into a format that can somehow fit into ML papers and the existing ontology of ML researchers. I think this has somewhat reliably produced terrible papers with terrible pedagogy and has then caused many people to become actively confused about what actually makes AIs safe (with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine). I am worried about seeing more of this, and I think this will overall make our job of actually getting humanity to sensibly relate to AI harder, not easier.”