Instead of technical research, more people should focus on buying time
This post is the first in a sequence of posts about AI strategy co-authored by Thomas Larsen, Akash Wasil, and Olivia Jimenez (TAO). In the next post, we’ll provide more examples of “buying time” interventions that we’re excited about.
We’re grateful to Ajeya Cotra, Daniel Kokotajlo, Ashwin Acharya, and Andrea Miotti for feedback on this post.
If anyone is interested in working on “buying time” interventions, feel free to reach out. (Note that Thomas has a list of technical projects with specifics about how to implement them. We also have a list of non-technical projects).
Summary
A few months ago, when we met technical people interested in reducing AI x-risk, we nearly always encouraged them to try to solve what we see as the core challenges of the alignment problem (e.g., inner misalignment, corrigibility, interpretability that generalizes to advanced systems).
But we’ve changed our mind.
On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing).
To state the claim another way: on the margin, more researchers should backchain from “how do I make AGI timelines longer, make AI labs more concerned about x-risk, and present AI labs with clear things to do to reduce x-risk”, instead of “how do I solve the technical alignment problem?”
Some “buying time” interventions involve performing research that makes AI safety arguments more concrete or grounds them in ML (e.g., writing papers like Goal misgeneralization in deep reinforcement learning and Alignment from a deep learning perspective & discussing these with members of labs).
Some “buying time” interventions involve outreach and engagement with capabilities researchers, leaders in AI labs, and (to a lesser extent) the broader ML community (e.g., giving a presentation to a leading AI lab about power-seeking and deception). We expect that successful outreach efforts will also involve understanding the cruxes/counterarguments of the relevant stakeholders, identifying limitations of existing arguments, and openly acknowledging when the AI safety community is wrong/confused/uncertain about certain points.
We are excited about “buying time” interventions for five main reasons:
Multiplier effects: Delaying timelines by 1 year gives the entire alignment community an extra year to solve the problem.
End time: Some buying time interventions give the community a year at the end, where we have the most knowledge about the problem, access to near-AGI systems, the largest community size, the broadest network across other influential actors, and the most credibility at labs. Buying end time also increases the amount of serial alignment research, which some believe to be the bottleneck. We discuss this more below.
Comparative advantage: Many people would be better-suited for buying time than technical alignment work (see figure 1).
Buying time for people at the tails: We expect that alignment research is extremely heavy-tailed. A median researcher who decides to buy time is buying time for people at the tails, which is (much) more valuable than the median researcher’s time.
Externalities/extra benefits: Many buying time interventions have additional benefits (e.g., improving coordination, reducing race dynamics, increasing the willingness to pay a high alignment tax, getting more people working on alignment research).
Figure 1: Impact by percentile for technical alignment research and buying time interventions.
Caption: We believe that impact in technical alignment research is more heavy-tailed than impact from “buying time” interventions; Figure 1 illustrates this belief. Note that this is a rough approximation. Note also that both curves should extend below zero on the y-axis, as both kinds of interventions can be net negative.
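To make the shape of this claim concrete, here is a minimal plotting sketch of the kind of curves we have in mind. The distributions and parameters are purely illustrative assumptions (not fit to any data), and the sketch omits the negative-impact region mentioned in the caption.

```python
# Illustrative only: hypothetical impact-by-percentile curves, not fit to any data.
# We model "impact" with lognormal quantile functions and make technical alignment
# research heavier-tailed (larger shape parameter) than buying-time interventions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm

percentiles = np.linspace(0.01, 0.99, 99)

# Hypothetical spread parameters: alignment research assumed heavier-tailed.
alignment_impact = lognorm.ppf(percentiles, s=2.0)    # heavy tail
buying_time_impact = lognorm.ppf(percentiles, s=0.8)  # lighter tail

plt.plot(percentiles * 100, alignment_impact, label="Technical alignment research")
plt.plot(percentiles * 100, buying_time_impact, label="Buying time interventions")
plt.xlabel("Researcher percentile")
plt.ylabel("Impact (arbitrary units)")
plt.legend()
plt.show()
```

The lognormal here is just a convenient stand-in for “heavier vs. lighter tails”; we are not claiming that impact is literally lognormally distributed.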
Concretely, we recommend that ~40-60% of alignment researchers should focus on “buying time” interventions rather than technical alignment research (whereas we currently think that only ~20% are focusing on buying time). We also recommend that ~20-40% of community-builders focus on “buying time” interventions rather than typical community-building (whereas we currently think that ~10% are focusing on buying time).
In the rest of the post, we:
Offer some disclaimers and caveats (here)
Elaborate on the reasons why we’re excited about “buying time” interventions (here)
Describe some examples of “buying time” interventions (here)
Explain our theory of change in greater detail (here)
Describe some potential objections & our responses (here)
Describe some changes we recommend (here)
Disclaimers
Disclaimer #1: We’re not claiming that “buying time” is the only way to categorize the kinds of interventions we describe, and we encourage readers to see if they can come up with alternative frames/labels. Many “buying time” interventions also have other benefits (e.g., improving coordination, getting more people to work on AI safety research, and making it less likely that labs deploy dangerous systems). We chose to go with the “buying time” frame for two main reasons:
For nearly all of the interventions we describe, we think that most of the benefit comes from buying time, and these other benefits are side benefits. One important exception to this is that much of the impact of evals/demos may come from their ability to prevent labs from deploying dangerous systems. This buys time, but it’s plausible that the main benefit is “the world didn’t end”.
We have found backchaining from “buying time” more useful than other frames we brainstormed. Some alternative frames have felt too limiting (e.g., “outreach to the ML community” doesn’t cover some governance interventions).
Disclaimer #2: Many of these interventions have serious downside risks. We also think many of them are difficult, and they only have a shot at working if they are executed extremely well.
Disclaimer #3: We have several “background assumptions” that inform our thinking. Some examples include (a) somewhat short AI timelines (AGI likely developed in 5-15 years), (b) high alignment difficulty (alignment by default is unlikely, and current approaches seem unlikely to work), and (c) there has been some work done in each of these areas, but we are far behind what we would expect in winning worlds, and there are opportunities to do things that are much more targeted & ambitious than previous/existing projects.
Disclaimer #4: Much of our thinking is informed by conversations with technical AI safety researchers. We have less experience interacting with the governance community and even less experience thinking about interventions that involve the government. It’s possible that some of these ideas are already widespread among EAs who focus on governance interventions, and a lot of our arguments are directed at the thinking we see in the technical AI safety community.
Why are we so excited about “buying time” interventions?
Large upsides of buying time
Some time-buying interventions buy a year at the end. Suppose capabilities growth continues as normal until a lab is about to deploy an AI model that would develop into transformative AI (TAI), but an evaluation triggers and reveals misaligned behavior, causing the lab to slow down and warn the other labs. The time from this event until AGI is deployed is very valuable for the following reasons:
1. You have bought this time for the entire AI safety community[1].
2. More researchers: The number of alignment researchers is growing each year, so we expect to have the most alignment researchers at the end.
3. Better understanding of alignment: We understand more about the alignment problem each year, and the field becomes less pre-paradigmatic. This makes it easier to make progress each year.
4. Serial time: Buying time increases the amount of serial alignment research, which could be the bottleneck.
5. AGI-assisted alignment: Some alignment agendas involve using AI assistants to boost alignment research. It’s plausible that a year of alignment research with AI assistants is 5-10X more valuable than a year of alignment research right now. If we’re able to implement interventions that buy time once we have powerful AI assistants, this intervention would be especially valuable (assuming that these assistants can make differential alignment progress or that we can buy time once we have the assistants).
6. Better understanding of architecture: When we are close to AGI, we will have a better understanding of the architectures and training paradigms used to actually build AGI, allowing alignment solutions to be much more concrete and informed.
Other interventions have a different shape and do not buy time that is as valuable. If you simply slow the rate of capabilities progress through publication policies, or convince some capabilities researchers to switch to other work, such that on net AGI arrives one year later, this has benefits 1-4 but not 5 and 6. However, since we are proposing to reallocate alignment researchers toward buying-time interventions, less alignment research gets done now, so reason 3 may be somewhat weaker.
Some interventions that buy time involve coordinating with members of major AI labs (e.g., OpenAI, DeepMind, Anthropic). As a result, these interventions often have the additional benefit of increasing communication, coordination, trust, and shared understanding between major AI labs and members of the AI safety community who are not part of the AI labs. (Note that this is not true of all “buying time” interventions, and several of them could also lead to less coordination or less trust).
Tractability and comparative advantage: lots of people can have a solid positive impact by buying time, while fewer can do great alignment work
Buying a year buys a year for researchers at the tails. If you buy a year of time, you buy a year of time for some of the best alignment researchers. It’s plausible to us that researchers at the tails are >50-100X more valuable than median researchers (see the toy calculation at the end of this subsection).
Many projects that are designed to buy time require different skills than technical AI safety research.
Skills that seem uniquely valuable for buying time interventions: general researcher aptitudes, ability to take existing ideas and strengthen them, experimental design skills, ability to iterate in response to feedback, ability to build on the ideas of others, ability to draw connections between ideas, experience conducting “typical ML research,” strong models of ML/capabilities researchers, strong communication skills
Skills that seem uniquely valuable for technical AI safety research: abstract thinking, ability to work well with very little structure or guidance, ability to generate and formalize novel ideas, focus on “the hard parts of the problem”, ability to be comfortable being confused for long periods of time.
Skills that seem roughly as useful in both: Strong understanding of AI safety material, machine learning knowledge.
Alignment research seems heavy-tailed. It’s often easy to identify whether or not someone has a reasonable chance of being at the tail (e.g., after 6-12 months of trying to solve alignment). People who are somewhat likely to be at the tail should keep doing alignment research; other people should buy time.
A reasonable counterpoint is that “buying time” might also be heavy-tailed. However, we currently expect it to be less heavy-tailed than alignment research. It seems plausible to us that many “median SERI-MATS scholars” could write papers like the goal misgeneralization paper, explain alignment difficulties in clearer and more compelling ways, conduct (or organize) high-quality outreach and coordination activities, and perform many other interventions we’re excited about. On the other hand, we don’t expect that “median SERI-MATS scholars” would be able to make progress on heuristic arguments, create their own alignment agendas, or come up with other major conceptual advancements.
Nonetheless, a lot of the argument depends on the specific time-buying intervention and the specific alignment research project. It seems plausible to us that some of the most difficult time-buying interventions are more heavy-tailed than some of the more straightforward alignment research projects (e.g., coming up with good eval tools and demos might be more heavy-tailed than performing interpretability experiments).
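As a toy illustration of the “buying time for people at the tails” argument above, here is a back-of-the-envelope calculation. Every number in it is a made-up assumption chosen for illustration (community size, number of tail researchers, tail multiplier), not an estimate from our BOTEC, and it ignores the probability that any given person’s work actually buys the year.

```python
# Toy calculation: value of one bought year, measured in median-researcher-years.
# Every number below is an illustrative assumption, not an estimate from our BOTEC.
n_median_researchers = 290   # assumed number of non-tail alignment researchers
n_tail_researchers = 10      # assumed number of tail researchers
tail_multiplier = 50         # assumed value of a tail researcher relative to a median one

value_of_one_bought_year = n_median_researchers + n_tail_researchers * tail_multiplier
print(value_of_one_bought_year)  # 790 median-researcher-years, vs. the ~1 that the median
                                 # researcher would have contributed by staying in research
```

The point of the sketch is just that, under a heavy-tailed view, most of the value of a bought year comes from the tail researchers; the probability of actually buying the year is handled separately in the BOTEC mentioned at the end of this post.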
What are some examples of “buying time” interventions?
The next post in this sequence (rough draft here) outlines more concrete interventions that we are excited about in this space, but we briefly highlight three interventions here that are especially exciting to us.
Outreach efforts that involve interactions between the AI safety community and (a) members of AI labs + (b) members of the ML community.
Some specific examples:
More conferences that bring together researchers from different groups who are working on similar topics (e.g., Anthropic recently organized an interpretability retreat with members from various different AI labs and AI alignment organizations).
More conferences that bring together strategy/governance thinkers from different groups (e.g., Olivia and Akash recently ran a small 1-day strategy retreat with a few members from AI labs and other parts of the AI safety community).
Discussions like the MIRI 2021 conversations, except with a greater emphasis on engaging with researchers and decision-makers at major AI labs by directly touching on their cruxes.
Collaborations on interventions that involve coordinating with AI labs (e.g., figuring out if there are ways to collaborate on research projects, efforts to implement publication policies and information-sharing agreements, efforts to monitor new actors that are developing AGI, etc.)
More ML community outreach. Examples include projects by the Center for AI Safety (led by Dan Hendrycks) and AIS field-building hub (led by Vael Gates).
The Evaluations Project (led by Beth Barnes)
Beth’s team is trying to develop evaluations that help us understand when AI models might be dangerous. The path to impact is that an AI lab uses the eval tool on the advanced models it trains, and the eval leads the lab to delay deployment of any model that exhibits scary behavior. In an ideal world, the results would be compelling enough that multiple AI labs slow down, potentially extending timelines by multiple years.
Papers that take theoretical/conceptual safety ideas and ground them in empirical research.
Specific examples of this type of work include Lauro Langosco’s goal misgeneralization paper (which shows how an RL agent can appear to learn goal X but actually learn goal Y) and Alex Turner’s optimal policies tend to seek power paper. Theoretical alignment researchers had already proposed that agents could learn unintended goals and that agents would have incentives to seek power. The papers by Lauro and Alex take these theoretical ideas (which are often perceived as fuzzy and lacking concreteness), formalize them more crisply, and offer examples of how they affect modern ML systems.
We think that this buys time primarily by convincing labs and academics of alignment difficulty. In the next section, we give more detail on the theory of change.
Theory of Change
In the previous section, we talked about why we were so excited about buying time interventions. However, the interventions we have in mind often have a number of other positive impacts. In this section, we provide more detail about these impacts as well as why we think these impacts end up buying time.
We summarize our theory of change in the following diagram:
Labs take AGI x-risk seriously + Labs have concrete things they can do → More Time
We think that timelines are largely a function of (a) the extent to which leaders and researchers at AI labs are concerned about AI x-risk and (b) the extent to which they have solutions that can be (feasibly) implemented.
If conditions (a) and (b) are met, we expect the following benefits:
Less capabilities research. There is less scaling and less algorithmic progress.
More coordination between labs. There are more explicit and trusted agreements to help each other with safety research, avoid deploying AGI prematurely, and avoid racing.
Less publishing of capabilities advances. Capabilities knowledge stays siloed, so when one lab discovers something, it doesn’t get used by the rest of the world. For example, PaLM claims a 15% speedup from parallelizing layers (an idea published before PaLM); if this insight hadn’t been published, PaLM would likely have been ~15% slower.
Labs are less likely to deploy AGI or scale existing models.
These all lead to relative slowdowns of AGI timelines, giving everyone more time to solve the alignment problem.
Benefits other than buying time
Many of the interventions we describe also have benefits other than buying time. We think the most important ones are:
Willingness to pay a higher alignment tax: Concern about alignment going poorly means that labs invest more resources into safety. One obvious example: a computationally expensive solution to alignment becomes much more likely to be implemented if labs recognize how important it is. Concretely, we are substantially more excited about worlds in which the lab building AGI actually implements all of the interventions described in Holden’s How might we align transformative AI if it’s developed very soon?. By default, if AGI were developed in the next few years at OpenAI or DeepMind, we would put ~20% on each of these solutions actually being used.
Less likely to deploy dangerous systems: If evaluations are used and safety standards are implemented, labs could catch misaligned behavior and decide not to deploy systems that would have ended the world. This lengthens timelines, but it also has the more direct impact of literally saving the world (at least temporarily). We expect this to substantially reduce the probability of a naive accident (a lab deploying a misaligned system without realizing it).
More alignment research: If labs are convinced by AI safety arguments, they may shift more of their (capabilities) researchers toward alignment issues.
Some objections and our responses
1. There are downside risks from low-quality outreach and coordination efforts with AI labs
Response: We agree. Members of AI labs have their own opinions about AI safety; efforts to come in and proselytize are likely to fail. We think the best efforts will be conducted by people who have (a) a strong understanding of technical AI safety arguments, (b) strong interpersonal skills and the ability to understand different perspectives, (c) caution and good judgment, and (d) collaborators or advisors who can help them understand the space. However, this depends on the intervention. Caution is especially warranted for direct outreach that involves interacting with capabilities researchers, whereas more technical work, such as empirically grounding alignment arguments, mostly just requires technical skill.
2. Labs perceive themselves to be in a race, so they won’t slow down.
Response #1: We think that some of the concrete interventions we have in mind contribute to coordination and reduce race dynamics. In particular, efforts to buy time by conveying the difficulty of alignment could lead multiple players to become more concerned about x-risk (causing all of the leading labs to slow down).
Furthermore, we’re optimistic that sufficiently well-executed coordination events could lead to increased trust and potentially concrete agreements between labs. We think that two primary drivers of race dynamics are differences in values (company A is worried that company B would not use AI responsibly) and worries about accident risk (company A is worried that company B’s AI is likely to be unaligned).
Conversely, to the extent that A and B are value-aligned, each is aware that the other is taking reasonable safety precautions, and leaders at both companies trust each other, they are less incentivized to race. Coordination events could help with each of these factors.
Response #2: Some interventions don’t reduce race dynamics (e.g., slowing down the leading lab). These are high EV in worlds where the safety-conscious lab (or labs) has a sizable advantage. On the margin, we think more people should be investing in these interventions, but they should be deployed more carefully (ideally after some research has been conducted to compare the upside of buying time to the downside of increasing race dynamics).
3. Labs being more concerned about safety isn’t that helpful. They already care; they just lack solutions.
Response: Our current impression is that many leaders at major AGI labs are concerned about safety. However, we don’t think everyone is safety-conscious, and we think there are some policies that labs could adopt to buy time (e.g., adopting publication policies that reduce the rate of capabilities papers).
4. Slowing down ODA+[2] could increase the chance that a new (and less safety-conscious actor) develops AGI.
Response #1: Our current best guess is that ODA+ has a >6-month lead over less safety-conscious competitors. However, this is fairly sensitive to timelines. If scale is critical, one would expect a small number of very large projects to be in the lead for AGI, and differentially slowing the most receptive / safety-oriented / cautious of those labs seems net negative. However, interventions that slow the whole field, such as a blanket slowdown in publishing or increasing the extent to which all labs are safety-oriented, seem robustly good.
Response #2: Even if ODA+ does not have a major lead, many of the interventions (like third-party audits) could scale to new AGI developers too. For example, if there is a culture of audits in the field, and pressure to participate in them, talented researchers may be unwilling to work for a lab that doesn’t participate, and a regulatory regime could later be attached to these norms.
Response #3: Some interventions to buy time increase the lead of existing labs over new entrants and slow research overall (e.g., making it more difficult for new players to enter the space; compelling evals or concretizations of alignment difficulties could cause many labs to slow down).
Response #4: Under our current model, most P(doom) comes from not having a solution to the alignment problem. So we’re willing to trade some P(solution gets implemented) and some P(AGI is aligned to my values) in exchange for a higher P(we find a solution). However, we acknowledge that there is a genuine tradeoff here, and given the uncertainty of the situation, Thomas thinks that this is the strongest argument against buying time interventions.
5. Buying time is not tractable.
Response: This is possible, but we currently doubt it. There seems to be a bunch of stuff that no one has tried (we will describe this further in a follow-up post).
6. In general, problems get solved by people actually trying to solve them. Not by avoiding the hard problems and hoping that people solve them in the future.
Response #1: Getting mainstream ML on board with alignment concerns is solving one of the hardest problems for the alignment field.
Response #2: Although some of the benefit from “buying time” involves hoping that new researchers show up with new ideas, we’re also buying time for existing researchers who are tackling the hard problems.
Response #3: People who have promising agendas that are attacking the core of the problem should continue doing technical alignment research. There are a lot of people who have been pushed to do technical alignment work who don’t have promising ideas (even after trying for years), or feel like they have gotten substantial signal that they are worse at thinking about alignment than others. These are the people we would be most excited to reallocate. (Note though that we think feedback loops in alignment are poor. Our current guess is that among the top 10% of junior researchers, it seems extremely hard to tell who will be in the tail. But it’s relatively easy to tell who is in the top 10-20% of junior researchers).
7. There’s a risk of overcorrection: maybe too many people will go into “buying time” interventions and too few people will go into technical alignment.
Response: Currently, we think that the AIS community heavily emphasizes the value of technical alignment relative to “buying time.” We think it’s unlikely that the culture shifts too far in the other direction.
8. This doesn’t seem truth-seeky or epistemically virtuous: trying to convince ML people of some specific claims feels wrong, especially given how confused we are and how much disagreement there is between alignment researchers.
Response #1: There is a core set of claims that are pretty well supported but that the majority of the ML community has not substantially engaged with (e.g., Goodharting, convergent instrumental goals, reward misspecification, goal misgeneralization, risks from power-seeking, risks from deception).
Response #2: We’re most excited about outreach efforts and coordination efforts that actually allow us to figure out how we’re wrong about things. If the alignment community is wrong about something, these interventions make it more likely that we find out (compared to a world in which we engage rather little with capabilities researchers + ML experts and only talk to people in our community). If others are able to refute or deconfuse points that are made in this outreach effort, this seems robustly good, and it seems possible that sufficiently good arguments would convince us (or the alignment community) about potentially cruxy issues.
What changes do we want to see?
Allocation of talent: On the margin, we think more people should be going into “buying time” interventions (and fewer people should be going into traditional safety research or traditional community-building).
Concretely, our impression is that roughly 80% of alignment researchers are currently working on directly solving alignment. We think the split should instead be roughly 50% working on alignment and 50% working on buying time.
We also think that roughly 90% of (competent) community-builders are focused on “typical community-building” (designed to get more alignment researchers). We think that roughly 70% should do typical community-building, and 30% should be buying time.
Culture: We think that “buying time” interventions should be seen as a path comparable to, or better than, (traditional) technical AI safety research and (traditional) community-building.
Funding: We’d be excited for funders to encourage more projects that buy time via ML outreach, improved coordination, taking conceptual ideas and grounding them empirically, etc.
Strategy: We’d be excited for more strategists to think concretely about what buying time looks like, how it could go wrong, and if/when it would make sense to accelerate capabilities (e.g., how would we know if OpenAI is about to lose to a less safety-conscious AI lab, and what would we want to do in this world?)
We did a BOTEC suggesting that 1 hour of alignment-researcher time redirected toward buying time would buy, in expectation, 1.5-5 quality-adjusted research hours. The BOTEC made several conservative assumptions (e.g., it did not account for the fact that we expect alignment research to be more heavy-tailed than buying-time interventions). We are revising the BOTEC and hope to post it once it is ready.
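For readers who want to see the structure of such a calculation, here is a minimal sketch. All parameters below are placeholder assumptions we chose for illustration; they are not the inputs to our actual BOTEC.

```python
# Minimal sketch of the *structure* of a BOTEC like the one described above.
# All numbers are placeholders chosen for illustration, not our actual BOTEC inputs.

community_research_hours_per_year = 300 * 2000  # assumed: ~300 alignment researchers, ~2000 hrs/yr each
quality_adjustment = 1.0                        # assumed: treat an average research hour as 1 quality-adjusted hour

# Assumed: one full-time year (~2000 hours) of buying-time work has a small chance
# of delaying AGI timelines by a full year.
p_delay_one_year_per_work_year = 0.01
hours_per_work_year = 2000

qa_research_hours_bought_per_hour = (
    p_delay_one_year_per_work_year
    * community_research_hours_per_year
    * quality_adjustment
    / hours_per_work_year
)
print(qa_research_hours_bought_per_hour)  # 3.0 with these placeholder numbers, i.e. inside the 1.5-5 range above
```

A real version would also need to account for the heavy-tailedness point discussed earlier and for downside risks, both of which this sketch ignores.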
[2] ODA+ = OpenAI, DeepMind, Anthropic, and a small number of other actors.