I recently held a workshop with PIBBSS fellows on the MTAIR model and thought some points from the overall discussion were valuable:
The discussants went over various scenarios related to AI takeover, including a superficially aligned system being delegated lots of power and gaining resources by entirely legitimate means, a WFLL2-like automation failure, and swift foom takeover. Some possibilities involved a more covert, silent coup where most of the work was done through manipulation and economic pressure. The concept of “$1T damage” as an intermediate stage to takeover appeared to be an unnatural fit with some of these diverse scenarios. There was some mention of whether mitigation or defensive spending should be considered as part of that $1T figure.
Alignment Difficulty and later steps merge many scenarios
The discussants interpreted “alignment is hard” (step 3) as implying that alignment is sufficiently hard that (given that APS is built), at least one APS is misaligned somewhere, and also that there’s some reasonable probability that any given deployed APS is unaligned. This is the best way of making the whole statement deductively valid.
However, proposition 3 being true doesn’t preclude the existence of other aligned APS AI (hard alignment and at least one unaligned APS might mean that there are leading conscientious aligned APS projects but unaligned reckless competitors). This makes discussion of the subsequent questions harder, as we have to condition on there possibly being aligned APS present as well which might reduce the risk of takeover.
This means that when assessing proposition 4, we have to condition on some worlds where aligned APS has already been deployed and used for defense, some where there have been warning shots and strong responses without APS, some where misaligned APS emerges out of nowhere and FOOMs too quickly for any response, and a slow takeoff where nonetheless every system is misaligned and there is a WFLL2 like takeover attempt, and add up the chance of large scale damage in all of these scenarios, conditioning on their probability, which makes coming to an overall answer to 4 and 5 challenging.
Definitions are value-laden and don’t overlap: TAI, AGI, APS
We differentiated between Transformative AI (TAI), defined by Karnofsky, Barnett and Cotra entirely by its impact on the world, which can either be highly destructive or drive very rapid economic growth; General AI (AGI), defined through a variety of benchmarks including passing hard adversarial Turing tests and human-like capabilities on enough intellectual tasks; and APS, which focuses on long-term planning and human-like abilities only on takeover-relevant tasks. We also mentioned Paul Christiano’s notion of the relevant metric being AI ‘as economically impactful as a simulation of any human expert’ which technically blends the definitions of AGI and TAI (since it doesn’t necessarily require very fast growth but implies it strongly). Researchers disagree quite a lot on even which of these are harder: Daniel Kokotaljo has argued that APS likely comes before TAI and maybe even before (the Matthew Barnett definition of) AGI, while e.g. Barnett thinks that TAI comes after AGI with APS AI somewhere in the middle (and possibly coincident with TAI).
In particular, some definitions of ‘AGI’, i.e. human-level performance on a wide range of tasks, could be much less than what is required for APS depending on what the specified task range is. If the human-level performance is only on selections of tasks that aren’t useful for outcompeting humans strategically (which could still be very many tasks, for example, human-level performance on everything that requires under a minute of thinking), the ‘AGI system’ could almost entirely lack the capabilities associated with APS. However, most of the estimates that could be used in a timelines estimate will revolve around AGI predictions (since they will be estimates of performance or accuracy benchmarks), which we risk anchoring on if we try to adjust them to predict the different milestones of APS.
In general it is challenging to use the probabilities from one metric like TAI to inform other predictions like APS, because each definition includes many assumptions about things that don’t have much to do with AI progress (like how qualitatively powerful intelligence is in the real world, what capabilities are needed for takeover, what bottlenecks are there to economic automation or research automation etc.) In other words, APS and TAI are value-laden terms that include many assumptions about the strategic situation with respect to AI takeover, world economy and likely responses.
APS is less understood and more poorly forecasted compared to AGI. Discussants felt the current models for AGI can’t be easily adapted for APS timelines or probabilities. APS carries much of the weight in the assessment due to its specific properties: i.e. many skeptics might argue that even if AGI is built, things which don’t meet the definition of APS might not be built.
Alignment and Deployment Decisions
Several discussants suggested splitting the model’s third proposition into two separate components: one focusing on the likelihood of building misaligned APS systems (3a) and the other on the difficulty of creating aligned ones (3b). This would allow a more nuanced understanding of how alignment difficulties influence deployment decisions. They also emphasized that detection of misalignment would impact deployment, which wasn’t sufficiently clarified in the original model.
Advanced Capabilities
There was a consensus that ‘advanced capabilities’ as a term is too vague. The discussants appreciated the attempt to narrow it down to strategic awareness and advanced planning but suggested breaking it down even further into more measurable skills, like hacking ability, economic manipulation, or propaganda dissemination. There are, however, disagreements regarding which capabilities are most critical (which can be seen as further disagreements about the difficulty of APS relative to AGI).
If strategic awareness comes before advanced planning, we might see AI systems capable of manipulating people, but not in ways that greatly exceed human manipulative abilities. As a result, these manipulations could potentially be detected and mitigated and even serve as warning signs that lower total risk. On the other hand, if advanced capabilities develop before strategic awareness or advanced planning, we could encounter AI systems that may not fully understand the world or their position in it, nor possess the ability to plan effectively. Nevertheless, these systems might still be capable of taking single, highly dangerous actions, such as designing and releasing a bioweapon.
Outside View & Futurism Reliability
We didn’t cover the outside view considerations extensively, but various issues under the “accuracy of futurism” umbrella arose which weren’t specifically mentioned.
The fact that markets don’t seem to have reacted as if Transformative AI is a near-term prospect, and the lack of wide scale scrutiny and robust engagement with risk arguments (especially those around alignment difficulty), were highlighted as reasons to doubt this kind of projection further.
The Fermi Paradox implies a form of X-risk that is self-destructive and not that compatible with AI takeover worries, while market interest rates also push the probability of such risks downward. The discussants recommended placing more weight on outside priors than we did in the default setting for the model, suggesting a 1:1 ratio compared to the model’s internal estimations.
Discussants also agreed with the need to balance pessimistic surviva- is-conjunctive views and optimistic survival-is-disjunctive views, arguing that the Carlsmith model is biased towards optimism and survival being disjunctive but that the correct solution is not to simply switch to a pessimism-biased survival is conjunctive model in response.
There’s a tendency to focus exclusively on the risks associated with misaligned APS systems seeking power, which can introduce a bias towards survival being predicated solely on avoiding APS takeover. However, this overlooks other existential risk scenarios that are more structural. There are potential situations without agentic power-seeking behavior but characterized by rapid changes could for less causally clear reasons include technological advancements or societal shifts that may not necessarily have a ‘bad actor’ but could still spiral into existential catastrophe. This post describes some of these scenarios in more detail.
>APS is less understood and poorly forecasted compared to AGI.
I disagree with this. I have come to dislike the term AGI because (a) its meaning is so poorly defined, (b) the concept most people have in mind will never be achieved, nor needs to be to get to the capability level that is necessary for catastrophic scenarios, and (c) the concept of “AGI” doesn’t get at the part of advanced AI that is relevant for reasoning about x-risk.
In this diagram, the AGI circle captures the concept that AGI is a system that subsumes all human capabilities. Realistically, this will never exactly happen. Is completing that AGI circle really the relevant concept? If we can shift that big Future AI bubble to the left and reduce the overlap with AGI, does that make us safer? No.
Granted, the APS concept is also highly deficient at this point in time. It also is too vague & ambiguous in its current form, especially in terms of how differently people interpret each of the six propositions. But, compared to the AGI term, it is at least a more constructive and relevant term.
I do like the idea of adopting a concept of “general intelligence” that is contrasted to “narrow intelligence.” It applies to a system can operate across a broad spectrum different domains and tasks, including ones it wasn’t specifically designed or trained for, and avoid brittleness when slightly out-of-distribution. IMHO, GPT-4 crossed that line for the first time this year. I.e., I don’t think it can be considered narrow (and brittle) anymore. However, this is not a dangerous line—GPT-4 is not an x-risk. So “general” is a useful concept, but misses something relevant to x-risk (probably many things).
>APS is less understood and poorly forecasted compared to AGI.
I should clarify that I was talking about the definition used by forecasts like the Direct Approach methodology and/or the definition given in the metaculus forecast or in estimates like the Direct Approach. The latter is roughly speaking, capability sufficient to pass a hard adversarial Turing tests and human-like capabilities on enough intellectual tasks as measured by certain tests.This is something that can plausibly be upper bounded by the direct approach methodology which aims to predict when an AI could get a negligible error in predicting what a human expert would say over a specific time horizon. So this forecast is essentially a forecast of ‘human-expert-writer-simulator AI’, and that is the definition that’s used in public elicitations like the metaculus forecasts.
However, I agree with you that while in some of the sources I cite that’s how the term is defined it’s not what the word denotes (just generality, which e.g. GPT-4 plausibly is for some weak sense of the word), and you also don’t get from being able to simulate the writing of any human expert to takeover risk without making many additional assumptions.
I recently held a workshop with PIBBSS fellows on the MTAIR model and thought some points from the overall discussion were valuable:
The discussants went over various scenarios related to AI takeover, including a superficially aligned system being delegated lots of power and gaining resources by entirely legitimate means, a WFLL2-like automation failure, and swift foom takeover. Some possibilities involved a more covert, silent coup where most of the work was done through manipulation and economic pressure. The concept of “$1T damage” as an intermediate stage to takeover appeared to be an unnatural fit with some of these diverse scenarios. There was some mention of whether mitigation or defensive spending should be considered as part of that $1T figure.
Alignment Difficulty and later steps merge many scenarios
The discussants interpreted “alignment is hard” (step 3) as implying that alignment is sufficiently hard that (given that APS is built), at least one APS is misaligned somewhere, and also that there’s some reasonable probability that any given deployed APS is unaligned. This is the best way of making the whole statement deductively valid.
However, proposition 3 being true doesn’t preclude the existence of other aligned APS AI (hard alignment and at least one unaligned APS might mean that there are leading conscientious aligned APS projects but unaligned reckless competitors). This makes discussion of the subsequent questions harder, as we have to condition on there possibly being aligned APS present as well which might reduce the risk of takeover.
This means that when assessing proposition 4, we have to condition on some worlds where aligned APS has already been deployed and used for defense, some where there have been warning shots and strong responses without APS, some where misaligned APS emerges out of nowhere and FOOMs too quickly for any response, and a slow takeoff where nonetheless every system is misaligned and there is a WFLL2 like takeover attempt, and add up the chance of large scale damage in all of these scenarios, conditioning on their probability, which makes coming to an overall answer to 4 and 5 challenging.
Definitions are value-laden and don’t overlap: TAI, AGI, APS
We differentiated between Transformative AI (TAI), defined by Karnofsky, Barnett and Cotra entirely by its impact on the world, which can either be highly destructive or drive very rapid economic growth; General AI (AGI), defined through a variety of benchmarks including passing hard adversarial Turing tests and human-like capabilities on enough intellectual tasks; and APS, which focuses on long-term planning and human-like abilities only on takeover-relevant tasks. We also mentioned Paul Christiano’s notion of the relevant metric being AI ‘as economically impactful as a simulation of any human expert’ which technically blends the definitions of AGI and TAI (since it doesn’t necessarily require very fast growth but implies it strongly). Researchers disagree quite a lot on even which of these are harder: Daniel Kokotaljo has argued that APS likely comes before TAI and maybe even before (the Matthew Barnett definition of) AGI, while e.g. Barnett thinks that TAI comes after AGI with APS AI somewhere in the middle (and possibly coincident with TAI).
In particular, some definitions of ‘AGI’, i.e. human-level performance on a wide range of tasks, could be much less than what is required for APS depending on what the specified task range is. If the human-level performance is only on selections of tasks that aren’t useful for outcompeting humans strategically (which could still be very many tasks, for example, human-level performance on everything that requires under a minute of thinking), the ‘AGI system’ could almost entirely lack the capabilities associated with APS. However, most of the estimates that could be used in a timelines estimate will revolve around AGI predictions (since they will be estimates of performance or accuracy benchmarks), which we risk anchoring on if we try to adjust them to predict the different milestones of APS.
In general it is challenging to use the probabilities from one metric like TAI to inform other predictions like APS, because each definition includes many assumptions about things that don’t have much to do with AI progress (like how qualitatively powerful intelligence is in the real world, what capabilities are needed for takeover, what bottlenecks are there to economic automation or research automation etc.) In other words, APS and TAI are value-laden terms that include many assumptions about the strategic situation with respect to AI takeover, world economy and likely responses.
APS is less understood and more poorly forecasted compared to AGI. Discussants felt the current models for AGI can’t be easily adapted for APS timelines or probabilities. APS carries much of the weight in the assessment due to its specific properties: i.e. many skeptics might argue that even if AGI is built, things which don’t meet the definition of APS might not be built.
Alignment and Deployment Decisions
Several discussants suggested splitting the model’s third proposition into two separate components: one focusing on the likelihood of building misaligned APS systems (3a) and the other on the difficulty of creating aligned ones (3b). This would allow a more nuanced understanding of how alignment difficulties influence deployment decisions. They also emphasized that detection of misalignment would impact deployment, which wasn’t sufficiently clarified in the original model.
Advanced Capabilities
There was a consensus that ‘advanced capabilities’ as a term is too vague. The discussants appreciated the attempt to narrow it down to strategic awareness and advanced planning but suggested breaking it down even further into more measurable skills, like hacking ability, economic manipulation, or propaganda dissemination. There are, however, disagreements regarding which capabilities are most critical (which can be seen as further disagreements about the difficulty of APS relative to AGI).
If strategic awareness comes before advanced planning, we might see AI systems capable of manipulating people, but not in ways that greatly exceed human manipulative abilities. As a result, these manipulations could potentially be detected and mitigated and even serve as warning signs that lower total risk. On the other hand, if advanced capabilities develop before strategic awareness or advanced planning, we could encounter AI systems that may not fully understand the world or their position in it, nor possess the ability to plan effectively. Nevertheless, these systems might still be capable of taking single, highly dangerous actions, such as designing and releasing a bioweapon.
Outside View & Futurism Reliability
We didn’t cover the outside view considerations extensively, but various issues under the “accuracy of futurism” umbrella arose which weren’t specifically mentioned.
The fact that markets don’t seem to have reacted as if Transformative AI is a near-term prospect, and the lack of wide scale scrutiny and robust engagement with risk arguments (especially those around alignment difficulty), were highlighted as reasons to doubt this kind of projection further.
The Fermi Paradox implies a form of X-risk that is self-destructive and not that compatible with AI takeover worries, while market interest rates also push the probability of such risks downward. The discussants recommended placing more weight on outside priors than we did in the default setting for the model, suggesting a 1:1 ratio compared to the model’s internal estimations.
Discussants also agreed with the need to balance pessimistic surviva- is-conjunctive views and optimistic survival-is-disjunctive views, arguing that the Carlsmith model is biased towards optimism and survival being disjunctive but that the correct solution is not to simply switch to a pessimism-biased survival is conjunctive model in response.
Difficult to separate takeover from structural risk
There’s a tendency to focus exclusively on the risks associated with misaligned APS systems seeking power, which can introduce a bias towards survival being predicated solely on avoiding APS takeover. However, this overlooks other existential risk scenarios that are more structural. There are potential situations without agentic power-seeking behavior but characterized by rapid changes could for less causally clear reasons include technological advancements or societal shifts that may not necessarily have a ‘bad actor’ but could still spiral into existential catastrophe. This post describes some of these scenarios in more detail.
I should clarify that I was talking about the definition used by forecasts like the Direct Approach methodology and/or the definition given in the metaculus forecast or in estimates like the Direct Approach. The latter is roughly speaking, capability sufficient to pass a hard adversarial Turing tests and human-like capabilities on enough intellectual tasks as measured by certain tests.This is something that can plausibly be upper bounded by the direct approach methodology which aims to predict when an AI could get a negligible error in predicting what a human expert would say over a specific time horizon. So this forecast is essentially a forecast of ‘human-expert-writer-simulator AI’, and that is the definition that’s used in public elicitations like the metaculus forecasts.
However, I agree with you that while in some of the sources I cite that’s how the term is defined it’s not what the word denotes (just generality, which e.g. GPT-4 plausibly is for some weak sense of the word), and you also don’t get from being able to simulate the writing of any human expert to takeover risk without making many additional assumptions.