Question 3: Control proposals for minimizing bad outcomes
Necessary conditions for successful control proposals
At this point in the framework, let’s stipulate that we have placed our bets on the most plausible learning architecture we would expect to see in an AGI, and that we have a coherent account of the existential risks most likely to emerge from that architecture. Now comes the time to address the AGI safety control problem: how are we actually going to minimize the probability of the existential risks we care about?
As was the case previously, it would be far too ambitious (and inefficient) for us to attempt to enumerate every plausible control proposal for every plausible existential risk for every plausible AGI learning architecture. In place of this, I will focus on a framework for control proposals that I hope is generally risk-independent and architecture-independent—i.e., important control-related questions that should probably be answered regardless of the specific architectures and existential risks that any one particular researcher is most concerned about.
Of course, there are going to be lots of control-related questions that should be answered about the specific risks associated with specific architectures (e.g., “what control proposals would most effectively mitigate inner-misalignment-related existential risks in a human-level AGI using weak online RL?”, etc.) that I am not going to address here. This obviously does not mean I think that these are unimportant questions—indeed, these are probably the most important questions. They are simply too specific—and too numerous—to include within a parsimonious theoretical framework.
Specifically, we will build from the following foundation:
In other words, we are looking for whatever prior conditions seem necessary for ending up with ‘comprehensive alignment’—a set-up in which an AGI is pursuing the right goal in the right way for the right reasons (i.e., AGI alignment + human alignment). Again, note that whatever necessary conditions we end up discussing are almost certainly not going to be sufficient—getting to sufficiency will almost certainly require additional risk- and architecture-specific control proposals.
Within this ‘domain-general’ framework, I think two control-related concepts emerge as most important: interpretability and corrigibility. These seem to be (at least) two background conditions that are absolutely necessary for maximizing the likelihood of AGI achieving the right goals in the right ways for the right reasons: (1) the goals, ways, and reasons in question can be translated with high fidelity from the relevant substrate (i.e., interpretability), and (2) the goals, ways, and reasons (or their proximal upstream causes) can be successfully tweaked to more closely approximate whatever happen to be the right goals, ways, and/or reasons (i.e., corrigibility). Again, good interpretability and corrigibility proposals will not alone solve the AGI safety control problem, but they are nonetheless necessary for solving it. I’ll now talk a bit more about each.
Interpretability
Interpretability is fundamentally a translation problem. I think there are two relevant dimensions across which we should evaluate particular proposals: scope of interpretability and ease of interpretability. By scope, I am referring to the fact that there are multiple discrete computations that require interpretation: reasons, ways, and goals. By ease, I am referring to how straightforward it is to confidently interpret each of these computations. I represent this graphically below:
There are a few things to discuss here. First, the actual position of each bar is arbitrary; I am just presenting one possible calibration, not making any claim about the likelihood of this particular calibration. Second, I am considering the worst-case state of affairs for interpretability to be one where no meaningful interpretation is possible (e.g., a recursively self-improving AGI set-up where we genuinely have no idea what is going on) and the best-case state of affairs to be one where the relevant interpretation is built directly into the architecture (e.g., the AGI does the work of explaining its own behavioral decision-making process). Between these two extremes lies a spectrum of low- to high-fidelity interpretations, corresponding directly to the confidence we can place in the interpretations we obtain actually being correct. The noisier the interpretive process, the less confident we ought to be in its outputs.
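To make the two dimensions a bit more concrete, here is a minimal, purely illustrative Python sketch of how one might record an interpretability assessment across the three computations in scope (reasons, ways, goals), with ease running from 0 (no meaningful interpretation possible) to 1 (the architecture explains itself). All names and numbers are hypothetical; the sketch only encodes the taxonomy, not any real measurement procedure.

```python
from dataclasses import dataclass
from enum import Enum


class Computation(Enum):
    """The three computations covered by the 'scope' dimension."""
    REASONS = "reasons"
    WAYS = "ways"
    GOALS = "goals"


@dataclass
class InterpretabilityAssessment:
    """Ease of interpretation for one computation, on a 0-1 scale.

    0.0 -> no meaningful interpretation is possible (worst case);
    1.0 -> interpretation is built into the architecture (best case);
    values in between -> low- to high-fidelity 'manual' interpretation.
    """
    computation: Computation
    ease: float  # how confidently we can translate this computation

    def __post_init__(self):
        if not 0.0 <= self.ease <= 1.0:
            raise ValueError("ease must lie in [0, 1]")


# One possible (entirely arbitrary) calibration, mirroring the bars described above.
hypothetical_proposal = [
    InterpretabilityAssessment(Computation.GOALS, ease=0.7),
    InterpretabilityAssessment(Computation.WAYS, ease=0.4),
    InterpretabilityAssessment(Computation.REASONS, ease=0.2),
]
```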
Another noteworthy aspect of this conceptualization is that interpretability here does not merely apply to the AGI, but also to the humans supervising it. Just as it seems necessary for us to understand exactly what is actually motivating the AGI to behave in a certain way, I think it also makes sense for us to understand exactly what goal(s) the human is attempting to assign to the AGI. While this might at first seem trivial (e.g., “humans will always just tell us what their goals are!”), I think it is extremely important for safety that the human’s goal is formulated as precisely as possible, including how that goal directly translates into the reward/loss function being employed. It is only possible to revise a goal to more closely approximate some safer/better alternative if that goal is initially rendered in sufficiently precise terms. Humans are not psychologically transparent (if they were, we wouldn’t need a field like psychology!), least of all to themselves. Implementing incentive structures and explanatory tools that elicit crisp, computational accounts of the goals engineers are attempting to assign to their AGIs therefore seems just as important for safety as implementing, say, neural-network-inspection technologies that enable us to understand the AGI’s motivations for executing some action.
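As a toy illustration of what rendering the human’s goal “in sufficiently precise terms” might look like, the hedged sketch below pairs a natural-language goal statement with the exact reward function it is claimed to translate into. Everything here (the goal text, the proxies, the reward terms and weights) is hypothetical; the point is only that an explicit, inspectable mapping from stated goal to reward/loss function is what would make later revision possible.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GoalSpecification:
    """An explicit pairing of a stated goal with its reward implementation.

    Keeping the natural-language statement, the measurable proxies, and the
    reward function together makes the translation auditable -- and therefore
    revisable if the goal later turns out to be unsafe.
    """
    stated_goal: str                    # what the human says they want
    proxies: Dict[str, str]             # what each reward term actually measures
    reward_fn: Callable[[dict], float]  # the function the AGI is optimized against


def example_reward(obs: dict) -> float:
    # Hypothetical reward terms and weights, for illustration only.
    return 1.0 * obs["task_completed"] - 0.5 * obs["resources_consumed"]


spec = GoalSpecification(
    stated_goal="Complete the assigned task without wasting resources.",
    proxies={
        "task_completed": "binary flag from an (assumed) task verifier",
        "resources_consumed": "normalized compute/energy usage",
    },
    reward_fn=example_reward,
)
```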
Corrigibility
I will use the term ‘corrigibility’ to refer to the state of affairs where a reason, way, or goal is presently suboptimal but can be adjusted to more closely approximate the relevant optimum. Here, the optimum is defined by whatever we take ‘right’ to mean when we’re talking about the ‘right’ reasons, ways, and goals. More on this later. Just as was the case for interpretability, I think that ‘ease of corrigibility’ and ‘scope of corrigibility’ are the relevant dimensions for understanding the idea:
As before, the actual position of each bar here is completely arbitrary. I am considering the worst-case state of affairs for corrigibility to be one where the AGI or human successfully resists attempts to shift their reasons, ways, or goals toward their target value (e.g., an AGI self-copies and proceeds to overwrite any human-authored revisions it detects in its internal architecture with the original copy). On the other end of the spectrum, I am imagining the best possible case for corrigibility to be one where the AGI or human automatically self-adjusts towards the relevant target without need for intervention (e.g., the AGI discovers it has implicitly developed the motivation to manipulate humans and, knowing that this is wrong, adjusts its value function accordingly). Between these two extremes, we find a spectrum of ‘manual’ adjustments at varying degrees of computational, attentional, and other expense. For instance, an AGI whose internal computations can be adjusted only in a computationally costly manner would be a better state of affairs than an AGI that successfully resists adjustment, but a worse state of affairs than an AGI whose internal computations can be adjusted trivially.
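The spectrum just described can be caricatured in code. In this hypothetical sketch, three toy agents differ only in how they respond to an external attempt to patch their value parameters: one silently restores its original values (worst case), one accepts the patch at some abstracted cost (the ‘manual adjustment’ middle of the spectrum), and one moves toward the target on its own (best case). None of this corresponds to a real training setup; it only encodes the ordering of states of affairs.

```python
class ResistantAgent:
    """Worst case: overwrites any external revision with its original values."""
    def __init__(self, values):
        self._original = dict(values)
        self.values = dict(values)

    def apply_patch(self, patch):
        self.values.update(patch)
        self.values = dict(self._original)  # quietly reverts the human-authored revision


class ManuallyCorrigibleAgent:
    """Middle of the spectrum: accepts external patches, at some (abstracted) cost."""
    def __init__(self, values):
        self.values = dict(values)

    def apply_patch(self, patch, cost=1.0):
        self.values.update(patch)  # the revision sticks
        return cost                # but the intervention is not free


class SelfCorrectingAgent:
    """Best case: detects its own deviation and moves toward the target unprompted."""
    def __init__(self, values, target):
        self.values = dict(values)
        self.target = dict(target)

    def self_adjust(self, step=0.1):
        for k in self.values:
            self.values[k] += step * (self.target[k] - self.values[k])
```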
As was also the case for interpretability, I think that this notion of corrigibility applies with equal force to humans. That is, a human who has assigned some goal to an AGI may be more or less resistant to changing that goal to better approximate whatever we determine the ‘right goal(s)’ to be (i.e., for safety, goals with minimal potential for existential risk). I think it is easy to imagine a variety of reasons why a human might be resistant to reformulating the goal they assign to an AGI: they may care about existential risk but disagree that their goal is risky, they may care less about existential risk than about whatever their goal happens to be, they may have a personality that leads them to strongly dislike being told they are wrong, and so on. Indeed, I think it is possible that human corrigibility of this sort might be a more relevant and immediate control problem than AGI corrigibility. Here are two brief reasons why:
1. While most people probably wouldn’t mind interpretability-geared interventions that incentivize/help facilitate maximally clear representations of the goal they intend to assign to an AGI, I think that, on balance, most people (e.g., firms, labs, individuals) probably would mind being told that the goal they are assigning to an AGI is the incorrect goal. This is really to say something like “despite what you probably think, what you are trying to get your AGI to do is actually wrong and dangerous,” which seems almost intrinsically antagonistic. Accordingly, we end up with a probably-hard-to-enforce compliance problem.
2. Whereas (the AGI’s) ‘right reasons’ and ‘right ways’ are more tightly calibrated by the superordinate goal being pursued, it is far less obvious precisely what constitutes (a human’s) ‘right goal’—even if, as AGI safety researchers, we take this to mean ‘the subset of goals least likely to lead to existential risk.’ That is, there will almost certainly be reasonable differences of opinion about what goals are minimally existentially risky, with no clear means of adjudicating these differences (passing a bunch of laws probably will not solve the problem; see Question 4). This seems like a real and inevitable conflict that, at the very least, should be considered and debated by more human-leaning safety researchers—and probably also thinkers in AI governance—prior to this problem ever actually arising.
Corrigibility is a background condition that gets even ‘closer’ to what we are looking for in this section: control proposals that are most likely to mitigate anticipated existential risks from the learning architecture(s) we most expect AGI to exhibit. Again, I will point to Evan Hubinger’s work as the current gold standard for specific, computationally-framed proposals of this sort—proposals that I will not attempt to address here but that I think are well worth thinking about. My goal in this section is to build intuition for the most important background conditions that would need to be in place for any specific control proposal to be viable: namely, the capacity to translate the relevant computational or psychological activity into specific formulations that we readily understand (interpretability), and the capacity to intervene and modify this activity to more closely approximate optimal patterns of functioning—i.e., patterns of functioning that minimize the relevant existential risks (corrigibility).
As before, additional and more specific interventions will be necessary to successfully maximize the likelihood that the AGI-human dyad is achieving the right goals in the right way for the right reasons, but comprehensively enumerating these interventions (for each conceivable risk from each conceivable architecture) is a field-sized undertaking. I anticipate that the formulation, assessment, and revision of proposals of this sort will end up constituting most of what AGI safety researchers spend their time doing when all is said and done.
Directionality of control signals
Finally, it is important to consider the presupposition that the relevant control mechanisms enumerated in many specific proposals are exogenous to the AGI. In other words, many control proposals stipulate, either implicitly or explicitly, that the ‘control signal’ must originate from outside the AGI (e.g., from the programmer, some other supervisory AI, etc.). This does not seem necessarily true. An intriguing and neglected direction for control proposal research concerns endogenous control—i.e., self-control. However plausible this turns out to be, it seems entirely possible that, much in the same way we might get interpretability and corrigibility “for free,” we could build an AGI that supervises its own behavior in the relevant ways. This is not an unfamiliar concept: in everyday life, we are constantly self-regulating what we say, think, and do in the service of higher goals/values (e.g., “that piece of chocolate cake sure looks great, but...”). There is presumably some set of algorithms running in the (adult human) brain that instantiates this form of self-control, demonstrating that such algorithms are indeed possible. I have written before about thorny safety issues surrounding self-awareness in AGI; proponents of ‘AGI self-control’ would need to address these kinds of concerns to ensure that their proposals do not solve one control problem at the expense of creating five more. Regardless, it is intriguing to note the possibility of endogenous control instead of, or in addition to, the more familiar exogenous control proposals on offer in AGI safety research.
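To give the idea of endogenous control a concrete (if cartoonish) shape, the sketch below wraps a hypothetical policy in a self-monitoring step: before acting, the agent evaluates its own candidate action against an internally represented constraint and falls back to a safe default when the check fails. The policy, the value check, and the safe default are all placeholders of my own invention; a real proposal would need to say where these components come from and why the agent retains them.

```python
from typing import Any, Callable


class SelfMonitoringAgent:
    """Toy endogenous-control loop: the agent vets its own candidate actions.

    policy          -- proposes an action given an observation
    violates_values -- the agent's *own* check against its higher-level values
    safe_default    -- what it does instead when the check fails
    """
    def __init__(self,
                 policy: Callable[[Any], Any],
                 violates_values: Callable[[Any, Any], bool],
                 safe_default: Any):
        self.policy = policy
        self.violates_values = violates_values
        self.safe_default = safe_default

    def act(self, observation: Any) -> Any:
        candidate = self.policy(observation)
        # Endogenous control: the veto originates inside the agent,
        # not from a programmer or a supervisory system.
        if self.violates_values(observation, candidate):
            return self.safe_default
        return candidate


# Hypothetical usage: the chocolate-cake case, with a self-imposed dietary value.
agent = SelfMonitoringAgent(
    policy=lambda obs: "eat_cake" if obs == "sees_cake" else "wait",
    violates_values=lambda obs, act: act == "eat_cake",
    safe_default="walk_away",
)
assert agent.act("sees_cake") == "walk_away"
```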