This is preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di—es—can-ic-ul-ar—es): the alignment scheme I currently find most promising.
It seems that the natural formal criterion for alignment (or at least the main criterion) is having a “subjective regret bound”: that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user’s values). Dialogic RL attacks this problem in the manner which seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote F.
F allows making formal statements about a formal model M of the world, as seen from the AI’s perspective.M includes such elements as observations, actions, rewards and corruption. That is, M reflects (i) the dynamics of the environment (ii) the values of the user (iii) processes that either manipulate the user, or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional “perceptible” reward function, an instrumental reward function, a semi-instrumental reward functions, dynamically-inconsistent rewards, rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that, M also reflects the question-answer interface and the user’s behavior.
A single question can consist, for example, of asking for the probability of some sentence in F or the expected value of some expression of numerical type in F. However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understands such complex formal questions unaided. Therefore, the AI always produces a formal question qF together with a natural language (N) annotationqN. This annotation has to explain the question in human understandable terms, and also convince the user that qN is indeed an accurate natural language rendering of qF. The user’s feedback then consists of (i) accepting/rejecting/grading the annotation (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between N and F, i.e effectively building a new shared language between the user and the AI. We can imagine concepts defined in F and explained in N that serve to define further, more complex, concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn’t admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.
The first major concern that arises at this point is, questions can serve as an attack vector. This is addressed by quantilization. The key assumption is: it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach, i.e. having the agent to produce additional natural language text that attempts to refute the validity of the annotation.
Of course, in addition to the question interface, the physical interface (direct interaction with environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like “do nothing” or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also need to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.
Crucially, we want our assumptions about user competence to be weak. This means that, the user can produce answers that are (i) incomplete (just refuse to answer) (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agents knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user’s knowledge state is also assumed to be regular to some extent). The commitment mechanism is a key tool for achieving optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point of time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee “subgame perfection”: i.e., whatever irreversible mistakes were made, the AI recovers as best as possible.
In order to improve the user’s ability to produce answers and stay informed about the AI’s strategy, we introduce another interface through which the user can ask questions from the AI. In this direction, the user asks the question in N and the AI both translates it to F and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof in which case the AI may or may not succeed. Here, there is concern about the scenario where (i) the user asks a question about a possible future (ii) the AI answers with a detailed description of the future (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can on its own realize the inherent danger. In the latter approach, the AI can refuse to answer or provide a “censored” answer while pointing out the presence of a potential infohazard.
Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user’s knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning setting, by allowing the user’s knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.
A variant of Dialogic RL with improved corrigibility. Suppose that the AI’s prior allows a small probability for “universe W” whose semantics are, roughly speaking, “all my assumptions are wrong, need to shut down immediately”. In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior that assumption that the formal question “W?” is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering “yes” if asked “W?” at any uncorrupt point in the future, the AI will shutdown immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won’t arrive at believing too many futures to be corrupt and thereby avoid the imperative to shutdown as response to a confirmation of W.
Now, this won’t help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn’t ensure safety in itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace “shutdown” by “undo everything you did and then shutdown” but that gets us into thorny specifications issues. Perhaps it’s possible to tackle those issues by one of the approaches to “low impact”.
Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional “sanity test” precaution.
This design is made of so many parts! It might benefit from a proof that it is the universal answer to a formal question, which I expect to seem less overly complex.
I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and and its influence on the world (even if only on a fairly abstract level). I will share some thoughts how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.
Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.
The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.
The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.
The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.
The need to translate between formal language and natural languages come from, not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.
Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps, we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents populated it for a long time.
This is preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di—es—can-ic-ul-ar—es): the alignment scheme I currently find most promising.
It seems that the natural formal criterion for alignment (or at least the main criterion) is having a “subjective regret bound”: that is, the AI has to converge (in the long term planning limit, γ→1 limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communication protocol between the AI and the user that will allow transmitting this knowledge state to the AI (including knowledge about the user’s values). Dialogic RL attacks this problem in the manner which seems the most straightforward and powerful: allowing the AI to ask the user questions in some highly expressive formal language, which we will denote F.
F allows making formal statements about a formal model M of the world, as seen from the AI’s perspective.M includes such elements as observations, actions, rewards and corruption. That is, M reflects (i) the dynamics of the environment (ii) the values of the user (iii) processes that either manipulate the user, or damage the ability to obtain reliable information from the user. Here, we can use different models of values: a traditional “perceptible” reward function, an instrumental reward function, a semi-instrumental reward functions, dynamically-inconsistent rewards, rewards with Knightian uncertainty etc. Moreover, the setup is self-referential in the sense that, M also reflects the question-answer interface and the user’s behavior.
A single question can consist, for example, of asking for the probability of some sentence in F or the expected value of some expression of numerical type in F. However, in order to address important features of the world, such questions have to be very complex. It is infeasible to demand that the user understands such complex formal questions unaided. Therefore, the AI always produces a formal question qF together with a natural language (N) annotation qN. This annotation has to explain the question in human understandable terms, and also convince the user that qN is indeed an accurate natural language rendering of qF. The user’s feedback then consists of (i) accepting/rejecting/grading the annotation (ii) answering the question if the annotation is correct and the user can produce the answer. Making this efficient requires a process of iteratively constructing a correspondence between N and F, i.e effectively building a new shared language between the user and the AI. We can imagine concepts defined in F and explained in N that serve to define further, more complex, concepts, where at each stage the previous generation of concepts can be assumed given and mutually understandable. In addition to such intensional definitions we may also allow extensional definitions, as long as the generalization is assumed to be via some given function space that is relatively restricted (e.g. doesn’t admit subagents). There seem to be some strong connections between the subproblem of designing the annotation system and the field of transparency in AI.
The first major concern that arises at this point is, questions can serve as an attack vector. This is addressed by quantilization. The key assumption is: it requires much less optimization power to produce some useful question than to produce a malicious question. Under this assumption, the quantilization parameter can be chosen to make the question interface safe but still effective. Over time, the agent accumulates knowledge about corruption dynamics that allows it to steer even further away from malicious questions while making the choice of questions even more effective. For the attack vector of deceitful annotations, we can improve safety using the debate approach, i.e. having the agent to produce additional natural language text that attempts to refute the validity of the annotation.
Of course, in addition to the question interface, the physical interface (direct interaction with environment) is also an attack vector (like in any RL system). There, safety is initially guaranteed by following a baseline policy (which can be something like “do nothing” or human imitation). Later, the agent starts deviating from the baseline policy while staying safe, by leveraging the knowledge it previously gained through both the question and the physical interface. Besides being safe, the algorithm also need to be effective, and for this it has to (in particular) find the learning strategy that optimally combines gaining knowledge through the question interface and gaining knowledge through autonomous exploration.
Crucially, we want our assumptions about user competence to be weak. This means that, the user can produce answers that are (i) incomplete (just refuse to answer) (ii) fickle (change eir answers) and (iii) inconsistent (contradictory answers). We address (i) by either assuming that the answerable questions are sufficient or requiring a weaker regret bound where the reference agents knows all obtainable answers rather than knowing the full knowledge state they refer to. We address (ii) by believing later answers over earlier answers and via the commitment mechanism (see below). We address (iii) by explaining the contradictions to the user and asking for their resolution. In addition to all that, the AI can predict the answers instead of actually asking (i.e. the user’s knowledge state is also assumed to be regular to some extent). The commitment mechanism is a key tool for achieving optimal trade-off between allowing the user more time to think vs. acting quickly enough to address external dangers, while keeping the user informed of the consequences. It works by having the AI ask the user for a committed answer at the point of time when the AI has to commit to an irreversible course of action (and not much earlier), based on this particular answer. The AI provides the user with a a description of the Pareto frontier for the different possible answers. Moreover, even committed answers can be changed later. In this case, the algorithm cannot guarantee global optimality but it should still guarantee “subgame perfection”: i.e., whatever irreversible mistakes were made, the AI recovers as best as possible.
In order to improve the user’s ability to produce answers and stay informed about the AI’s strategy, we introduce another interface through which the user can ask questions from the AI. In this direction, the user asks the question in N and the AI both translates it to F and provides a justification (as well as a refutation) of the translation, again employing quantilization. The user can then accept or reject the translation. If the translation is rejected, ey may require the AI to try again. If it is accepted, the AI produces an answer to the best of its ability. In general, it may be that producing an answer requires compiling a formal proof in which case the AI may or may not succeed. Here, there is concern about the scenario where (i) the user asks a question about a possible future (ii) the AI answers with a detailed description of the future (iii) as a result, the user is exposed to infohazards (sources of corruption) within this future. To address this, we need to either restrict the interface to small volumes of data, or shape the prior s.t. the AI can on its own realize the inherent danger. In the latter approach, the AI can refuse to answer or provide a “censored” answer while pointing out the presence of a potential infohazard.
Finally, the subjective regret bound approach might seem to impose a strong cap on capability: if the AI follows a policy optimal w.r.t. the user’s knowledge state, then the limited computational abilities of the user might prevent the AI from creating models of the world that are more detailed. This can be addressed in a Turing Reinforcement Learning setting, by allowing the user’s knowledge state to refer to arbitrary computations. The user can then be incapable of executing those computations, but the AI might be more able to do so.
I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording.
A variant of Dialogic RL with improved corrigibility. Suppose that the AI’s prior allows a small probability for “universe W” whose semantics are, roughly speaking, “all my assumptions are wrong, need to shut down immediately”. In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior that assumption that the formal question “W?” is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering “yes” if asked “W?” at any uncorrupt point in the future, the AI will shutdown immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won’t arrive at believing too many futures to be corrupt and thereby avoid the imperative to shutdown as response to a confirmation of W.
Now, this won’t help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn’t ensure safety in itself, just hardens a system that is supposed to be already close to safe.
It would be nice if we could replace “shutdown” by “undo everything you did and then shutdown” but that gets us into thorny specifications issues. Perhaps it’s possible to tackle those issues by one of the approaches to “low impact”.
Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases leading to shutdown. This serves as an additional “sanity test” precaution.
This design is made of so many parts! It might benefit from a proof that it is the universal answer to a formal question, which I expect to seem less overly complex.
I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and and its influence on the world (even if only on a fairly abstract level). I will share some thoughts how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.
Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.
The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.
The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.
The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.
The need to translate between formal language and natural languages come from, not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.
Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps, we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents populated it for a long time.