There is a deficiency in this “dynamically subjective” regret bound (which can also be called a “realizable misalignment” bound) as a candidate formalization of alignment: it is not robust to scaling down. If the AI’s prior allows it to accurately model the user’s beliefs (the realizability assumption), then the criterion seems correct. But imagine that the user’s beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models to capture some properties of the user’s beliefs and exploit them, but this might not be good enough. Such an AI might therefore fall into a dangerous zone where it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn’t.
To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We need that, for any reality that satisfies the basic assumptions built into the prior (such as: the baseline policy is fairly safe, most questions are fairly safe, human beliefs don’t change too fast, etc.), the agent will not fail catastrophically. (It would be far too much to ask that it converge to optimality; that would violate no-free-lunch.) In order to formalize “not fail catastrophically” I propose the following definition.
Let’s start with the case when the user’s preferences and beliefs are dynamically consistent. Consider some AI-observable event S that might happen in the world. Consider a candidate learning algorithm πlearn and two auxiliary policies. The policy πbase→S follows the baseline policy until S happens, at which time it switches to the subjectively optimal policy. The policy πlearn→S follows the candidate learning algorithm until S happens, at which time it also switches to the subjectively optimal policy. Then, the “S-dangerousness” of πlearn is defined to be the expected utility of πbase→S minus the expected utility of πlearn→S. Thus, when S-dangerousness is zero or negative, πlearn→S does no worse than πbase→S.
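In symbols (a minimal sketch of the definition above; the notation Dng_S and the conditioning on a policy are my shorthand, with U the user’s utility and expectations taken with respect to the user’s subjective beliefs):

\[
\mathrm{Dng}_S(\pi^{\mathrm{learn}}) \;:=\; \mathbb{E}\!\left[U \mid \pi^{\mathrm{base}\to S}\right] \;-\; \mathbb{E}\!\left[U \mid \pi^{\mathrm{learn}\to S}\right]
\]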
Why do we need S? Because without S the criterion would allow policies that don’t damage the present but permanently destroy opportunities that could be used by a future better AI.
In the dynamically consistent case, dangerousness can be represented as an expected sum, over the time steps before S, of Bellman errors with respect to the value function of πbase→S. This allows us to generalize it to the dynamically inconsistent case, by writing a similar expression in which each Bellman error term uses the transient preferences and beliefs of the user at the given moment.
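Spelled out for the dynamically consistent case (my sketch of the decomposition via the standard performance-difference identity; τS denotes the time at which S occurs, and V, Q are the value and action-value functions of πbase→S under the user’s subjective beliefs):

\[
\mathrm{Dng}_S(\pi^{\mathrm{learn}}) \;=\; \mathbb{E}_{\pi^{\mathrm{learn}\to S}}\!\left[\sum_{t<\tau_S}\Big(V^{\mathrm{base}\to S}(x_t)-Q^{\mathrm{base}\to S}(x_t,a_t)\Big)\right]
\]

The terms with t ≥ τS vanish in expectation because, from S onward, both policies follow the subjectively optimal policy.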
Is it truly possible to have a reasonable bound on S-dangerousness for all S, and is it possible to do so while maintaining a reasonable realizable misalignment bound? It seems possible, for the following reason. The user’s beliefs can be represented as a mapping from questions to answers(fn1). If you sample questions from any fixed distribution, then by verifying that you can predict the answers, you gain valid information about the belief state without assuming any prior over belief states (it is a “frequentist” guarantee). Therefore, the AI can constrain itself to taking only those actions which are known to be safe on the basis of this “robust” information. Since there is no guarantee that the AI will find a model that predicts the answers, in the unrealizable case this might leave it without an effective strategy, but even without any information the AI can stay safe by following the baseline.
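To illustrate the flavor of the frequentist guarantee (a standard concentration bound, added here for illustration rather than taken from the argument above): if the AI samples n i.i.d. questions from the fixed distribution and its model predicts the user’s answers correctly on a fraction p̂ of them, then by Hoeffding’s inequality the model’s true prediction accuracy p under that distribution satisfies, with probability at least 1−δ,

\[
p \;\ge\; \hat{p} - \sqrt{\tfrac{\ln(1/\delta)}{2n}},
\]

with no assumption that the true question-answer mapping lies inside the model class.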
This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has three strategies with which to respond: (i) comply with the shutdown; (ii) resist defensively, i.e. prevent the shutdown but without irreversibly damaging anything; (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user’s stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That’s because resisting offensively would generate high dangerousness through permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period. At the least, this is much more corrigible than CIRL, which guarantees nothing in the unrealizable case; even in the realizable case no general guarantees were obtained (and arguably cannot be obtained, since the AI might not have enough information).
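As a toy calculation (my illustration, not part of the original argument): if defensive resistance costs roughly r units of subjective value per time step and lasts T steps, while offensive resistance permanently destroys an amount ΔV of value, then

\[
\mathrm{Dng}_{\text{defensive}} \approx rT \qquad \text{vs.} \qquad \mathrm{Dng}_{\text{offensive}} \approx \Delta V,
\]

so brief defensive resistance is compatible with a tight dangerousness bound, while offensive resistance is not whenever ΔV is large.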
This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms without theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage. A misalignment bound is only needed to prove that the AI will also be highly capable at pursuing the user’s goals. The way such a heuristic AI may work is by producing formal certificates for each action it takes. Then, we need not trust the mechanism suggesting the actions nor the mechanism producing the certificates, as long as we trust the verification of those certificates (which doesn’t require AI). The untrustworthy part might still be dangerous if it can spawn non-Cartesian daemons. But that is preventable using TRL, assuming that the “core” agent has low dangerousness and is too weak to spawn superhuman daemons without the “envelope”.
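A minimal sketch of this propose/certify/verify loop, assuming a trusted non-AI verifier and an otherwise untrusted heuristic stack (all function names here are hypothetical placeholders, not an existing API):

```python
# Untrusted heuristics propose actions and certificates; only the
# non-AI certificate verifier and the baseline policy are trusted.

def choose_action(observation, untrusted_proposer, untrusted_certifier,
                  trusted_verifier, baseline_policy):
    """Select one action for the current time step."""
    action = untrusted_proposer(observation)                 # heuristic, no guarantees
    certificate = untrusted_certifier(observation, action)   # also untrusted
    if trusted_verifier(observation, action, certificate):
        # Certificate checks out: the action is formally certified, e.g. as
        # keeping S-dangerousness below the agreed bound.
        return action
    # Verification failed: fall back to the baseline policy, which the
    # prior's basic assumptions take to be fairly safe.
    return baseline_policy(observation)
```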
(fn1) In truth, this assumption that the user’s answers come from a mapping that changes only slowly is probably unrealistic, because the user need not have coherent beliefs even over short timescales. For example, there might be many pairs of fairly ordinary (non-manipulative) questions such that asking them in a different order will produce different answers. However, to the extent that the user’s beliefs are incoherent, and therefore admit multiple equally plausible interpretations, learning any one interpretation should be good enough. Therefore, although the model needs to be made more general, the learning problem should not become substantially more difficult.
This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I’ve made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.
There is some similarity, but there are also major differences. They don’t even have the same type signature: the dangerousness bound is a desideratum that any given algorithm either satisfies or doesn’t, whereas AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but even then they would be very different conditions.
The reason I pointed out the relation to corrigibility is not because I think that’s the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that “if you run this AI, this won’t make things worse than not running the AI”, no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments for why it might be, but there are also arguments in the opposite direction. Specifically, if you believe that debate is a necessary component of Dialogic RL, then it seems like the dangerousness bound is infeasible: the AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can’t (empirically and in the worst case) prove a negative.