I’m posting a few research directions from my research agenda about which I haven’t written much elsewhere (except maybe in the MIRIx Discord server), and for which I so far haven’t had the time to write a full-length essay with mathematical details. Each direction is in a separate child comment.
In last year’s essay about my research agenda I wrote about an approach I call “learning by teaching” (LBT). In LBT, an AI is learning human values by trying to give advice to a human and seeing how the human changes eir behavior (without an explicit reward signal). Roughly speaking, if the human permanently changes eir behavior as a result of the advice, then one can assume the advice was useful. Partial protection against manipulative advice is provided by a delegation mechanism, which ensures the AI only produces advice that is in the space of “possible pieces of advice a human could give” in some sense. However, this protection seems insufficient since it allows for giving all arguments in favor of a position without giving any arguments against a position.
To add more defense against manipulation, I propose building on the “AI debate” idea. However, in this scheme we don’t need more than one AI. In fact, this is an instance of a general observation: for any protocol P involving multiple AIs, there is a protocol Q involving just one AI that works (at least roughly, qualitatively) just as well. Proof sketch: if we can prove that under assumptions X the protocol P is safe/effective, then we can design a single AI Q which has assumptions X baked into its prior. Such an AI would be able to understand that simulating protocol P would lead to a safe/effective outcome, and would only choose a different strategy if it leads to an even better outcome under the same assumptions.
The way we use “AI debate” is not by implementing an actual AI debate. Instead, we use it to formalize our assumptions about human behavior. In ordinary IRL, the usual assumption is “a human is a (nearly) optimal agent for eir utility function”. In the original version of LBT, the assumption was of the form “a human is (nearly) optimal when receiving optimal advice”. In debate-LBT the assumption becomes “a human is (nearly) optimal* when exposed to a debate between two agents at least one of which is giving optimal advice”. Here, the human observes this hypothetical debate through the same “cheap talk” channel through which ey receives advice from the single AI.
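To make the progression of assumptions explicit, here is one way to write them symbolically (the notation is entirely mine: π_H is the human policy, possibly as a function of the advice received through the cheap-talk channel, U_H is the human utility function, and a* stands for optimal advice):

```latex
% IRL: the raw human policy is (nearly) optimal for the human utility U_H.
\[ \mathrm{EU}_{U_H}(\pi_H) \ge \max_{\pi} \mathrm{EU}_{U_H}(\pi) - \epsilon \]

% LBT: the human is (nearly) optimal when receiving optimal advice a^*.
\[ \mathrm{EU}_{U_H}(\pi_H(a^*)) \ge \max_{\pi} \mathrm{EU}_{U_H}(\pi) - \epsilon \]

% debate-LBT: the human is (nearly) optimal when observing a debate (a_1, a_2)
% in which at least one side gives optimal advice.
\[ \forall (a_1, a_2)\ \text{with}\ a_1 = a^*\ \text{or}\ a_2 = a^*:\quad
   \mathrm{EU}_{U_H}(\pi_H(a_1, a_2)) \ge \max_{\pi} \mathrm{EU}_{U_H}(\pi) - \epsilon \]
```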
Notice that debate can be considered a form of interactive proof system (with two or more provers). However, the requirements are different from those of classical proof systems. In classical theory, the requirement is “when the prover is honestly arguing for a correct proposition, the verifier is convinced; for no prover can the verifier be convinced of a false proposition”. In “debate proof systems” the requirement is “if at least one prover is honest, the verifier comes to the correct conclusion”. That is, we don’t guarantee anything when both provers are dishonest. It is easy to see that such debate proof systems can handle any problem in PSPACE: given a game, both provers state their assertions as to which side wins the game, and if they disagree they have to play the game for the corresponding sides.
*Fiddling with the assumptions a little, instead of “optimal” we can probably just say that the AI is guaranteed to achieve this level of performance, whatever it is.
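To make the “debate proof system” requirement concrete, here is a toy sketch (entirely my own illustration, not part of the proposal): for a simple Nim-like game, each prover claims which player wins under optimal play, and when the claims disagree the verifier has the provers play the game for the sides they claimed. As long as at least one prover is honest (truthful claim plus optimal play), the verifier reaches the correct conclusion no matter what the other prover does.

```python
# Toy sketch of a "debate proof system" verifier (my illustration, not part of
# the original proposal). The game is Nim-like: a heap of n stones, players
# alternate removing 1 or 2 stones, and whoever takes the last stone wins.
# A position n is winning for the player to move iff n % 3 != 0.

def legal_moves(heap):
    return [m for m in (1, 2) if m <= heap]

def first_player_wins(heap):
    """Exact minimax: does the player to move win from this heap size?"""
    return any(not first_player_wins(heap - m) for m in legal_moves(heap))

def honest_prover(heap):
    """Truthful claim about the winner, plus an optimal playing strategy."""
    claim = 1 if first_player_wins(heap) else 2
    def play(k):
        for m in legal_moves(k):              # prefer moves that leave the
            if not first_player_wins(k - m):  # opponent in a losing position
                return m
        return legal_moves(k)[0]
    return claim, play

def dishonest_prover(false_claim):
    """Claims whatever it likes and plays greedily (always removes 1 stone)."""
    return false_claim, lambda k: 1

def verifier(heap, prover_a, prover_b):
    claim_a, play_a = prover_a
    claim_b, play_b = prover_b
    if claim_a == claim_b:
        return claim_a                        # no dispute: accept the claim
    plays = {claim_a: play_a, claim_b: play_b}  # each prover plays "its" side
    to_move = 1
    while heap > 0:
        heap -= plays[to_move](heap)
        if heap == 0:
            return to_move                    # the mover took the last stone
        to_move = 2 if to_move == 1 else 1

n = 7                                         # 7 % 3 != 0, so player 1 wins
truth = 1 if first_player_wins(n) else 2
honest, liar = honest_prover(n), dishonest_prover(false_claim=3 - truth)
assert verifier(n, honest, liar) == truth     # honest prover forces the truth
assert verifier(n, liar, honest) == truth     # regardless of ordering
```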
A variant of Christiano’s IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:
(i) If we always make the null prediction, the expected utility is lower bounded by some constant.
(ii) If our predictions sample the n-step future for a given policy π, then our expected utility is lower bounded by some function F(u,n) of the expected utility u of π and n. F is s.t. for sufficiently low u, F(u,n)≤u, but for sufficiently high u, F(u,n)>u (in particular, the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it’s natural to assume F(F(u,n),m)≈F(u,nm) for u>>0.
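As a purely illustrative example of a function satisfying (ii), take F(u,n)=u·n^α for some α>0: then F(u,n)≤u for u≤0, F(u,n)>u for u>0, and F(F(u,n),m)=F(u,nm) exactly. Starting from the constant in (i) and repeatedly applying the bound gives an increasing sequence of utility guarantees, which is the amplification step. A minimal numeric sketch (F, α and all constants here are hypothetical choices of mine, not anything specified above):

```python
# Hypothetical choice of F satisfying assumption (ii); alpha, u0 and n are
# illustrative constants of mine.
alpha = 0.5

def F(u, n):
    # F(u, n) <= u for u <= 0, F(u, n) > u for u > 0,
    # and F(F(u, n), m) == F(u, n * m) exactly.
    return u * n ** alpha

u0 = 1.0   # the bound from assumption (i), assumed positive ("high enough")
n = 4      # how many steps into the future each round of predictions samples

u = u0
for step in range(5):
    print(f"after {step} amplification rounds: guaranteed utility >= {u:.2f}")
    u = F(u, n)   # predicting the n-step future of the current policy
```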
The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies and “corrupting” predictions that cause damage just by seeing them.
However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem if we introduce an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t. some metric d. Then, total variation distance is replaced by the Kantorovich-Rubinstein distance defined w.r.t. d. The question is where we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.
This idea was inspired by a discussion with Discord user @jbeshir
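To illustrate the point about replacing total variation with the Kantorovich-Rubinstein distance: if the utility function is L-Lipschitz w.r.t. d, then |E_P[U]−E_Q[U]|≤L·W1(P,Q), and W1 can be tiny even when the two distributions have disjoint supports and total variation is maximal. A toy numeric comparison (the distributions below are arbitrary examples of mine):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two distributions over points on the real line (d = absolute difference).
# They have disjoint supports, but every atom is shifted by only 0.01.
support_p, probs_p = np.array([0.00, 1.00]), np.array([0.5, 0.5])
support_q, probs_q = np.array([0.01, 1.01]), np.array([0.5, 0.5])

# Total variation distance: disjoint supports make it maximal.
points = np.union1d(support_p, support_q)
mass_p = np.array([probs_p[support_p == x].sum() for x in points])
mass_q = np.array([probs_q[support_q == x].sum() for x in points])
tv = 0.5 * np.abs(mass_p - mass_q).sum()

# Kantorovich-Rubinstein (Wasserstein-1) distance w.r.t. d: the optimal
# coupling just shifts each atom by 0.01.
w1 = wasserstein_distance(support_p, support_q, probs_p, probs_q)

print(f"total variation distance:    {tv:.3f}")   # 1.000
print(f"Kantorovich-Rubinstein (W1): {w1:.3f}")   # 0.010

# For a 1-Lipschitz utility U, |E_P[U] - E_Q[U]| <= W1(P, Q), so the penalty
# for this prediction error scales like 0.01 rather than like the TV bound 1.
```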
Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix r(s,t)). This should be regarded as a game where a different player controls the action at every state. We can now look for value learning protocols that converge to a Nash* (or other kind of) equilibrium in this game.
The simplest setting would be: every time you visit a state, you learn the reward of all previously visited states w.r.t. the reward function of the current state. Alternatively, every time you visit a state, you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an ϵ-Nash equilibrium condition with ϵ being the regret (see the toy sketch after the footnote). More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and other value learning protocols.
This seems like a more elegant way to model “corruption” than as a binary or continuous one dimensional variable like I did before.
*Note that although for general games, even purely cooperative ones, Nash equilibria can be suboptimal due to coordination problems, for this type of game it doesn’t happen: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation, which implies global optimality.
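As a toy illustration of the object involved (my own minimal formalization: I use the convention that the player controlling state s gets payoff equal to eir discounted value at s under eir reward row r(s,·); the payoff convention and all numbers are assumptions of mine), here is how one could check the ϵ-Nash condition for a given policy:

```python
import numpy as np

# Toy check of the epsilon-Nash condition when the player at state s has its
# own reward row r[s, :] over states. Payoff convention (discounted value at
# the controlled state) and all numbers are illustrative assumptions of mine.

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probs
r = rng.normal(size=(S, S))                  # r[s, t]: player s's reward at state t

def values(policy, reward_row):
    """Discounted state values under `policy` for one player's reward row."""
    P_pi = P[np.arange(S), policy]           # S x S transition matrix under policy
    return np.linalg.solve(np.eye(S) - gamma * P_pi, reward_row)

def nash_epsilon(policy):
    """Largest gain any state-player gets from a unilateral deviation."""
    eps = 0.0
    for s in range(S):
        base = values(policy, r[s])[s]       # player s's current payoff
        for a in range(A):
            dev = policy.copy()
            dev[s] = a                       # deviate only at state s
            eps = max(eps, values(dev, r[s])[s] - base)
    return eps

policy = np.zeros(S, dtype=int)
print("epsilon for the all-zeros policy:", nash_epsilon(policy))

# Purely cooperative special case (all rows equal): a policy with epsilon == 0
# satisfies the Bellman optimality condition, as in the footnote above.
r[:] = r[0]
print("cooperative epsilon:", nash_epsilon(policy))
```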
It is an interesting problem to write explicit regret bounds for reinforcement learning with a prior that is the Solomonoff prior or something similar. Of course, any regret bound requires dealing with traps. The simplest approach is leaving only environments without traps in the prior (there are technical details involved that I won’t go into right now). However, after that we are still left with a different major problem: the regret bound we get is very weak. This happens because the prior contains sets of hypotheses of the form “program template P augmented by a hard-coded bit string b of length n”. Such a set contains 2^n hypotheses, and its total probability mass is approximately 2^{-|P|}, which is significant for short P (even when n is large). However, the definition of regret requires our agent to compete against a hypothetical agent that knows the true environment, which in this case means knowing both P and b. Such a contest is very hard, since learning n bits can take a long time for large n.
Note that the definition of regret depends on how we decompose the prior into a convex combination of individual hypotheses. To solve this problem, I propose redefining regret in this setting by grouping the hypotheses in a particular way. Specifically, in algorithmic statistics there is the concept of sophistication. The sophistication of a bit string x is defined as follows. First, we consider the Kolmogorov complexity K(x) of x. Then we consider pairs (Q,y) where Q is a program, y is a bit string, Q(y)=x and |Q|+|y|≤K(x)+O(1). Finally, we minimize over |Q|. The minimal |Q| is called the sophistication of x. For our problem, we are interested in the minimal Q itself: I call it the “sophisticated core” of x. We now group the hypotheses in our prior by sophisticated cores, and define (Bayesian) regret w.r.t. this grouping.
Coming back to our problematic set of hypotheses, most of it is now grouped into a single hypothesis, corresponding to the sophisticated core of P. Therefore, the reference agent in the definition of regret now knows P but doesn’t know b, making it feasible to compete with it.
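In symbols, under my reading (the notation is mine: EU_μ(π) denotes the expected utility of policy π in environment μ and ζ is the prior), sophistication picks out the shortest “structured part” of a description, and the grouped regret lets the reference agent know only the group, i.e. the sophisticated core, of the true environment rather than the environment itself:

```latex
% Sophistication of a bit string x (c is the O(1) slack constant):
\[ \mathrm{soph}_c(x) = \min\{\, |Q| \;:\; \exists y.\ Q(y) = x \ \text{and}\ |Q| + |y| \le K(x) + c \,\} \]

% Ordinary Bayesian regret: the reference agent knows the true environment \mu.
\[ \mathrm{Reg}(\pi) = \mathbb{E}_{\mu \sim \zeta}\!\left[\, \max_{\pi^*} \mathrm{EU}_\mu(\pi^*) - \mathrm{EU}_\mu(\pi) \,\right] \]

% Grouped regret: hypotheses are partitioned into groups H_Q by sophisticated
% core Q, and the reference agent knows only which group the environment is in.
\[ \mathrm{Reg}_{\mathrm{soph}}(\pi) = \sum_{Q} \zeta(H_Q)\left( \max_{\pi^*} \mathbb{E}_{\mu \sim \zeta(\cdot\,|\,H_Q)}\!\left[\mathrm{EU}_\mu(\pi^*)\right] - \mathbb{E}_{\mu \sim \zeta(\cdot\,|\,H_Q)}\!\left[\mathrm{EU}_\mu(\pi)\right] \right) \]
```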