This idea was inspired by a discussion with Discord user @jbeshir.
Model dynamically inconsistent agents (in particular humans) as having a different reward function at every state of the environment MDP (i.e. at every state we have a reward function that assigns values both to this state and to all other states: we have a reward matrix r(s,t)). This should be regarded as a game in which a different player controls the action at every state. We can now look for value learning protocols that converge to a Nash* (or other kind of) equilibrium in this game.
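Below is a minimal sketch of how this setup might be represented. The class name, the tabular encoding, and the choice to evaluate each player from the state it controls are my illustrative assumptions, not something specified above.

```python
import numpy as np

class DynamicallyInconsistentMDP:
    """Sketch: an MDP where state s's player has its own reward function r[s, :]."""

    def __init__(self, P, r, gamma):
        # P[s, a, t]: probability of reaching t from s under action a
        # r[s, t]:    reward that state s's player assigns to state t
        # gamma:      discount factor
        self.P, self.r, self.gamma = P, r, gamma
        self.n_states, self.n_actions, _ = P.shape

    def player_value(self, s, policy):
        """Value of the joint (deterministic) policy starting from state s,
        measured by state s's own reward function r[s, :]."""
        # Markov chain induced by the joint policy: M[x, t] = P[x, policy[x], t]
        M = self.P[np.arange(self.n_states), policy, :]
        # Solve v(x) = r[s, x] + gamma * sum_t M[x, t] v(t) as a linear system
        v = np.linalg.solve(np.eye(self.n_states) - self.gamma * M, self.r[s])
        return v[s]

    def is_epsilon_nash(self, policy, eps=0.0):
        """No state's player gains more than eps by deviating at the one
        state it controls."""
        for s in range(self.n_states):
            base = self.player_value(s, policy)
            for a in range(self.n_actions):
                deviated = policy.copy()
                deviated[s] = a
                if self.player_value(s, deviated) > base + eps:
                    return False
        return True
```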
In the simplest setting, every time you visit a state you learn the rewards of all previously visited states w.r.t. the reward function of the current state. Alternatively, every time you visit a state you can ask about the reward of one previously visited state w.r.t. the reward function of the current state. This is the analogue of classical reinforcement learning with an explicit reward channel. We can now try to prove a regret bound, which takes the form of an ϵ-Nash equilibrium condition, with ϵ being the regret. More complicated settings would be analogues of Delegative RL (where the advisor also follows the reward function of the current state) and of other value learning protocols.
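A rough sketch of the first feedback protocol (the "learn all previous states" variant), continuing the assumptions of the previous snippet; `env`, `agent` and `true_r` are hypothetical stand-ins:

```python
def run_episode(env, agent, true_r, horizon):
    """On each visit to a state s_k, the agent is told r(s_k, s_j) for every
    previously visited state s_j, i.e. how the current self values the past."""
    s = env.reset()
    history = [s]
    for _ in range(horizon):
        # Feedback: rewards of all previous states w.r.t. the current reward function
        feedback = [(prev, true_r[s, prev]) for prev in history[:-1]]
        a = agent.act(s, feedback)
        s = env.step(a)
        history.append(s)
    return history
```

The "ask about one state per visit" variant would replace the feedback list with a single query chosen by the agent.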
This seems like a more elegant way to model “corruption” than as a binary or continuous one-dimensional variable, as I did before.
*Note that although for general games, even purely cooperative ones, Nash equilibria can be suboptimal due to coordination problems, this doesn’t happen for games of this type: in the purely cooperative case, the Nash equilibrium condition becomes the Bellman equation, which implies global optimality.
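To spell the footnote out (in a standard discounted formulation, which the above does not commit to): in the purely cooperative case r(s,t) = r(t) for every s, so all players share a single value function, and the condition that no state's player benefits from deviating at the one state it controls reduces to the Bellman optimality equation

$$V(s) = r(s) + \gamma \max_{a} \sum_{t} P(t \mid s, a)\, V(t),$$

whose solution characterizes the globally optimal policy.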