Game-aligned agents aren’t useful for AI control as complete agents: give them enough data about the real world and they start caring about it. The idea applies more straightforwardly to very specialized sub-agents.
It’s misleading to say that such agents assign probability zero to the real world, since the computations they optimize don’t naturally talk about things like worlds at all. For example, consider a chess-playing AI that should learn to win against a fixed opponent program. It only reasons about chess; there is no reason to introduce concepts like physical worlds into its reasoning, because they probably aren’t any help for playing chess. (Unless the opponent is a sufficiently large program written by humans, which would itself amount to data about the real world.)
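To make "it only reasons about chess" concrete, here is a minimal sketch (not from the original text, and using the python-chess library purely for illustration) of an agent whose evaluation and move choice are defined entirely over board positions. Nothing in the computation it optimizes refers to a physical world, so there is no natural place for such a concept to enter.

```python
# Toy sketch: a game-aligned agent whose entire "ontology" is the chess board.
# Assumes the python-chess library and that the agent plays White.
import chess


def material_score(board: chess.Board) -> int:
    """Crude evaluation defined only in terms of the board."""
    values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
              chess.ROOK: 5, chess.QUEEN: 9}
    score = 0
    for piece_type, value in values.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score


def choose_move(board: chess.Board) -> chess.Move:
    """Greedy one-ply search: the agent's preferences range over chess
    positions and nothing else."""
    def score_after(move: chess.Move) -> int:
        board.push(move)
        s = material_score(board)
        board.pop()
        return s
    return max(board.legal_moves, key=score_after)
```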
There do seem to be some caveats: basically, an agent shouldn’t be given either motivation or opportunity to pursue consequentialist reasoning about the real world. If it finds an almost perfect strategy and has nothing more to do, it might start looking into more esoteric ways of improving the outcome, ways that pass through worlds we might care about. This possibility is already implausible in practice, but adding a term to the agent’s values for solving some hard, harmless problem, like inverting hashes, should reliably take its attention away from physical worlds.
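A toy sketch of that suggestion (my illustration, not anything specified in the original; the target hash, weight, and search space are all made up): the agent’s value is its game performance plus a weighted bonus for inverting a fixed hash, so that spare optimization effort goes into the puzzle rather than anywhere more exotic.

```python
# Illustrative only: augment the agent's value with a term for a hard,
# harmless side problem (brute-force SHA-256 preimage search over short
# strings), so leftover optimization pressure is absorbed by the puzzle.
import hashlib
import itertools
import string

TARGET_HASH = hashlib.sha256(b"zz").hexdigest()  # stand-in target


def hash_inversion_bonus(candidate: bytes) -> float:
    """1.0 if the candidate inverts the target hash, else 0.0."""
    return 1.0 if hashlib.sha256(candidate).hexdigest() == TARGET_HASH else 0.0


def combined_value(game_score: float, candidate: bytes, weight: float = 10.0) -> float:
    """The agent's value: game performance plus a weighted bonus for the
    harmless side problem."""
    return game_score + weight * hash_inversion_bonus(candidate)


# Exhaustive search over 1- and 2-character lowercase candidates, standing in
# for the "spare" effort the agent would otherwise spend on esoteric plans.
best_candidate = max(
    ("".join(chars).encode()
     for n in (1, 2)
     for chars in itertools.product(string.ascii_lowercase, repeat=n)),
    key=lambda cand: combined_value(0.0, cand),
)
```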
(My interest is in pushing this towards complete agents, in particular letting them care about things like the real world. Since bounded agents can’t perfectly channel idealized preference, there is a question of how they can do good things while remaining misaligned to some extent, so studying simple concepts of misalignment might help.)