But it is quite interesting… so you’re proposing that we can make safe AIs by, e.g., giving them a prior which puts zero probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past / future (thus giving us a more “rational” contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree, based on the open-source prisoner’s dilemma.
Anyways, I think that’s a promising avenue to investigate.
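To make the acausal-trade worry concrete, here is a minimal sketch of the open-source prisoner’s dilemma dynamic (the code and payoffs are mine, purely illustrative): two programs that read each other’s source can end up cooperating without modelling any past or future, just by conditioning on each other’s code.

```python
# Minimal open-source prisoner's dilemma: each player is a function from
# (own_source, opponent_source) to "C" or "D". A clique bot cooperates exactly
# when the opponent's source matches its own, so two copies cooperate purely
# by inspecting code -- no shared history or future is needed.

import inspect

def clique_bot(my_source: str, opp_source: str) -> str:
    return "C" if opp_source == my_source else "D"

def defect_bot(my_source: str, opp_source: str) -> str:
    return "D"

PAYOFFS = {  # (my move, their move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1,
}

def play(p1, p2):
    s1, s2 = inspect.getsource(p1), inspect.getsource(p2)
    m1, m2 = p1(s1, s2), p2(s2, s1)
    return PAYOFFS[(m1, m2)], PAYOFFS[(m2, m1)]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation between copies
print(play(clique_bot, defect_bot))  # (1, 1): clique bot defends against a defector
```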
Having a good model of the world seems like a necessary condition for an AI to pose a significant X-risk.
Game-aligned agents aren’t useful for AI control as complete agents, since if you give them enough data from the real world, they start caring about it. The idea applies more straightforwardly to very specialized sub-agents.
It’s misleading to say that such agents assign probability zero to the real world, since the computations they optimize don’t naturally talk about things like worlds at all. For example, consider a chess-playing AI that should learn to win against a fixed opponent program. It only reasons about chess; there is no reason to introduce concepts like physical worlds into its reasoning, because they probably aren’t any help for playing chess. (Unless the opponent is a sufficiently large program written by humans, which would itself be data about the real world.)
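As a toy illustration of that point (my own sketch, with tic-tac-toe standing in for chess to keep it short): every object the agent’s search constructs or scores is a game position, so there is simply no slot in its reasoning where a concept like “physical world” could appear.

```python
# Toy stand-in for the chess example (tic-tac-toe to keep it short): the
# agent's entire "ontology" is the Board type below -- its search never
# builds or evaluates anything that isn't a game position.

from functools import lru_cache

Board = tuple  # 9 cells: "X", "O", or " "
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(b: Board):
    for i, j, k in LINES:
        if b[i] != " " and b[i] == b[j] == b[k]:
            return b[i]
    return None

@lru_cache(maxsize=None)
def value(b: Board, player: str) -> int:
    """Minimax value for X: +1 win, -1 loss, 0 draw. Only boards appear here."""
    w = winner(b)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, c in enumerate(b) if c == " "]
    if not moves:
        return 0
    nxt = "O" if player == "X" else "X"
    vals = [value(b[:i] + (player,) + b[i+1:], nxt) for i in moves]
    return max(vals) if player == "X" else min(vals)

def best_move(b: Board, player: str) -> int:
    nxt = "O" if player == "X" else "X"
    moves = [i for i, c in enumerate(b) if c == " "]
    return (max if player == "X" else min)(
        moves, key=lambda i: value(b[:i] + (player,) + b[i+1:], nxt))

empty = (" ",) * 9
print(value(empty, "X"))      # 0: perfect play is a draw
print(best_move(empty, "X"))  # an optimal opening square
```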
There do seem to be some caveats: basically, an agent shouldn’t be given either the motivation or the opportunity to pursue consequentialist reasoning about the real world. If it finds an almost perfect strategy and has nothing more to do, it might start looking into more esoteric ways of improving the outcome, ways that pass through worlds we might care about. This possibility is already implausible in practice, but adding a term to the agent’s values for solving some hard, harmless problem, like inverting hashes, should reliably keep its attention away from physical worlds.
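A crude sketch of that last suggestion (the names, weights, and the “matching hex characters” proxy are all made up for illustration): the agent’s value is its game score plus a weighted term for an open-ended, world-free problem, so leftover optimization pressure has somewhere harmless to go.

```python
# Illustrative objective with a "hard harmless problem" term bolted on.
# The hash part is a toy: matching hex characters of a target digest is not
# real progress on inverting SHA-256, it just gives the agent an open-ended,
# world-free task to pour spare optimization into.

import hashlib

TARGET_DIGEST = hashlib.sha256(b"some fixed target").hexdigest()

def hash_term(candidate: bytes) -> float:
    digest = hashlib.sha256(candidate).hexdigest()
    return sum(a == b for a, b in zip(digest, TARGET_DIGEST)) / len(TARGET_DIGEST)

def total_value(game_score: float, candidate: bytes, weight: float = 10.0) -> float:
    # Once the game is essentially solved, the weighted hash term dominates
    # whatever gradient of value is left for the agent to chase.
    return game_score + weight * hash_term(candidate)

print(total_value(1.0, b"guess 1"))
```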
(My interest is in pushing this towards complete agents, in particular letting them care about things like the real world. Since bounded agents can’t perfectly channel idealized preference, there is a question of how they can do good things while remaining misaligned to some extent, so studying simple concepts of misalignment might help.)
The issue in the OP is that the possibility of other situations influences the agent’s decision. The standard way of handling this is to agree to disregard other situations, including by appealing to Omega’s stipulated ability to inspire belief (it’s the whole reason for introducing the trustworthiness clause). This belief, if the reality of situations is treated as equivalent to their probability in the agent’s eyes, expels the other situations from consideration.
The idea Paul mentioned is just another way of making sure that the other situations don’t intrude on the thought experiment. Since the main principle is to get this done somehow, it doesn’t really matter whether a universal prior favors anti-muggers over muggers; in that case we’d just need to change the thought experiment.
Thought experiments are not natural questions that rate the usefulness of decision theories; they are tests that examine particular features of decision theories. So if such an investigation goes too far afield (as with looking into the a priori weights of anti-muggers), that calls for a change in the thought experiment.
I reason as follows:
Omega inspires belief only after the agent encounters Omega.
According to UDT, the agent should not update its policy based on this encounter; it should simply follow it.
Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
- the agent does update, in which case, why not update on the result of the coin-flip? or
- the agent doesn’t update, in which case what matters is simply the optimal policy given the original prior.
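To make that fork concrete, here is a small numeric sketch; it assumes the scenario under discussion is the standard counterfactual mugging, and the payoffs are purely illustrative.

```python
# Small numeric sketch of the two branches above, using the usual
# counterfactual-mugging payoffs (pay $100 on tails; receive $10,000 on heads
# iff Omega predicts you would have paid). The numbers are illustrative.

POLICIES = {"pay": True, "refuse": False}

def outcome(policy_pays: bool, coin: str) -> float:
    if coin == "heads":
        return 10_000.0 if policy_pays else 0.0  # rewarded only if predicted to pay
    return -100.0 if policy_pays else 0.0        # tails: asked to hand over $100

# No-update (UDT-style) evaluation: score each whole policy against the
# original prior (a fair coin), before conditioning on meeting Omega at all.
prior_eu = {
    name: 0.5 * outcome(pays, "heads") + 0.5 * outcome(pays, "tails")
    for name, pays in POLICIES.items()
}
print(prior_eu)      # {'pay': 4950.0, 'refuse': 0.0} -> the prior-optimal policy pays

# Updating on "the coin came up tails and I am being asked":
posterior_eu = {name: outcome(pays, "tails") for name, pays in POLICIES.items()}
print(posterior_eu)  # {'pay': -100.0, 'refuse': 0.0} -> the updated agent refuses
```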
This seems only loosely related to my OP.