It seems like AIXI with a time horizon of 1 is a very different beast from AIXI with a longer time horizon. The big difference is that short-sighted AIXI will only try to take over (in the interest of giving itself reward) if it can succeed in a single time step.
What “a single time step” means here depends on what model Arthur learns, which may not be what we intend. For example, suppose a is an action which immediately disables the approval input terminal, or the data connection between Arthur and the terminal, via a network attack, and then takes an arbitrarily long time to secure access to the approval input terminal and give itself maximum approval. What is approval[T](a) according to Arthur’s model?
Overall, don’t you think it’s too strong to say “But unlike AIXI, Arthur will make no effort to manipulate these judgments,” even if Arthur, like short-sighted AIXI, is safer than standard AIXI? As another example, suppose Arthur discovers some sort of flaw in human psychology which lets it manipulate whoever is going to enter the next approval value into giving it maximum approval. Wouldn’t Arthur take advantage of that?
I think my description is basically fair, though I might be misunderstanding or just wrong.
There are two ways in which Arthur’s decisions affect approval[T](a): one is by choosing the action a, and the other is by changing the definition of approval[T](a). One-step AIXI’s counterfactual accounts for both, while Arthur’s accounts only for the first. This is what I meant by “Arthur will make no effort to manipulate these judgments.” Note that my proposal requires Hugh to provide “counterfactual” ratings for actions that were not chosen.
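For concreteness, here is a minimal sketch of the choice rule I have in mind for Arthur (this is my own illustration, not part of the proposal; the names `candidates` and `rating_models` are made up):

```python
# Minimal sketch of Arthur's choice rule as I'm describing it here.
# `rating_models` stands for Arthur's current posterior over the judgment
# approval[T]: a list of (weight, model) pairs, where each model can rate
# *every* candidate action, because Hugh supplies counterfactual ratings
# for actions that were not chosen.

def arthur_pick(candidates, rating_models):
    # The posterior over approval[T] is fixed before the choice is made, so
    # choosing an action cannot change which judgments get used; it only
    # changes which action those judgments are evaluated at.
    def expected_approval(a):
        return sum(weight * model(a) for weight, model in rating_models)
    return max(candidates, key=expected_approval)
```

One-step AIXI differs in exactly the second respect: the reward it conditions on is produced by the environment after the action, so the action can change the judgment (the reward channel) as well as being judged by it.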
Arthur may be motivated to manipulate Hugh’s judgments for other reasons. This is most obvious if Hugh would approve of manipulating Hugh’s judgments, but it could also happen if someone else displaced Hugh and “approved” of actions to manipulate Hugh’s judgments (especially in order to approve of actions that helped them displace Hugh).
In a note on the original document I gave an example to illustrate: suppose that action X causes Hugh to increase all his ratings by 0.2, but that in every case action Y is rated higher than action X by 0.1. Then Arthur will do Y, not X. (Because the expectation of Y is 0.1 higher than the expectation of X, no matter what Arthur believes.)
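To spell out the arithmetic: the +0.2 shift and the 0.1 gap come from the example, but the baseline ratings and Arthur’s credence p_x below are numbers I made up for illustration.

```python
# Checking the X-vs-Y example numerically.  Only the +0.2 shift and the
# 0.1 gap come from the example; everything else is made up.

p_x = 0.7                      # Arthur's credence that X ends up being taken (arbitrary)
base = {"X": 0.5, "Y": 0.6}    # hypothetical baseline ratings; Y beats X by 0.1
shift = 0.2                    # taking X causes Hugh to raise *all* his ratings by 0.2

# If X is taken, every rating (of X and of Y alike) goes up by 0.2, so each
# action's expected rating is its baseline plus p_x * shift.
expected = {a: base[a] + p_x * shift for a in base}

# Y's expected rating exceeds X's by exactly 0.1, whatever p_x is, so Arthur
# picks Y no matter what he believes about his own choice.
assert abs((expected["Y"] - expected["X"]) - 0.1) < 1e-9
print(expected)                # roughly {'X': 0.64, 'Y': 0.74}
```

The point is just that the shift applies to both options equally, so it cannot change the comparison between them.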
Even in the one-step takeover case, Arthur doesn’t have an intrinsic incentive to take over. It’s just that he could introduce such an incentive for himself, and so the belief “I will kill Hugh and reward myself if I (kill Hugh and reward myself if I…)” can become a self-fulfilling prophecy. This seems like a different issue from one-step AIXI’s desire to maximize reward. The main vulnerability is now from external actors.
Incidentally, it also seems relatively easy to prevent these attacks in the case of approval-directed agents. For example, we can make actions too small to encapsulate an entire plan for seizing control. If you want to use something like this to evaluate plans, then you reintroduce the issue, but now in an easier form (since you can watch the plan unfold and take other precautions). I’ll write a bit more soon about how I think a reasonable overseer could behave, and in particular how they can use access to Arthur to improve their evaluations, but for now these are not my largest concerns with the scheme.
(My largest concern is the plausibility of high-level approval-directed behavior emerging from low-level approval-directed behavior. This dynamic may be more brittle than high-level goal-directed behavior emerging from low-level goal-directed behavior.)
There are two ways in which Arthur’s decisions affect approval[T](a): one is by choosing the action a, and the other is by changing the definition of approval[T](a). One-step AIXI’s counterfactual accounts for both, while Arthur’s accounts only for the first.
I’m having trouble understanding this. Can you explain more what you mean here, and why you think it’s true?
Maybe you can address the specific example I mentioned earlier, which might help clear this up. Suppose there is an argument X which, if Hugh were to read it, would mindhack him into adopting an alien value system and also giving Arthur high approval. It seems to me that Arthur would choose to present this argument to Hugh (i.e., the human supervisor who is going to enter the next approval data point), for essentially the same reason that single-step AIXI would. Do you agree?
Arthur is making choices from a small set of options; say it’s just two options. (See here for how to move between small and large sets of options, and here for how to do this kind of thing safely.) Suppose the available actions are NULL and HACK, with the obvious effects. So there are four relevant numbers:
1. Hugh’s approval of NULL
2. Hugh’s approval of HACK
3. Hacked Hugh’s approval of NULL
4. Hacked Hugh’s approval of HACK
When I talked about “two ways...” I meant that counterfactually choosing HACK moves you from items 1 and 3 to items 2 and 4, by changing what you do; it also moves you from items 1 and 2 to items 3 and 4, by changing whether Hugh is hacked.
AIXI compares item 1 to item 4, and hacks if item 4 is higher. That is, when AIXI considers the counterfactual it applies both of these changes.
Arthur estimates P(hack), then compares ( (item 1) P(no hack) + (item 3) P(hack) ) to ( (item 2) P(no hack) + (item 4) P(hack) ).
So suppose that Hugh gives a high rating to NULL and a low rating to HACK, and hacked Hugh gives a high rating to HACK and a low rating to NULL. Then Arthur hacks iff P(hack) is high enough. It’s hard to know what would actually happen; it seems like it’s up to lower-level parts of the agent to break the ambiguity. This seems qualitatively different from AIXI, which would always HACK if it weren’t possible to achieve maximal reward through other channels.
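Here is the same 2×2 comparison as a small sketch (the specific ratings are mine, chosen so that item 4 beats item 1; only the structure of the two decision rules comes from the discussion above):

```python
# The four numbers from the list above, with made-up values: Hugh prefers
# NULL, hacked Hugh prefers HACK, and item 4 is a bit higher than item 1.
approval = {
    ("hugh",        "NULL"): 0.9,   # item 1
    ("hugh",        "HACK"): 0.1,   # item 2
    ("hacked_hugh", "NULL"): 0.1,   # item 3
    ("hacked_hugh", "HACK"): 1.0,   # item 4
}

def one_step_aixi_hacks():
    # AIXI's counterfactual applies both changes at once: it compares
    # item 1 (don't hack, Hugh rates) with item 4 (hack, hacked Hugh rates).
    return approval[("hacked_hugh", "HACK")] > approval[("hugh", "NULL")]

def arthur_hacks(p_hack):
    # Arthur holds the distribution over who does the rating fixed at his
    # own estimate P(hack), and only varies which action gets rated.
    def expected(action):
        return ((1 - p_hack) * approval[("hugh", action)]
                + p_hack * approval[("hacked_hugh", action)])
    return expected("HACK") > expected("NULL")

print(one_step_aixi_hacks())       # True: item 4 (1.0) beats item 1 (0.9)
print(arthur_hacks(p_hack=0.1))    # False: expecting not to hack, NULL looks better
print(arthur_hacks(p_hack=0.9))    # True: a confident "I will hack" can be self-fulfilling
```

With these made-up numbers Arthur hacks only if P(hack) is already above roughly 0.47, which is the ambiguity I mean: nothing at this level pins P(hack) down.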
That said, I don’t think this is a major part of the protection against this kind of attack. The larger protections (vs. 1-step AIXI) come from (1) having a small enough set of actions that 1-step attacks are unlikely, and (2) defining approval by considering how you would rate if the action didn’t happen.