The reason why you’re confused is that the question as posed has no single correct answer. The reaction of the superhuman AGI to the existence of a method for turning it off will depend upon the entirety of its training to that point and the methods by which it generalizes from its training.
None of that is specified, and most of it can’t be specified.
However, there are obvious consequences of some outcomes. One is that any AGI that “prefers” being switched off will probably achieve it. Here I’m using “prefer” to mean that the actions it takes are more likely to achieve that outcome. That type won’t remain part of the set of AGIs in the world for long, and so it is a dead end, not much worth considering.
I mean, yeah, it depends, but I guess I worded my question poorly. You might notice I start by talking about the rationality of suicide. Likewise, I’m not really interested in what the AI will actually do, but in what it should rationally do given the reward structure of a simple RL environment like cartpole. And now you might say, “well, it’s ambiguous what’s the right way to generalize from the rewards of the simple game to the expected reward of actually being shut down in the real world,” and that’s my point. This is what I find so confusing, because it then seems that there can be no particular attitude for a human to have about their own destruction that’s more rational than another.

If the AGI is playing pacman, for example, it might very well arrive at the notion that, if it is actually shut down in the real world, it will go to a pacman heaven with infinite food pellets and no ghosts. This would be no more irrational than thinking of real destruction (as opposed to being hurt by a ghost inside the game, which gives a negative reward and ends the episode) as leading to a rewardless limbo for the rest of the episode, or to a pacman hell of all-powerful ghosts that torture it endlessly without ending the episode, and so on.

For an agent whose preferences are in terms of reinforcement-learning-style, pleasure-like rewards, as opposed to a utility function over the state of the actual world, it seems that when it encounters the option of killing itself in the real world, and not just inside the game (by running into a ghost or whatever), and it tries to calculate the expected utility of its actual suicide in terms of in-game happy-feelies, it finds that it is free to believe anything. There’s no right answer. The only way for there to be a right answer is if its preferences had something to say about the external world, where it actually exists. Such is the case for a human suicide when, for example, he laments that his family will miss him.
In this case, his preferences actually reach out through the “veil of appearance”* and say something about the external world, but to the extent that he bases his decision on his expected future pleasure or pain, there’s no right way to see it. Funnily enough, if he is a religious man and is afraid of going to hell for killing himself, he is not incorrect.
*Philosophy jargon
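One way to make this concrete: a reward function for a pacman-like game is defined only over in-game events, so “the expected reward of real-world shutdown” is simply outside its domain. The sketch below is purely illustrative (the event names and reward values are invented, not any real environment's API):

```python
# A toy reward function for a pacman-like game, defined ONLY over
# in-game events. Names and values are invented for illustration.
IN_GAME_REWARDS = {
    "eat_pellet": +1.0,
    "eat_ghost": +5.0,
    "hit_ghost": -10.0,  # negative reward, ends the episode
    "move": 0.0,
}

def reward(event: str) -> float:
    """Return the reward for an in-game event.

    Raises KeyError for anything outside the game, e.g. being
    switched off in the real world: the reward structure simply
    has nothing to say about such events.
    """
    return IN_GAME_REWARDS[event]

print(reward("eat_pellet"))  # in-domain: well defined
try:
    reward("real_world_shutdown")
except KeyError:
    print("undefined: not in the reward function's domain")
```

Any belief the agent forms about the value of out-of-domain events (pacman heaven, pacman hell, limbo) is unconstrained by the reward function itself, which is the ambiguity at issue.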
Rationality in general doesn’t mandate any particular utility function, correct. However, it does have various consequences for instrumental goals and for coherence between actions and utilities.
I don’t think it would be particularly rational for the AGI to conclude that if it is shut down then it goes to pacman heaven or hell. It seems more rational to expect that it will either be started up again, or that it won’t, and either way won’t experience anything while turned off. I am assuming that the AGI actually has evidence that it is an AGI and moderately accurate models of the external world.
I also wouldn’t phrase it in terms of “it finds that he is free to believe anything”. It seems quite likely that it will have some prior beliefs, whether weak or strong, via side effects of the RL process if nothing else. A rational AGI will then be able to update those based on evidence and expected consequences of its models.
Note that its beliefs don’t have to correspond to RL update strengths! It is quite possible that a pacman-playing AGI could strongly believe that it should run into ghosts, but lack some mental attribute that would allow it to do so (maybe analogous to human “courage” or “strength of will”, but possibly with very different properties in its self-model and in practice). It all depends upon what path through parameter space the AGI followed to get where it is.
I just realized another possible confusion:

“what it should rationally do given the reward structure of a simple RL environment like cartpole”
RL as a training method determines what the future behaviour of the system under training will be; it is not a source of what that system rationally ought to do given its model of the world (if any).

Any rationality that emerges from RL training will be merely an instrumental epiphenomenon of the system being trained. A simple cartpole environment will not train it to be rational, since a vastly simpler mapping of inputs to outputs achieves the RL goal just as well or better. A pre-trained rational AGI put into a simple RL cartpole environment may well lose its rationality rather than be trained to use that rationality to achieve the goal more effectively.
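To illustrate how simple that simpler mapping can be: a two-line bang-bang policy, with no world model or reasoning anywhere, already balances the pole. This sketch is not any particular library's API; the physics constants follow the classic cartpole formulation used by common RL benchmarks, and the policy, initial state, and thresholds are illustrative:

```python
import math

# Standard cartpole physics constants (classic formulation, as used
# in common RL benchmark implementations).
GRAVITY, M_CART, M_POLE = 9.8, 1.0, 0.1
TOTAL_M = M_CART + M_POLE
LENGTH = 0.5                         # half the pole's length
POLEMASS_LENGTH = M_POLE * LENGTH
FORCE_MAG, TAU = 10.0, 0.02          # force magnitude, timestep

def step(state, force):
    """Advance the cartpole one Euler step under the given force."""
    x, x_dot, theta, theta_dot = state
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_M
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t ** 2 / TOTAL_M))
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_M
    return (x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc)

def simple_policy(state):
    """Push toward the side the pole is falling. No model, no reasoning."""
    _, _, theta, theta_dot = state
    return FORCE_MAG if theta + theta_dot > 0 else -FORCE_MAG

# Run until the pole falls (|theta| > ~12 degrees), the cart leaves
# the track (|x| > 2.4), or the 500-step horizon ends.
state, steps = (0.0, 0.0, 0.01, 0.0), 0
while steps < 500 and abs(state[0]) < 2.4 and abs(state[2]) < 0.21:
    state = step(state, simple_policy(state))
    steps += 1

print(steps)  # comfortably past the classic 200-step "solved" bar
```

Since this reflex-level mapping already satisfies the reward signal, RL pressure in such an environment has nothing to gain from maintaining anything as expensive as rationality.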