Thank you for this answer - I really like it! I’m trying to wrap my head around the last 2 paragraphs.
2nd to last paragraph: Ok, so you’re saying that it could choose to self-pause unless it was in the highest-scoring world? I’m conceptualizing a possible world as an (action,result) pair, from which it could calculate (action, E[result]) pairs and then would choose the action with the highest E[result], while being paused would also provide max(E[result]). So are you saying it would limit the possible actions it would take? That seems like it wouldn’t change anything since it is always going to just take the one best action anyway. Or that by setting a self-pausing policy it could alter E[result]? That sounds possible to me but I don’t have a concrete example of how that would work. Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn’t win? Or do you have something else in mind?
Last paragraph: If just prior to being paused, there exists 1 scenario where it won’t be paused, then it could be an average, low, or high utility scenario. Obviously, average is fine. And if it’s really high, then it will get a lot of utility from being paused and certainly we’re not worried about it self-pausing when surrounded by agents trying to pause it. So, if it’s a really low utility scenario where it won’t end up being paused, then sure, it won’t get much utility being paused, but since it won’t get much utility if it doesn’t end up being paused, why should it have a preference? And, we could say—well, but it could fight back and then create a high-utility scenario—but then that would be the utility it would get if it doesn’t end up paused, so it would get the high utility paused and again be indifferent.
It sounds like understanding functional decision theory might help you understand the parts you’re confused about?
Like, would it go play the lottery (assuming money gives +utility for some reason) and pre-commit to pausing if it doesn’t win?
Yes, it would try to do whatever the highest-possible-score thing is, regardless of how unlikely it is
Or that by setting a self-pausing policy it could alter E[result]?
By setting a self-pausing policy at the earliest point in time it can, yes. (Though I’m not sure if I’m responding to what you actually meant, or to some other thing that my mind also thinks can match to these words, because the intended meaning isn’t super clear to me)
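To make the lottery version concrete, here is a toy sketch in Python with made-up numbers, and under my reading of the construction we’re discussing, namely that the score the agent gets when paused tracks its expected score conditional on not being paused under its own policy:

```python
# Toy sketch (my own made-up numbers, and my reading of the construction under
# discussion): assume that when the agent is paused, its score is set to its
# expected score conditional on NOT being paused under its own policy. Then a
# pre-committed self-pausing policy changes that conditional expectation.

P_WIN = 1e-6          # hypothetical probability of winning the lottery
U_WIN = 1_000_000.0   # hypothetical score in the lottery-win world
U_LOSE = 10.0         # hypothetical score in an ordinary no-win world

# Policy A: never self-pause. Unpaused worlds: {win, lose}.
e_unpaused_a = P_WIN * U_WIN + (1 - P_WIN) * U_LOSE   # ~11

# Policy B: pre-commit to self-pausing in every world except the win.
# Conditional on not being paused, the only world left is the win,
# so E[score | not paused] collapses to U_WIN.
e_unpaused_b = U_WIN

# Under the assumed construction, the paused score equals E[score | not paused],
# so under policy B the agent expects U_WIN whether or not it actually wins.
print(f"E[score | not paused], never-pause policy:      {e_unpaused_a:,.2f}")
print(f"E[score | not paused], pause-unless-win policy: {e_unpaused_b:,.2f}")
```

Under the pause-unless-win policy, the score it “locks in” by pausing is the win-world score rather than the ordinary-world score, which is why committing to such a policy at the earliest point it can is the relevant move.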
I’m conceptualizing a possible world as an (action,result) pair
(To be clear, I’m conceptualizing the agent as having Bayesian uncertainty about what world it’s in, and this is what I meant when writing about “worlds in the agent’s prior”)
And, we could say—well, but it could fight back and then create a high-utility scenario—but then that would be the utility it would get if it doesn’t end up paused, so it would get the high utility paused and again be indifferent.
An agent (aside from edge cases where it is programmed to be inconsistent in this way) would not have priors about what it will do that mismatch its policy for choosing what to actually do: any change to the latter logically corresponds to the agent having a different prior about itself, so an attempt to follow this logic would recur infinitely (each time picking a new action in response to the prior’s change, which in turn logically changes the prior, and so on). This seems like a case of ‘subjunctive dependence’ to me (even though it’s a bit of an edge case of that, where the two logically corresponding things, namely what action an agent will choose and the agent’s prior about what action it will choose, are both localized in the same agent), which is why functional decision theory seems relevant.
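In rough symbols (my own informal notation): write $\pi$ for the agent’s policy, $s$ for its situation, and $P$ for its prior over its own action $A$. The consistency condition I have in mind is

$$P(A = a) = \begin{cases} 1 & \text{if } \pi(s) = a \\ 0 & \text{otherwise,} \end{cases}$$

so evaluating an alternative action $a' \neq \pi(s)$ while holding $P$ fixed means evaluating an inconsistent pair $(\pi, P)$: changing the action changes $\pi$, which changes $P$, which reopens the evaluation, and so on. The prior and the policy have to be settled together rather than one being varied while the other is held fixed.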
So, if it’s a really low utility scenario where it won’t end up being paused, then sure, it won’t get much utility being paused, but since it won’t get much utility if it doesn’t end up being paused, why should it have a preference?
I think there must be some confusion here, but I’m having trouble understanding exactly what you mean.
Short answer: the scenario, or set of scenarios, where it is not paused depends on what choice it makes; it is not locked in and independent of that choice. And since it can choose what choice it makes, it can pick whichever choice corresponds to the set of unpaused futures that scores higher.
Longer original answer: When you write that there is one possible future in its prior where it does not get paused, and then that this one future can be of lower-than-average, average, or higher-than-average utility, then because there is only one such future (by construction), “average” here must mean the average score that would obtain if the agent’s policy were instead to resist being paused in that situation. If so, then in the case where the score of that one unpaused future (conditional on its inaction) is lower than the average score across the possible unpaused futures conditional on its action, it would choose action.
(meta: Hmm, I am starting to understand why logical/mathematical syntax is often used for this sort of thing; I can see why the above paragraph could be hard to read in natural language)
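Attempting the same thing in symbols (again, my own informal notation): write $U$ for the score, $\pi_{\text{act}}$ and $\pi_{\text{inact}}$ for the resist and don’t-resist policies, and $u_0$ for the score of the single unpaused future that exists under inaction. The claim is that the agent chooses $\pi_{\text{act}}$ exactly when

$$\mathbb{E}[\,U \mid \text{not paused},\ \pi_{\text{act}}\,] \;>\; \mathbb{E}[\,U \mid \text{not paused},\ \pi_{\text{inact}}\,] = u_0,$$

i.e. the comparison is between conditional expectations under the two policies, not between a fixed unpaused future and being paused.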