Then it won’t always be instrumentally convergent, depending on the environment in question. For Tic-Tac-Toe, there’s an exact proportionality in the limit of farsightedness (see theorem 46). In general, there’s a delicate interaction between control provided and probability which I don’t fully understand right now. However, we can easily bound how different these quantities can be; the constant depends on the distribution D we choose (it’s at most 2 for the uniform distribution). The formal explanation can be found in the proof of theorem 48, but I’ll try to give a quick overview.
The power calculation is the average attainable utility. This calculation breaks down into the weighted sum of the average attainable utility when Candy is best, the average attainable utility when Chocolate is best, and the average attainable utility when Hug is best; each term is weighted by the probability that its possibility is optimal.[1] Each term is the power contribution of a different possibility.[2]
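Spelled out for the example (just restating that decomposition, writing P for probability and E for expectation):

Power = P(Candy optimal) ⋅ E[attainable utility | Candy optimal] + P(Chocolate optimal) ⋅ E[attainable utility | Chocolate optimal] + P(Hug optimal) ⋅ E[attainable utility | Hug optimal].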
Let’s think about Candy’s contribution to the first (simple) example. First, how likely is Candy to be optimal? Well, each state has an equal chance of being optimal, so 1/3 of goals choose Candy. Next, given that Candy is optimal, how much reward do we expect to get? Learning that a possibility is optimal tells us something about its expected value. In this case, the expected reward is still 3/4; the higher this number is, the “happier” an agent is to have this as its optimal possibility.
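If it helps to see those two numbers concretely, here’s a quick Monte Carlo sketch. The setup (three possibilities whose rewards are drawn i.i.d. uniformly on [0, 1]) is my reading of the simple example, not code from the paper:

```python
import random

# Sketch of the simple example: three possibilities (Candy, Chocolate, Hug),
# each assigned an i.i.d. uniform-[0, 1] reward. We estimate (a) how often
# Candy is optimal and (b) the expected reward given that Candy is optimal.
N = 1_000_000
candy_optimal = 0
reward_when_optimal = 0.0

for _ in range(N):
    candy, chocolate, hug = random.random(), random.random(), random.random()
    if candy >= chocolate and candy >= hug:
        candy_optimal += 1
        reward_when_optimal += candy

print(candy_optimal / N)                    # ≈ 1/3: each possibility is equally likely to be best
print(reward_when_optimal / candy_optimal)  # ≈ 3/4: expected max of three uniform rewards
```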
In general,

power contribution of possibility = (% of goals choosing this possibility) ⋅ (average control).
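Plugging in the numbers from the example above: Candy’s power contribution is 1/3 ⋅ 3/4 = 1/4, and by symmetry Chocolate and Hug each contribute 1/4 as well, so the power comes out to 3/4.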
If the agent can “die” in an environment, more of its “ability to do things in general” comes from not dying at first. The idea is to follow where the power comes from, which lets us deduce things about instrumental convergence. Consider the power at a state. Maybe 99% of the power comes from the possibilities opened up by one move (like the move that avoids dying), and 1% comes from the rest. Part of this is because there are “more” goals which say to avoid dying at first, but part might also be that, conditional on not dying being optimal, agents tend to have more control.
power contribution = (% of goals) ⋅ (average control).
By analogy, imagine you’re collecting taxes. You have this weird system where each person has to pay at least 50¢, and pays no more than $1. The western half of your city pays $99 in total, while the eastern half pays $1. Obviously, there have to be more people living in this wild western portion – but you aren’t sure exactly how many more. Even so, you know that there are at least 99 people in the west (at $1 each), and at most 2 people in the east (at 50¢ each); so, there are at least 49.5 times as many people in the western half.
In the exact same way, the minimum possible average control is not doing better than chance (1/2 is the expected value of an arbitrary possibility), and the maximum possible is all agents being in heaven (a reward of 1 is maximal). So if 99% of the power comes from one move, then that move is at least 49.5 times as likely to be optimal as all the other moves combined.
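Making that arithmetic explicit (a sketch under the assumptions above: average control lies between 1/2 and 1 for the uniform distribution; the function name is mine):

```python
# Worst-case ratio of optimality probabilities implied by power shares,
# assuming power contribution = P(move optimal) * average control,
# with average control in [1/2, 1] (uniform reward distribution).
def min_probability_ratio(big_share: float, rest_share: float) -> float:
    min_prob_big = big_share / 1.0    # big contributor's control is at most 1
    max_prob_rest = rest_share / 0.5  # the rest's control is at least 1/2
    return min_prob_big / max_prob_rest

# 99% of the power from one move, 1% from all the others:
print(min_probability_ratio(0.99, 0.01))  # 49.5
```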
[1] opt(f,γ), in the terminology of the paper.

[2] Power(f,γ); see definition 9.
Thanks for this reply. In general, when I’m reading an explanation and come across a statement like “this means that...”, as in the above, if it’s not immediately obvious to me why, I find myself wondering whether I’m supposed to see why and am just missing something, or whether there’s a complicated explanation that’s being skipped.
In this case it sounds like there was a complicated explanation that was being skipped, and you did not expect readers to see why the statement was true. As a point of feedback: when that’s the case I appreciate when writers make note of that fact in the text (e.g. with a parenthetical saying, “To see why this is true, refer to theorem… in the paper.”).
Otherwise, I feel like I’ve just stopped understanding what’s being written, and it’s hard for me to stay engaged. If I know that something is not supposed to be obvious, then it’s easier for me to just mentally flag it as something I can return to later if I want, and keep going.