This is awesome, thanks! I found the minesweeper analogy in particular super helpful.
4. Combining (4) and (5), $P_U(\text{PS}) \ge 2^{-(K_U(\phi) + O(1))} P_U(\text{NPS})$. QED.
Typo? Maybe you mean (2) and (3)?
More substantively, on the simplicity prior stuff:
There are the power-seeking functions and the non-power-seeking functions, but then there are also the inner-aligned functions, i.e. the ones that are in some sense trying their best to achieve the base objective, and also the human-aligned functions. Perhaps we can pretty straightforwardly argue that the power-seeking functions have much higher prior than those latter two categories?
Sketch:
The simplest function period is probably at least 20 bits simpler than the simplest inner-aligned function and the simplest human-aligned function. Therefore (given the previous theorems, and supposing their O(1) constant is, say, 15 bits) the simplest power-seeking function is at least 5 bits simpler than the simplest inner-aligned function and the simplest human-aligned function, and therefore the power-seeking functions are probably significantly more likely on priors than the inner-aligned and human-aligned functions.
Seems pretty plausible to me, no? Of course the constant isn’t really 15 bits, but… isn’t it plausible that the constant is smaller than the distance between the simplest possible function and the simplest inner-aligned one? At least for the messy, complex training environments that we realistically expect for these sorts of things? And obviously for human-aligned functions the case is even more straightforward...
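To make the arithmetic concrete, here's a toy Python sketch of the prior-ratio claim. It assumes the usual simplicity prior P(f) ∝ 2^(−K(f)), and the 20-bit and 15-bit figures are placeholders for the illustration, not measured quantities:

```python
# Toy illustration of the bit-gap argument under a simplicity prior
# P(f) proportional to 2^(-K(f)). The bit values below are placeholder
# assumptions for this sketch, not claims about any real training setup.

def prior_ratio(bit_gap: float) -> float:
    """Lower bound on P(simpler) / P(more complex) when the simpler
    function's description is `bit_gap` bits shorter."""
    return 2.0 ** bit_gap

gap_simplest_vs_aligned = 20  # assumed: simplest function is 20 bits
                              # simpler than the simplest aligned one
theorem_constant = 15         # assumed: the O(1) overhead from the theorems

# If the theorems' constant eats 15 of those 20 bits, the simplest
# power-seeking function is still 5 bits simpler than the simplest
# aligned one.
gap_ps_vs_aligned = gap_simplest_vs_aligned - theorem_constant  # 5 bits

print(prior_ratio(gap_ps_vs_aligned))  # 32.0 -- power-seeking gets at
# least ~32x the prior mass of the simplest aligned function
```

Even a small residual bit gap translates into a large multiplicative advantage in prior mass, since the ratio grows as 2^gap.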
I like the thought. I don’t know if this sketch works out, partly because I don’t fully understand it. Your conclusion seems plausible, but I want to develop the arguments further.
As a note: the simplest function period probably is the constant function, and other very simple functions probably make both power-seeking and not-power-seeking optimal. So if you permute that one, you’ll get another function for which power-seeking and not-power-seeking actions are both optimal.
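A minimal sketch of that point, in a made-up two-action toy setting (the action labels and reachable-state sets are purely illustrative):

```python
# Under the constant function -- plausibly the simplest function period --
# every action attains the same value, so the power-seeking and
# non-power-seeking actions are both optimal.

reachable = {
    "power_seeking": ["s1", "s2", "s3"],  # keeps many states reachable
    "non_power_seeking": ["s4"],          # commits to a single state
}

def constant_utility(state: str) -> float:
    """The simplest function period: ignores the state entirely."""
    return 1.0

# Value of each action: the best utility among the states it can reach.
values = {action: max(constant_utility(s) for s in states)
          for action, states in reachable.items()}

best = max(values.values())
print([a for a, v in values.items() if v == best])
# ['power_seeking', 'non_power_seeking'] -- both are optimal
```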
Oh interesting… so then what I need for my argument is not the simplest function period, but the simplest function that doesn’t make power-seeking and not-power-seeking both optimal? (Isn’t that probably just going to be the simplest function that doesn’t make everything optimal?)
I admit I am probably conceptually confused in a bunch of ways, I haven’t read your post closely yet.
I don’t yet understand the general case, but I have a strong hunch that instrumental convergence for optimal policies is governed by how many more ways there are for power to be optimal than not optimal.
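For what it's worth, here's a quick Monte Carlo sketch of that hunch in a toy setting of my own devising: the power-seeking action keeps three terminal states reachable, the alternative keeps one, and rewards are drawn i.i.d. uniform per state:

```python
import random

# Hunch: instrumental convergence for optimal policies is governed by
# counting -- how many reward functions make power optimal versus not.
# Toy setup (illustrative numbers): power-seeking reaches 3 terminal
# states, the narrow action reaches 1, rewards are i.i.d. uniform.

def power_seeking_is_optimal() -> bool:
    power_rewards = [random.random() for _ in range(3)]  # 3 reachable states
    narrow_reward = random.random()                      # 1 reachable state
    return max(power_rewards) >= narrow_reward

n = 100_000
frac = sum(power_seeking_is_optimal() for _ in range(n)) / n
print(frac)  # ~0.75: power is optimal for roughly 3x as many reward
# draws as it is non-optimal, matching the 3-vs-1 count of options
```

The 3/4 also falls out analytically: among four i.i.d. draws, the overall maximum lands in the power-seeking set with probability 3/4.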