Typically the theorems say things like “for every reward function there exists N other reward functions that <seek power, avoid dying, etc>”; the theorems aren’t themselves talking about probability distributions.
The first example you give does only make sense given a uniform distribution, I believe.
In the second example, when Alex uses the phrase “tends to”, it means something like “there are N times as many reward functions that <do the thing> as reward functions that <don’t do the thing>”. But yes, if you want to interpret this as a probabilistic statement, you would use a uniform distribution.
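To make the counting interpretation concrete, here is a toy sketch of my own (not the setup from the post): the agent either goes LEFT to terminal state A, or goes RIGHT and can then reach any of B, C, D. Under a uniform distribution over reward assignments, reward functions that make RIGHT (the side with more options) strictly optimal outnumber those that make LEFT strictly optimal, and the orbit argument gives a ratio of at least 3, since swapping A’s reward with B’s, C’s, or D’s turns any LEFT-optimal reward function into a distinct RIGHT-optimal one:

```python
from itertools import product

states = ["A", "B", "C", "D"]
left_better = right_better = 0

# Uniform distribution over reward functions with values in {0, 1, 2, 3}.
for values in product(range(4), repeat=4):
    reward = dict(zip(states, values))
    best_right = max(reward[s] for s in "BCD")
    if reward["A"] > best_right:
        left_better += 1    # the smaller option set is strictly optimal
    elif best_right > reward["A"]:
        right_better += 1   # the larger option set is strictly optimal

print(left_better, right_better)  # 36 156
```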
If instead we used a Kolmogorov prior, or a minimal-circuit prior, or something else more realistic, would these results go out the window?
EDIT: This next paragraph is wrong; I’m leaving it in for the record. I still think it is likely that similar results would hold in practice, though the case is not as strong as I thought (and the effect is probably also not as strong as I thought).
They would change quantitatively, but the upshot would probably be similar. For example, for the Kolmogorov prior, you could prove theorems like “for every reward function that <doesn’t do the thing>, there are N reward functions that <do the thing> that each have at most a small constant more complexity” (since you can construct them by taking the original reward function and then applying the relevant permutation / moving through the orbit, and that second step has constant K-complexity). Alex sketches out a similar argument in this post.
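For reference, here is the inequality this (since-retracted) paragraph is gesturing at, written as a sketch with sigma the permutation from the theorem and a prior P(R) proportional to 2^(-K(R)); whether the extra terms are actually small is exactly what the rest of the thread disputes:

```latex
% Sketch only: \sigma is the permutation from the theorem, P(R) \propto 2^{-K(R)}.
K(R \circ \sigma) \;\le\; K(R) + K(\sigma) + O(1)
\quad\Longrightarrow\quad
\frac{P(R \circ \sigma)}{P(R)} \;\ge\; 2^{-K(\sigma) - O(1)}.
```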
I don’t see how this works. If you need n bits to specify the permutation for that “second step”, the probability of each of the N reward functions will be smaller than the original one’s by a factor of 2^(-n). So you need n to be smaller than log_2(N), which is impossible?
So you need n to be smaller than log_2(N), which is impossible?
You’re right, in my previous comment it should be “at most a small constant more complexity + O(log N)”, to specify the number of times that the permutation has to be applied.
I still think the upshot is similar; in this case it’s “power-seeking is at best a little less probable than non-power-seeking, but you should really not expect that the bound is tight”.
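To spell out why this gives “at best a little less probable”: suppose (as the sketch assumes, and as is disputed below) that each power-seeking variant R_k can be computed from R by a fixed routine of c bits plus an index k ≤ N. Then, under P(R) proportional to 2^(-K(R)):

```latex
% Sketch only: c is the description length of the permutation-search routine.
K(R_k) \;\le\; K(R) + c + \log_2 N + O(1)
\quad\Longrightarrow\quad
\sum_{k=1}^{N} P(R_k) \;\ge\; N \cdot 2^{-c - \log_2 N - O(1)} \, P(R) \;=\; 2^{-c - O(1)} \, P(R).
```

So the power-seeking variants collectively get at least roughly a 2^(-c) fraction of the original reward function’s prior mass, rather than the factor-of-N advantage the uniform counting gives; how large c really is, is what the rest of the thread disputes.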
I still don’t see how this works. The “small constant” here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it’s not a constant; it’s an unbounded integer.
Even if we restrict the discussion to a given very simple MDP, the program needs to contain way more than 100 bits (just to represent the MDP, plus the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward functions that are constructed here is smaller than 2^(-100) times the probability of the non-POWER-seeking reward functions. [EDIT: I mean, the probability of the constructed reward functions may happen to be larger, but the proof sketch doesn’t show that it is.]
(As an aside, the permutations that we’re dealing with here are equal to their own inverse, so it’s not useful to apply them multiple times.)
contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation)
We aren’t talking about MDPs; we’re talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title.
it’s an unbounded integer.
I don’t follow.
the program needs to contain way more than 100 bits
See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits.
But IMO the basic idea in this case is that you can construct reasonably simple utility functions like “utility 1 if the history has the agent taking action a_i at time step t given action-observation history prefix h_{1:t-1}, and 0 otherwise.” This is reasonably short, and you can apply it for all actions and time steps; a minimal sketch is below.
Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively, but not qualitatively, affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]
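A minimal sketch of that kind of indicator utility over action-observation histories (the names and the list-of-pairs encoding are my own, purely for illustration; steps are 0-indexed here rather than the 1-indexed h_{1:t-1} above):

```python
from typing import List, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]  # a history is a list of (action, observation) pairs

def make_indicator_utility(h_prefix: History, t: int, a_i: Action):
    """u(h) = 1 iff h matches h_prefix up to step t-1 and takes action a_i at step t, else 0."""
    def u(history: History) -> int:
        if len(history) <= t:
            return 0
        prefix_matches = history[:t] == h_prefix[:t]
        return 1 if prefix_matches and history[t][0] == a_i else 0
    return u

# Usage: make "right" at step 2 (after a particular two-step prefix) the rewarded action.
u = make_indicator_utility([("left", "o1"), ("left", "o2")], t=2, a_i="right")
print(u([("left", "o1"), ("left", "o2"), ("right", "o3")]))  # 1
print(u([("left", "o1"), ("left", "o2"), ("left", "o3")]))   # 0
```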
(As an aside, the permutations that we’re dealing with here are equal to their own inverse, so it’s not useful to apply them multiple times.)
Yes, good point. I retract the original claim.
(You’re right; what you do here is search for the k-th permutation satisfying the theorem’s requirements, where k is the specified number.)
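For concreteness, a sketch of the “search for the k-th permutation” construction (the predicate `satisfies_condition` is a stand-in for whatever condition the theorem imposes; as the objection above notes, a real program would also have to encode the environment that this predicate inspects, and specifying k itself costs about log_2(N) bits):

```python
from itertools import permutations
from typing import Callable, Dict, Sequence, Tuple

def kth_valid_permutation(states: Sequence[str],
                          satisfies_condition: Callable[[Tuple[str, ...]], bool],
                          k: int) -> Tuple[str, ...]:
    """Enumerate permutations of `states` in a fixed (lexicographic) order and
    return the k-th one (0-indexed) that satisfies the theorem's condition."""
    count = 0
    for perm in permutations(states):
        if satisfies_condition(perm):
            if count == k:
                return perm
            count += 1
    raise ValueError("fewer than k+1 permutations satisfy the condition")

def permuted_reward(reward: Dict[str, float], states: Sequence[str],
                    perm: Tuple[str, ...]) -> Dict[str, float]:
    """The reward function 'moved through the orbit': R'(s) = R(sigma(s))."""
    sigma = dict(zip(states, perm))
    return {s: reward[sigma[s]] for s in states}
```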