I still don’t see how this works. The “small constant” here is actually the length of a program that needs to contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation). So it’s not a constant; it’s an unbounded integer.
Even if we restrict the discussion to a given very-simple-MDP, the program needs to contain way more than 100 bits (just to represent the MDP + the logic that checks whether a given permutation satisfies the relevant condition). So the probability of the POWER-seeking reward functions that are constructed here is smaller than 2^{-100} of the probability of the non-POWER-seeking reward functions. [EDIT: I mean, the probability of the constructed reward functions may happen to be larger, but the proof sketch doesn’t show that it is.]
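To spell out the arithmetic behind that factor (a sketch assuming a standard length-based simplicity prior, under which a program of length ℓ gets weight proportional to 2^{-ℓ}; the discussion doesn't fix a specific prior): if the constructing program is at least 100 bits longer than a competing program, its prior weight is smaller by at least

$$\frac{2^{-(\ell + 100)}}{2^{-\ell}} = 2^{-100}.$$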
(As an aside, the permutations that we’re dealing with here are equal to their own inverse, so it’s not useful to apply them multiple times.)
contain a representation of the entire MDP (because the program needs to simulate the MDP for each possible permutation)
We aren’t talking about MDPs, we’re talking about a broad class of environments which are represented via joint probability distributions over actions and observations. See post title.
it’s an unbounded integer.
I don’t follow.
the program needs to contain way more than 100 bits
See the arguments in the post Rohin linked for why this argument is gesturing at something useful even if it takes some more bits.
But IMO the basic idea in this case is, you can construct reasonably simple utility functions like “utility 1 if history has the agent taking action a_i at time step t given action-observation history prefix h_{1:t-1}, and 0 otherwise.” This is reasonably short, and you can apply it for all actions and time steps.
Sure, the complexity will vary a little bit (probably later time steps will be more complex), but basically you can produce reasonably simple programs which make any sequence of actions optimal. And so I agree with Rohin that simplicity priors on u-AOH will quantitatively, but not qualitatively, affect the conclusions for the generic u-AOH case. [EDIT: this reasoning is different from the one Rohin gave, TBC]
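To make that construction concrete, here is a minimal sketch (my own formalization; representing histories as sequences of (action, observation) pairs and the helper name are assumptions, not from the post) of such an indicator utility function:

```python
# Minimal sketch of an indicator utility over action-observation histories:
# u(history) = 1 iff the agent takes `action` at step t after the given
# action-observation prefix h_{1:t-1}, else 0. (Representation assumed here:
# a history is a sequence of (action, observation) pairs, 0-indexed.)

def make_indicator_utility(prefix, action, t):
    def u(history):
        if len(history) <= t:
            return 0
        if tuple(history[:t]) != tuple(prefix):
            return 0
        return 1 if history[t][0] == action else 0
    return u

# Example: utility 1 exactly when the very first action is "a1".
u = make_indicator_utility(prefix=(), action="a1", t=0)
print(u([("a1", "o1"), ("a2", "o2")]))  # 1
print(u([("a2", "o1")]))                # 0
```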
(As an aside, the permutations that we’re dealing with here are equal to their own inverse, so it’s not useful to apply them multiple times.)
Yes, good point. I retract the original claim.
(You’re right, what you do here is you search for the kth permutation satisfying the theorem’s requirements, where k is the specified number.)
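A minimal sketch of that search (assuming a fixed canonical enumeration order; the condition check below is a placeholder for the theorem's actual requirements, which would involve simulating the environment):

```python
# Sketch of "search for the k-th permutation satisfying the requirements":
# enumerate permutations in a canonical (lexicographic) order and return the
# k-th one that passes the check.
from itertools import permutations

def satisfies_requirements(perm):
    # Placeholder: stands in for the theorem's actual condition, which would
    # require simulating the environment.
    return True

def kth_satisfying_permutation(elements, k):
    count = 0
    for perm in permutations(elements):
        if satisfies_requirements(perm):
            count += 1
            if count == k:
                return perm
    raise ValueError("fewer than k satisfying permutations")

# Example with the trivial placeholder check:
print(kth_satisfying_permutation((0, 1, 2), 3))  # (1, 0, 2)
```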