see also: “optimal policies tend to take actions which strictly preserve optionality*”
Does this quote refer to a passage from the paper? (I didn’t find it.)
It certainly has some kind of effect, but I don’t find it obvious that it has the effect you’re seeking—there are many simple ways of specifying action-history+state reward functions, which rely on the action-history and not just the rest of the state.
Relative to all the reward functions that rely on action history, very few can be specified in a simple way (you need at least 2^n bits to specify a reward function that considers n actions, when using a uniform prior). Also, I don’t think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.
What’s special is that (by assumption) the action logger always logs the agent’s actions, even if the agent has been literally blown up in-universe. That wouldn’t occur with the security camera. With the security camera, once the agent is dead, the agent can no longer influence the trajectory, and the normal death-avoiding arguments apply. But your action logger supernaturally writes a log of the agent’s actions into the environment.
If we assume that the action logger can always “detect” the action that the agent chooses, this issue doesn’t apply. (Instead of the agent being “dead”, we can simply imagine that the robot/actuators are in a box and can’t influence anything outside the box, which is functionally equivalent to being “dead” if the box is a sufficiently small fraction of the environment.)
Right, but if you want the optimal policies to take actions a1,...,ak, then write a reward function which returns 1 iff the action-logger begins with those actions and 0 otherwise. Therefore, it’s extremely easy to incentivize arbitrary action sequences.
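To make the construction above concrete, here is a minimal sketch of such a reward function. The names `make_prefix_reward` and `action_log`, and the string-valued actions, are illustrative assumptions rather than anything from the paper; the only idea taken from the discussion is "return 1 iff the action log begins with the target actions, and 0 otherwise".

```python
from typing import Callable, Sequence

Action = str  # hypothetical action type, chosen only for illustration


def make_prefix_reward(target: Sequence[Action]) -> Callable[[Sequence[Action]], int]:
    """Build a reward over action logs: 1 iff the log begins with `target`, else 0."""
    target = tuple(target)

    def reward(action_log: Sequence[Action]) -> int:
        # Compare only the first len(target) logged actions with the target prefix.
        # A log shorter than the target cannot match, so it gets reward 0.
        return int(tuple(action_log[: len(target)]) == target)

    return reward


reward = make_prefix_reward(["up", "left", "left"])
print(reward(["up", "left", "left", "down"]))  # 1
print(reward(["up", "right"]))                 # 0
```

Specifying this reward takes about as much information as the target sequence itself, which is the sense in which arbitrary action sequences are easy to incentivize once the log is part of the state.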
Sure, but I still don’t understand the argument here. It’s trivial to write a reward function that doesn’t yield instrumental convergence regardless of whether one can infer the complete action history from every reachable state. Every constant function is such a reward function.
I don’t think that the action log is special in this context relative to any other object that constitutes a tiny part of the environment.
It isn’t the size of the object that matters here, the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.
Sure, but I still don’t understand the argument here. It’s trivial to write a reward function that doesn’t yield instrumental convergence regardless of whether one can infer the complete action history from every reachable state. Every constant function is such a reward function.
Sure. Here’s what I said:
how easy is it to write down state-based utility functions which do the same? I guess there’s the one that maximally values dying. What else? While more examples probably exist, it seems clear that they’re much harder to come by [than in the action-history case].
The broader claim I was trying to make was not “it’s hard to write down any state-based reward functions that don’t incentivize power-seeking”, it was that there are fewer qualitatively distinct ways to do it in the state-based case. In particular, it’s hard to write down state-based reward functions which incentivize any given sequence of actions:
when your reward depends on your action history, this is strictly more expressive than state-based reward—so expressive that it becomes easy to directly incentivize any sequence of actions via the reward function. And thus, instrumental convergence disappears for “most objectives.”
If you disagree, then try writing down a state-based reward function for e.g. Pacman for which an optimal policy starts off by (EDIT: circling the level counterclockwise) (at a discount rate close to 1). Such reward functions provably exist, but they seem harder to specify in general.
Also: thanks for your engagement, but I still feel like my points aren’t landing (which isn’t necessarily your fault or anything), and I don’t want to put more time into this right now. Of course, you can still reply, but just know I might not reply and that won’t be anything personal.
EDIT: FYI I find your action-camera example interesting. Thank you for pointing that out.
Consider adding to the paper a high-level/simplified description of the environments for which the following sentence from the abstract applies: “We prove that for most prior beliefs one might have about the agent’s reward function [...] one should expect optimal policies to seek power in these environments.” (If it’s the set of environments in which “the “vast majority” of RSDs are only reachable by following a subset of policies” consider clarifying that in the paper). It’s hard (at least for me) to infer that from the formal theorems/definitions.
It isn’t the size of the object that matters here, the key considerations are structural. In this unrolled model, the unrolled state factors into the (action history) and the (world state). This is not true in general for other parts of the environment.
My “unrolling trick” argument doesn’t require an easy way to factor states into [action history] and [the rest of the state from which the action history can’t be inferred]. A sufficient condition for my argument is that the complete action history can be inferred from every reachable state. When this condition holds, the environment implicitly contains an action log (for the purpose of my argument), and thus the POWER (IID) of all the states is equal. And as I’ve argued before, this condition seems plausible for sufficiently complex real-world-like environments. BTW, any deterministic time-reversible environment satisfies this condition, except for cases where multiple actions can yield the same state transition (in which case we may not be able to infer which of those actions was chosen at the relevant time step).
It’s easier to find reward functions that incentivize a given action sequence if the complete action history can be inferred from every reachable state (and the easiness depends on how easy it is to compute the action history from the state). I don’t see how this fact relates to instrumental convergence supposedly disappearing for “most objectives” [EDIT: when using a simplicity prior over objectives; otherwise, instrumental convergence may not apply regardless]. Generally, if an action log constitutes a tiny fraction of the environment, its existence shouldn’t affect properties of “most objectives” (regardless of whether we use the uniform prior or a simplicity prior).
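As a rough illustration of the unrolled model discussed in this thread, the sketch below builds unrolled states of the form (world state, action history) for a toy deterministic environment. The environment (`step`, with an absorbing "dead" state) and the names `ACTIONS` and `unroll` are hypothetical and chosen only for illustration. The sketch does not compute POWER; it only shows that every action sequence reaches a distinct unrolled state, even after the world state has collapsed to "dead".

```python
from itertools import product

ACTIONS = ("stay", "go")  # hypothetical two-action environment


def step(world_state: str, action: str) -> str:
    """Toy deterministic dynamics: 'go' kills the agent, and 'dead' is absorbing."""
    if world_state == "dead" or action == "go":
        return "dead"
    return "alive"


def unroll(start: str, horizon: int) -> dict:
    """Map each action history of length `horizon` to its unrolled state."""
    states = {}
    for history in product(ACTIONS, repeat=horizon):
        world = start
        for action in history:
            world = step(world, action)
        # The unrolled state carries the full action history alongside the world state.
        states[history] = (world, history)
    return states


unrolled = unroll("alive", 3)
world_only = {world for (world, _) in unrolled.values()}

print(len(set(unrolled.values())))  # 8 distinct unrolled states (2 ** 3)
print(sorted(world_only))           # ['alive', 'dead']
```

In the world-state-only view the agent can reach just two states, while in the unrolled view it can reach one state per action sequence. That is the structural difference the thread is arguing about.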
Not from the paper. I just wrote it.
Ditto :)