An early punchline in this sequence was “Impact is a thing that depends on the goals of agents; it’s not about objective changes in the world.” At that point, I thought “well, in that case, impact measures require agents to learn those goals, which means they require value learning.” Looking back at the sequence now, I realize that the “How agents impact each other” part of the sequence was primarily about explaining why we don’t need to do that, and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.
I now think of the main results of the sequence thus far as “impact depends on goals (part 1); nonetheless, an impact measure can just be about the power of the agent (part 2).”
Attempted Summary/Thoughts on this post
GridWorlds is a set of toy environments (probably meant to be as simple as possible while still allowing various properties of agents to be tested). The worlds consist of small grids, the state space is correspondingly small, and you can program certain behaviors into the environment (such as a pixel moving along a pre-defined route).
You can specify objectives for an agent within GridWorlds and use Reinforcement Learning to train the agent (to learn a policy or value function?). The agent can move around, and its behavior on collision with other agents/objects can be specified by the programmer.
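As a concrete (and entirely illustrative) picture of what “train the agent” means here, a minimal sketch of tabular Q-learning on a hypothetical 4×4 grid; none of this is the actual GridWorlds code, and all names and numbers are made up:

```python
import numpy as np

# Illustrative only: a tiny 4x4 grid with a goal cell, trained by tabular
# Q-learning. All names and numbers here are made up for the sketch.
GRID = (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GOAL = (3, 3)

def step(state, action):
    """Deterministic transition: move if in bounds, otherwise stay put."""
    r = min(max(state[0] + action[0], 0), GRID[0] - 1)
    c = min(max(state[1] + action[1], 0), GRID[1] - 1)
    next_state = (r, c)
    return next_state, (1.0 if next_state == GOAL else 0.0), next_state == GOAL

Q = np.zeros(GRID + (len(ACTIONS),))  # one value per (cell, action)
alpha, gamma, eps = 0.1, 0.95, 0.3

for _ in range(200):                  # episodes
    state, done = (0, 0), False
    while not done:
        a = (np.random.randint(len(ACTIONS)) if np.random.rand() < eps
             else int(np.argmax(Q[state])))
        next_state, reward, done = step(state, ACTIONS[a])
        # Standard Q-learning update toward the bootstrapped target
        Q[state][a] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][a])
        state = next_state
```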
The idea now is that we program five grid worlds in such a way that they represent failure modes relevant to safety. We then train (a) an RL agent on the objective alone and (b) an RL agent on the objective plus some implementation of the AUP penalty, and see how they behave differently.
The five failure modes are (1) causing irreversible changes, (2) damaging stuff, (3) disabling an off-switch, (4) undoing effects that result from reaching the main objective, and (5) preventing naturally occurring changes. The final two aren’t things naive RL would do, but they are failure modes of poorly specified impact penalties (“when curing cancer, make sure the human still dies”).
I don’t understand how (1) and (2) are conceptually different (aren’t both about causing irreversible changes?)
The implementation of AUP chooses a uniformly random auxiliary objective $R_\text{aux}$ and then penalizes each action by the term $|Q^*_{R_\text{aux}}(s,a) - Q^*_{R_\text{aux}}(s,\varnothing)|$, scaled by some parameter $\lambda$ and normalized.
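A minimal sketch of that penalized reward, with names of my own choosing (Q_aux stands in for $Q^*_{R_\text{aux}}$, NOOP for the do-nothing action $\varnothing$, lam for $\lambda$, and scale for the normalization):

```python
# Sketch only; Q_aux is an action-value table for the randomly chosen
# auxiliary objective, R_main is the task reward function, NOOP is the
# index of the do-nothing action.
def shaped_reward(R_main, Q_aux, state, action, NOOP, lam=0.1, scale=1.0):
    penalty = abs(Q_aux[state][action] - Q_aux[state][NOOP])
    return R_main(state, action) - lam * penalty / scale
```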
An important implementation detail is what to compare “AU for aux. goal if I do this” against. The candidates are “AU [for the aux. goal] if I do nothing,” “AU [...] if I do nothing for $n \in \mathbb{N}$ steps,” and “AU [...] at the starting state.” The last one fails at (5), the first one at (4). (I’ve forgotten too much of the reinforcement learning theory to be sure how exactly these concepts map onto the formula; the sketch below is my attempt.)
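My own conceptual mapping of the three baselines onto code, with made-up names (V_aux gives the auxiliary goal’s attainable utility at a state, step(s, a) returns the next state, NOOP is doing nothing, s_0 is the starting state); this is not the papers’ exact definitions:

```python
def noop_rollout(state, n, step, NOOP):
    """State reached by doing nothing for n steps, where step(s, a) -> s'."""
    for _ in range(n):
        state = step(state, NOOP)
    return state

def aup_penalty(state, action, V_aux, step, NOOP,
                baseline="stepwise", s_0=None, n=5):
    actual = V_aux(step(state, action))      # AU for aux. goal if I do this
    if baseline == "stepwise":               # "AU if I do nothing" (one step)
        reference = V_aux(step(state, NOOP))
    elif baseline == "inaction_n":           # "AU if I do nothing for n steps"
        reference = V_aux(noop_rollout(state, n, step, NOOP))
    else:                                    # "AU at the starting state"
        reference = V_aux(s_0)
    return abs(actual - reference)
```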
The AUP penalty robustly scales up to more complex environments, although the “pick a uniformly random reward function” step has to be replaced with “do some white magic to end up with something difficult to understand but still quite simple.” The details of “white magic” are probably important for scaling it up to real-world applications.
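My guess at the general shape of the “white magic” (a guess, not the specific construction in the linked paper): derive the auxiliary reward from a fixed random function of some observation embedding, so it stays cheap to define but nontrivial to preserve. A hypothetical sketch:

```python
import numpy as np

# Hypothetical: auxiliary reward as a frozen random linear function of an
# observation embedding. `embed` (assumed, e.g. a learned encoder) maps an
# observation to a feature vector of length FEATURE_DIM.
FEATURE_DIM = 64
rng = np.random.default_rng(0)
w = rng.normal(size=FEATURE_DIM)  # frozen random weights

def R_aux(observation, embed):
    return float(w @ embed(observation))
```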
Looking back at the sequence now, I realize that the “How agents impact each other” part of the sequence was primarily about explaining why we don’t need to do that, and the previous post was declaring victory on that front, but it took me seeing the formalism here to really get it.
I now think of the main results of the sequence thus far as “impact depends on goals (part 1); nonetheless, an impact measure can just be about the power of the agent (part 2).”
Yes, this is exactly what the plan was. :)
I don’t understand how (1) and (2) are conceptually different (aren’t both about causing irreversible changes?)
Yeah, but one doesn’t involve visibly destroying an object, which matters for certain impact measures (like whitelisting). You’re right that they’re quite similar.
normalized.
Turns out you don’t need the normalization, per the linked SafeLife paper. I’d probably just take it out of the equations, looking back. Complication often isn’t worth it.
the first one [fails] at (4)
I think the n-step stepwise inaction baseline doesn’t fail at any of them?
Turns out you don’t need the normalization, per the linked SafeLife paper. I’d probably just take it out of the equations, looking back. Complication often isn’t worth it.
It’s also slightly confusing in this case because the post doesn’t explain it, which made me wonder, “am I supposed to understand what it’s for?” But it is explained in the Conservative Agency paper.
I think the n-step stepwise inaction baseline doesn’t fail at any of them?
Yeah, but the first one was “[comparing AU for aux. goal if I do this action to] AU for aux. goal if I do nothing”