Interesting work! Seems closely related to this recent paper from Satinder Singh’s lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?
I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent “does nothing”), e.g. by simulating the environment or using a causal model of the environment.
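Concretely, here is a minimal sketch (in Python, with hypothetical names) of what the inaction-baseline penalty could look like, assuming we have a simulator for the "do nothing" rollout and a whitelist of allowed transitions:

```python
from collections import Counter
from typing import Callable, Hashable, List, Set, Tuple

# A transition is "an object of class A became an object of class B",
# following the whitelisting post's framing. All names here are hypothetical.
Transition = Tuple[Hashable, Hashable]


def inaction_baseline_penalty(
    observed: List[Transition],
    simulate_noop: Callable[[], List[Transition]],
    whitelist: Set[Transition],
    penalty_per_effect: float = 1.0,
) -> float:
    """Penalize non-whitelisted transitions only to the extent that they exceed
    what the default 'do nothing' rollout would have produced anyway."""
    baseline = Counter(simulate_noop())          # transitions under the no-op policy
    excess = Counter(observed) - baseline        # transitions attributable to the agent
    n_bad = sum(count for t, count in excess.items() if t not in whitelist)
    return penalty_per_effect * n_bad
```

The key design choice is that only the agent-attributable excess over the baseline is penalized, so irreversible processes that would have happened anyway contribute nothing to the penalty.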
I’m not convinced that whitelisting avoids the offsetting problem: “Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.” I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.
The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.
That’s pretty cool that another group had a whitelisting-ish approach! They also guarantee that (their version of) the whitelist is obeyed, which is nice.
What are the main advantages of your approach over their approach?
As I understand it, my formulation has the following advantages:
Automatically deduces what effects are, while making weaker assumptions (like the ability to segment the world into discrete objects and maintain beliefs about their identities).
In contrast, their approach requires complete enumeration of side effects. Also, they seem to require the user to specify all effects, yet claim that specifying whether each effect is good is too expensive, even though labeling each of the |Φ| specified effects is only a constant factor of extra work: O(2|Φ|) = O(|Φ|).
It’s unclear how to apply their feature decomposition to the real world. For example, if it’s OK for an interior door to be opened, how does the agent figure out if the door is open outside of toy MDPs? What about unlocked but closed, or just slightly ajar? Where is the line drawn?
The number of features and values seems to grow extremely quickly with the complexity of the environment, which gets us back to the “no compact enumeration” problem.
Doesn’t require a complete model.
To be fair, calculating the counterfactual in my formulation requires at least a reasonably-accurate model.
Works in stochastic environments.
It is possible that their approach could be expanded to do so, but it is not immediately clear to me how.
Uses counterfactual reasoning to provide a limited reduction in clinginess.
Can trade off incurring an unknown effect against large amounts of reward. Their approach would do literally anything to prevent unknown effects.
Can act even when no perfect outcome is attainable.
Has plausibly-negligible performance overhead.
Can be used with a broad(er) class of RL approaches, since any given whitelist W implicitly defines a new reward function R̄(s, a, s′) := R(s, a, s′) − Penalty_W(s, s′); see the sketch after this list.
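To illustrate that last point, here is a minimal sketch (hypothetical names; the Penalty_W term is abstracted behind a callable rather than spelled out):

```python
from typing import Callable, Hashable

State = Hashable
Action = Hashable


def make_whitelisted_reward(
    reward: Callable[[State, Action, State], float],
    penalty_w: Callable[[State, State], float],
) -> Callable[[State, Action, State], float]:
    """Return the shaped reward R̄(s, a, s') := R(s, a, s') - Penalty_W(s, s')."""
    def shaped_reward(s: State, a: Action, s_next: State) -> float:
        return reward(s, a, s_next) - penalty_w(s, s_next)
    return shaped_reward
```

Since the penalty lives entirely inside the reward signal, an off-the-shelf learner can optimize the shaped reward without modification.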
It might be possible to get around [clinginess] by using an inaction baseline
Yes, but I think a basic counterfactual wouldn’t quite get us there: if people cause different side effects because they’re reacting to the agent’s behavior, those effects would still be penalized. My position right now is that either 1) we can do really clever things with counterfactuals to cleanly isolate the agent’s effects, 2) the responsibility problem is non-trivially entangled with our ethics and doesn’t admit a compact solution, or 3) there’s another solution I have yet to find. I’m not putting much probability mass on 2) (yet).
In fact, clinginess stems from an incorrect assumption. In a single-agent environment, whitelisting seems pretty close to what we want at any capability level: it finds a plan that trades off maximizing the original reward against minimizing side effects relative to an inaction baseline. However, if there are other agents in the environment whose autonomy is important, then whitelisting doesn’t account for the fact that restricting their autonomy is also a side effect. The inaction counterfactual ignores that we want to remain free to react to the agent’s actions, not just to act as we would have in the agent’s absence.
I’ve been exploring solutions to clinginess via counterfactuals from first principles, and might write about it in the near future.
I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal
So I think the main issue here is still clinginess; if the agent isn’t clingy, the problem goes away.
Suppose people can be dead, sick, or healthy, where the sick people die if the agent does nothing. While curing people is presumably whitelisted (or could be queried as an addition to the whitelist), curing people means more people means more broken things in expectation (since people might have non-whitelisted side effects). If we have a non-clingy formulation, this is fine; if not, this would indeed be optimized against. In fact, we’d probably just get a bunch of people cured of their cancer and put in stasis.