Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.
2.
It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.
So, assuming its some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent’s modified action space when making this computation for the agent to successfully reduce the penalty as in the earlier example.
Something agent A does must signal to E that A’s action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: “I will never do anything that gives me more than the minimum penalty”, or in terms of actions: “I will deterministically follow the policy that gives me the minimum penalty while achieving my goals.” E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.
1.
Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.
2.
It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.
So, assuming its some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent’s modified action space when making this computation for the agent to successfully reduce the penalty as in the earlier example.
Something agent A does must signal to E that A’s action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: “I will never do anything that gives me more than the minimum penalty”, or in terms of actions: “I will deterministically follow the policy that gives me the minimum penalty while achieving my goals.” E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.