I want to show an example that seems interesting for evaluating, and potentially tweaking/improving, the current informal definition.
Consider an MDP with N states, where s_1 is the initial state; from each s_i, one action goes back to s_1 and another action goes to s_{i+1} (what happens at s_N is not important for what follows). Consider two reward functions that are both zero everywhere, except for one state with reward 1: s_k for the first function and s_{k+1} for the second, for some k.
It’s interesting (problematic?) that two agents, π_1 trained on the first reward function and π_2 on the second, have similar policies but different goals (defined as sets of states). Specifically, I expect that d(π_1, Pol_2)/N → 0 as N → ∞ (for various possible choices of k and different ways of defining the distance d). In words: relative to the environment size, the first agent is extremely close to Pol_2, and vice versa, yet the two agents have different goals.
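To make this concrete, here is a minimal sketch (mine, purely for illustration, not from the post) of one way to check the claim numerically: build the chain MDP above, extract greedy policies from value iteration for the two reward functions, and take d to be the fraction of states on which the policies disagree. The function name, the discount factor, and this particular choice of d are all assumptions made for the sketch.

import numpy as np

def optimal_policy(N, rewarded_state, gamma=0.95, iters=2000):
    # States 0..N-1 play the role of s_1..s_N.
    # Action 0 resets to the initial state, action 1 steps forward
    # (the last state just stays put under "forward", which does not matter here).
    reset_to = np.zeros(N, dtype=int)
    forward_to = np.minimum(np.arange(N) + 1, N - 1)
    reward = lambda nxt: (nxt == rewarded_state).astype(float)
    V = np.zeros(N)
    for _ in range(iters):
        q_reset = reward(reset_to) + gamma * V[reset_to]
        q_forward = reward(forward_to) + gamma * V[forward_to]
        V = np.maximum(q_reset, q_forward)
    # Greedy policy w.r.t. the (approximately) converged values: 1 = forward, 0 = reset.
    q_reset = reward(reset_to) + gamma * V[reset_to]
    q_forward = reward(forward_to) + gamma * V[forward_to]
    return (q_forward > q_reset).astype(int)

for N in (10, 100, 1000):
    k = N // 2
    pi1 = optimal_policy(N, rewarded_state=k)      # reward 1 only at one state
    pi2 = optimal_policy(N, rewarded_state=k + 1)  # reward 1 only at the next state
    print(N, np.mean(pi1 != pi2))                  # fraction of disagreement, roughly 1/N

With this choice of d, the printed disagreement is 1/N, since the greedy policies differ only at the single state where the two reward functions disagree about whether to reset.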
Maybe this is not a problem at all: it could simply indicate that there is a meaningful sense in which the two goals are similar.
I think one very important thing you are pointing out is that I did not mention the impact of the environment. After all, to train using RL there must be some underlying environment, even if only as a sample model. This opens up a lot of questions:
What happens if the actual environment is known both to the RL process and to the system whose focus we are computing?
What happens when there is uncertainty over the environment?
Given an environment, for which goals is focus entangled (your example, basically: high focus on one implies high focus on the other)?
As for your specific example, I assume the distance converges to 0 because, intuitively, the only difference lies in the action at state s_k (go back to s_1 under the first reward function, step forward under the second), and this state accounts for a smaller and smaller fraction of the environment as N goes to infinity.
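To spell out that intuition under one concrete (assumed) choice of d, namely the fraction of states on which the two policies pick different actions: the policies agree everywhere except at s_k, so d(π_1, π_2) = |{s : π_1(s) ≠ π_2(s)}| / N = 1/N → 0 as N → ∞.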
This seems like a perfect example of two distinct goals with almost maximal focus and similar triviality. As mentioned in the post, I don’t have a clear-cut intuition about what to do here. Perhaps the conclusion is that we simply cannot distinguish between the two goals in terms of behavior.