Lastly, I pick a distance between policies. If the two policies are deterministic, a Hamming distance will do. If they are stochastic, perhaps a distance built from the Kullback-Leibler divergence between their action distributions at each state.
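To make this concrete, here is a minimal sketch of my own (not from the post), assuming a finite state space: a deterministic policy is an array of actions indexed by state, and a stochastic policy is an `(n_states, n_actions)` array of action probabilities. The symmetrization of the KL divergence is my choice, so that the result is at least symmetric.

```python
import numpy as np

def hamming_distance(pi_a, pi_b):
    """Fraction of states on which two deterministic policies disagree."""
    pi_a, pi_b = np.asarray(pi_a), np.asarray(pi_b)
    return np.mean(pi_a != pi_b)

def kl_distance(pi_a, pi_b, eps=1e-12):
    """Mean symmetrized KL divergence between two stochastic policies,
    each an (n_states, n_actions) array of action probabilities.
    eps avoids log(0) for zero-probability actions."""
    p = np.asarray(pi_a) + eps
    q = np.asarray(pi_b) + eps
    kl_pq = np.sum(p * np.log(p / q), axis=1)  # KL(p || q) per state
    kl_qp = np.sum(q * np.log(q / p), axis=1)  # KL(q || p) per state
    return np.mean(kl_pq + kl_qp)

# Two deterministic policies over 4 states:
print(hamming_distance([0, 1, 2, 0], [0, 1, 1, 0]))  # 0.25

# Two stochastic policies over 2 states, 2 actions:
a = np.array([[0.9, 0.1], [0.5, 0.5]])
b = np.array([[0.8, 0.2], [0.5, 0.5]])
print(kl_distance(a, b))
```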
I think it might actually be very difficult to come up with a distance metric between policies that corresponds even reasonably well to behavioral similarity. I imagine that flipping the sign on a single crucial parameter in a neural net could completely change its behavior, or at least break it badly enough that it goes from highly goal-oriented behavior to random or chaotic behavior.
By analogy, imagine trying to come up with a distance metric between Python source files that captures behavioral similarity. Very subtle changes to source code can completely alter behavior, while drastic refactorings can leave behavior unchanged.
Ultimately we'd like to be able to handle cases where the network architecture permits arbitrary Turing machines to emerge as policies, in which case determining behavioral similarity by comparing source code is as hard as the halting problem.
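To spell out that reduction, here is a sketch of my own (not from the comment), where a "program" is modeled as a Python generator so it can be stepped one instruction at a time, and a policy maps a state (a step budget n) to an action:

```python
def run_for_steps(program, max_steps):
    """Advance a generator-based 'program' by at most max_steps steps;
    report whether it halted within the budget."""
    gen = program()
    try:
        for _ in range(max_steps):
            next(gen)
    except StopIteration:
        return "halted"
    return "running"

def make_probe_policy(program):
    """Policy over states n = 0, 1, 2, ...: action is 1 iff `program`
    halts within n steps, else 0."""
    return lambda n: 1 if run_for_steps(program, n) == "halted" else 0

def always_zero(n):
    return 0

# Example program that halts after 3 steps:
def short_program():
    for _ in range(3):
        yield

probe = make_probe_policy(short_program)
print([probe(n) for n in range(6)])  # [0, 0, 0, 0, 1, 1]
```

The probe policy behaves identically to `always_zero` exactly when the program never halts, so any procedure that decides "do these two policies take the same action in every state?" would also decide the halting problem.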
Sorry for the delay in answering.
In this post, I assume that a policy is a description of its behavior (like a function from state to action, or to a distribution over actions), and thus the distances mentioned do capture behavioral similarity. That said, you're right that an analogous notion of distance over the internal structure of the policies would prove difficult, eventually running up against uncomputability.
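As a small illustration of my own of why structural distance and behavioral distance come apart: permuting the hidden units of a network (together with the matching weights) moves it far away in parameter space while leaving its input-output behavior, and hence any behavioral distance, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network: f(x) = W2 @ tanh(W1 @ x)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

# Permute the hidden units: rows of W1 and columns of W2 together.
perm = rng.permutation(8)
W1p, W2p = W1[perm], W2[:, perm]

x = rng.normal(size=4)
print(np.allclose(forward(W1, W2, x), forward(W1p, W2p, x)))  # True: identical behavior
print(np.linalg.norm(W1 - W1p))  # large: far apart in parameter space
```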