This looks like a great paper with great results! Congrats on getting accepted at NeurIPS!
On the more technical side, just from what’s written here, it seems to me that this method probably cannot deal with very small but impactful changes: the autoencoder will probably not pick up that specific detail, so the Q-values for the corresponding goal will not differ enough to create a big penalty. This could be a problem in situations where, for example, there are a lot of people in the world, but you still don’t want to kill one of them.
Does this make sense to you?
Thanks!
I don’t know what the autoencoder’s doing well enough to make a prediction there, other than the baseline prediction that “smaller changes to the agent’s set of attainable utilities are harder to detect.” I think a bigger problem will be spatial distance: in a free-ranging robotics task, if the agent has a big impact on something a mile away, maybe that’s unlikely to show up in any of the auxiliary value estimates and so it’s unlikely to be penalized.
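To make the "smaller changes are harder to detect" worry concrete, here is a toy sketch. Everything in it is an assumption for illustration, not the paper's architecture: `encode` is a hypothetical lossy encoder that keeps only the fraction of "alive" cells, `aux_value` is a hypothetical linear auxiliary value estimate, and the penalty is taken to be the absolute difference in that value between acting and doing nothing.

```python
def encode(state):
    """Hypothetical lossy encoder: keeps only the fraction of 'alive' cells."""
    return sum(state) / len(state)


def aux_value(code, weight=1.0):
    """Hypothetical auxiliary value estimate, linear in the encoding."""
    return weight * code


def penalty(state_after_action, state_after_noop):
    """Absolute difference in auxiliary value between acting and not acting."""
    return abs(aux_value(encode(state_after_action))
               - aux_value(encode(state_after_noop)))


def remove_one(state):
    """The 'small but impactful' action: flip a single cell from 1 to 0."""
    new_state = list(state)
    new_state[0] = 0
    return new_state


small_world = [1] * 10
large_world = [1] * 10_000

p_small = penalty(remove_one(small_world), small_world)
p_large = penalty(remove_one(large_world), large_world)
print(f"penalty with 10 cells:     {p_small:.6f}")
print(f"penalty with 10,000 cells: {p_large:.6f}")
```

Under these assumptions the same one-cell change is penalized about a thousand times less in the larger world, which is the dilution effect being discussed. A real learned encoder could behave better or worse than this averaging stand-in.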
What if the encoding difference penalty were applied after a counterfactual rollout of no-ops after the candidate action or no-op? Couldn’t that detect “butterfly effects” of small impactful actions, avoiding “salami slicing” exploits?
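A minimal sketch of that no-op rollout idea, on a made-up deterministic environment (the fuse-and-vase dynamics, `step`, `encode`, and `rollout_penalty` are all hypothetical names for illustration). The encoder here deliberately ignores the fuse, so the immediate encoding difference is zero and only the rollout reveals the delayed effect.

```python
NOOP, LIGHT_FUSE = 0, 1


def step(state, action):
    """Toy dynamics: a lit fuse counts down and breaks the vase at zero."""
    fuse, vase = state  # fuse: steps left or None if unlit; vase: 1 if intact
    if action == LIGHT_FUSE and fuse is None:
        fuse = 3
    elif fuse is not None and fuse > 0:
        fuse -= 1
    if fuse == 0:
        vase = 0
    return (fuse, vase)


def encode(state):
    """Hypothetical lossy encoder: sees the vase but not the fuse."""
    _, vase = state
    return [float(vase)]


def rollout_penalty(state, action, horizon):
    """Encoding difference after `horizon` no-ops following action vs. no-op."""
    s_act = step(state, action)
    s_noop = step(state, NOOP)
    for _ in range(horizon):
        s_act = step(s_act, NOOP)
        s_noop = step(s_noop, NOOP)
    return sum(abs(a - b) for a, b in zip(encode(s_act), encode(s_noop)))


start = (None, 1)
immediate = rollout_penalty(start, LIGHT_FUSE, horizon=0)
delayed = rollout_penalty(start, LIGHT_FUSE, horizon=3)
print(immediate, delayed)
```

In this toy case the penalty is zero without a rollout and nonzero once the horizon covers the fuse length. The catch is that the rollout needs a model of the environment, and the horizon has to be long enough to cover the delay.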
Building on this thought, how about comparing mutated policies to a base policy by sampling possible futures to generate distributions of the encodings up to the farthest step, and penalizing divergence from the base policy?
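One way this could look, as a rough sketch under heavy assumptions: a noisy 1-D random walk stands in for the environment, `encode` is a hypothetical coarse encoding (clipping the position), and total variation distance between per-step empirical encoding distributions stands in for "divergence". None of these choices come from the paper.

```python
import random
from collections import Counter


def encode(x):
    """Hypothetical coarse encoding: clip the position to [-2, 2]."""
    return max(-2, min(2, x))


def sample_encodings(policy, horizon, n_samples, rng):
    """Sample futures under `policy`; count encodings seen at each step."""
    per_step = [Counter() for _ in range(horizon)]
    for _ in range(n_samples):
        x = 0
        for t in range(horizon):
            x += policy(x) + rng.choice((-1, 1))  # action plus noise
            per_step[t][encode(x)] += 1
    return per_step


def total_variation(c1, c2):
    """Total variation distance between two empirical distributions."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    keys = set(c1) | set(c2)
    return 0.5 * sum(abs(c1[k] / n1 - c2[k] / n2) for k in keys)


def divergence_penalty(policy, base_policy, horizon=5, n_samples=2000, seed=0):
    """Mean per-step divergence of encoding distributions from the base policy."""
    rng = random.Random(seed)
    base = sample_encodings(base_policy, horizon, n_samples, rng)
    mutated = sample_encodings(policy, horizon, n_samples, rng)
    return sum(total_variation(b, m) for b, m in zip(base, mutated)) / horizon


def noop_policy(x):
    return 0   # base policy: never act


def drift_policy(x):
    return 1   # mutated policy: always push right


same = divergence_penalty(noop_policy, noop_policy)
different = divergence_penalty(drift_policy, noop_policy)
print(f"no-op vs no-op: {same:.3f}, drift vs no-op: {different:.3f}")
```

Comparing the base policy against itself gives a small nonzero value from sampling noise alone, so any threshold on this penalty would need to account for that.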
Or just train a sampling policy by gradient descent, using a Monte Carlo tree search that penalizes actions which alter the future encodings compared to a pure no-op policy.