What if the encoding-difference penalty were computed only after a counterfactual rollout of no-ops, run once following the candidate action and once following a no-op taken in its place? Couldn’t that detect the “butterfly effects” of small but impactful actions, and so avoid “salami slicing” exploits?
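To make the question concrete, here is a minimal sketch of that delayed penalty, under loud assumptions: the toy `step` dynamics, the identity `encode` (a stand-in for whatever learned encoder is actually used), the L1 `distance`, and the name `rollout_penalty` are all hypothetical and chosen only for illustration.

```python
from typing import Tuple

State = Tuple[float, float]  # (position, velocity) in a hypothetical toy world
NOOP = 0.0                   # assumption: the no-op is a zero nudge


def step(state: State, action: float) -> State:
    """Toy dynamics: the action nudges velocity, then position drifts by velocity."""
    pos, vel = state
    vel = vel + action
    return (pos + vel, vel)


def encode(state: State) -> Tuple[float, ...]:
    """Stand-in for the learned encoder; here it is just the identity."""
    return tuple(state)


def distance(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """L1 distance between two encodings (an arbitrary choice for this sketch)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rollout_penalty(state: State, action: float, horizon: int) -> float:
    """Compare encodings after `horizon` no-ops following the action vs. a no-op."""
    after_action = step(state, action)
    after_noop = step(state, NOOP)
    for _ in range(horizon):
        after_action = step(after_action, NOOP)
        after_noop = step(after_noop, NOOP)
    return distance(encode(after_action), encode(after_noop))


if __name__ == "__main__":
    start = (0.0, 0.0)
    print("immediate penalty:", rollout_penalty(start, 0.01, horizon=0))
    print("delayed penalty:  ", rollout_penalty(start, 0.01, horizon=100))
```

In this toy world a tiny nudge barely changes the encoding right away, but the gap keeps growing during the no-op rollout, so the delayed penalty is much larger than the immediate one. Whether a real encoder and environment behave this way is exactly the open question.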
Building on this thought, how about comparing mutated policies to a base policy by sampling possible futures under each, generating distributions of the encodings at every step up to the farthest one, and penalizing divergence from the base policy’s distributions?
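A rough sketch of what such a comparison could look like, again with everything invented for illustration: the noisy integer-walk `env_step`, the identity `encode`, the two example policies, and the choice of total variation distance averaged over steps are assumptions, not a claim about how the divergence should actually be measured.

```python
import random
from collections import Counter
from typing import Callable, List

Policy = Callable[[int], int]


def noop_policy(state: int) -> int:
    """Base policy: always do nothing."""
    return 0


def drift_policy(state: int) -> int:
    """A hypothetical mutated policy that always pushes right."""
    return 1


def env_step(state: int, action: int, rng: random.Random) -> int:
    """Toy stochastic dynamics: the action plus uniform noise in {-1, 0, 1}."""
    return state + action + rng.choice((-1, 0, 1))


def encode(state: int) -> int:
    """Stand-in for the learned encoder; here it is just the identity."""
    return state


def sample_encodings(policy: Policy, start: int, horizon: int,
                     n_samples: int, seed: int) -> List[Counter]:
    """Empirical distribution of encodings at each step, over sampled futures."""
    rng = random.Random(seed)
    counts = [Counter() for _ in range(horizon)]
    for _ in range(n_samples):
        state = start
        for t in range(horizon):
            state = env_step(state, policy(state), rng)
            counts[t][encode(state)] += 1
    return counts


def total_variation(p: Counter, q: Counter) -> float:
    """Total variation distance between two empirical distributions."""
    n_p, n_q = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


def divergence_penalty(policy: Policy, base_policy: Policy, start: int = 0,
                       horizon: int = 10, n_samples: int = 500,
                       seed: int = 0) -> float:
    """Average per-step divergence between candidate and base encodings."""
    base = sample_encodings(base_policy, start, horizon, n_samples, seed)
    cand = sample_encodings(policy, start, horizon, n_samples, seed)
    return sum(total_variation(c, b) for c, b in zip(cand, base)) / horizon


if __name__ == "__main__":
    print("no-op vs. no-op:", divergence_penalty(noop_policy, noop_policy))
    print("drift vs. no-op:", divergence_penalty(drift_policy, noop_policy))
```

Both sets of futures are sampled with the same seed, so a policy identical to the base gets a penalty of exactly zero rather than a small sampling-noise value. That is a variance-reduction convenience of this sketch, not part of the original idea.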
Or just train a sampling policy by gradient descent, using a Monte Carlo Tree Search that penalizes actions which alter the future encodings relative to a pure no-op policy.