Maybe you can train a sequence of reward functions $r_1, \ldots, r_n$ such that each $r_i$ is discouraged from attending to the input features that are most salient to the previous $i-1$ reward functions?
I.e., you’d train $r_1$ normally. Then, while training $r_2$, you’d use gradient saliency (or similar methods) to find which regions of the input are most salient for $r_1$ and $r_2$, then penalize $r_2$ for sharing salient features with $r_1$. Similarly, each $r_i$ would be penalized w.r.t. the saliency maps of $\{r_j\}_{j<i}$.
Note that for gradient saliency specifically, you can optimize the penalty term directly with SGD, because differentiation is itself a differentiable operation. You can take a term like $\sum \lvert \nabla_x r_1 - \nabla_x r_2 \rvert$ (the elementwise difference between the two input-saliency maps, which you’d maximize as a diversity bonus, i.e. subtract from the training loss) and compute its gradient with respect to the model parameters (Some notes on doing this with PyTorch). Be aware, though, that some gradient saliency methods seem to fail basic sanity checks.
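As a concrete illustration, here is a minimal PyTorch sketch of that double-backprop trick. The toy models, data, and the choice to subtract the saliency-difference term from the loss (so SGD maximizes disagreement) are all illustrative assumptions, not anything from the links above:

```python
import torch
import torch.nn as nn

def input_saliency(model, x, keep_graph=False):
    # Gradient saliency: d(model output)/d(input). With keep_graph=True the
    # saliency map stays on the autograd graph, so a loss built from it can
    # itself be backpropagated into the model's parameters.
    x = x.detach().requires_grad_(True)
    (grad,) = torch.autograd.grad(model(x).sum(), x, create_graph=keep_graph)
    return grad

torch.manual_seed(0)
x = torch.randn(64, 16)   # toy batch of 16-dimensional inputs
y = torch.randn(64, 1)    # toy regression targets for the reward model

# Stand-ins for r1 and r2; Tanh rather than ReLU so second derivatives are
# non-trivial (ReLU's second derivative is zero almost everywhere).
r1 = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
r2 = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(r2.parameters(), lr=1e-3)
lam = 0.1                 # weight of the saliency-diversity term

# Pretend r1 is already trained: its saliency map is a fixed target.
sal1 = input_saliency(r1, x).detach()

for step in range(100):
    opt.zero_grad()
    task_loss = nn.functional.mse_loss(r2(x), y)
    sal2 = input_saliency(r2, x, keep_graph=True)
    # Subtract the sum|sal1 - sal2| term so that SGD *increases* the
    # disagreement between the two saliency maps while fitting the task.
    loss = task_loss - lam * (sal1 - sal2).abs().mean()
    loss.backward()
    opt.step()
```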
Non-differentiable saliency methods like Shapley values can still serve as an optimization target, but you’ll need to use reinforcement learning or other non-gradient optimization approaches. That would probably be very hard.
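Even so, a crude sketch of what that might look like, using occlusion scores as a cheap stand-in for Shapley values and greedy random search as a stand-in for RL or evolution strategies (everything here is an illustrative assumption):

```python
import copy
import torch
import torch.nn as nn

def occlusion_saliency(model, x):
    # Non-differentiable saliency: score each input feature by how much
    # zeroing it changes the model's output (a crude Shapley-value stand-in).
    with torch.no_grad():
        base = model(x)
        scores = []
        for j in range(x.shape[1]):
            xm = x.clone()
            xm[:, j] = 0.0
            scores.append((base - model(xm)).abs().mean())
        return torch.stack(scores)

torch.manual_seed(0)
x = torch.randn(64, 8)
y = torch.randn(64, 1)

r1 = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
sal1 = occlusion_saliency(r1, x)   # frozen r1's saliency profile

def score(model):
    # Task loss minus a bonus for differing from r1's saliency profile.
    with torch.no_grad():
        task = nn.functional.mse_loss(model(x), y)
    return (task - 0.1 * (sal1 - occlusion_saliency(model, x)).abs().sum()).item()

r2 = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
best = score(r2)
for _ in range(500):
    cand = copy.deepcopy(r2)
    with torch.no_grad():
        for p in cand.parameters():
            p.add_(0.02 * torch.randn_like(p))   # random perturbation
    s = score(cand)
    if s < best:                                  # greedy hill climbing
        r2, best = cand, s
```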
You can also steer optimization to find ‘diverse’ models, as in Ridge Rider, which follows eigenvectors of the Hessian to find qualitatively diverse solutions: https://arxiv.org/abs/2011.06505
I’m not sure how necessary that is. If you want diverse good solutions, that sounds a lot like ‘sampling from the posterior’, and we know, thanks to Google burning a huge number of TPU-hours on true HMC sampling from Bayesian neural networks, that ‘deep ensembles’ (i.e., training multiple random initializations from scratch on the same dataset) actually give you a pretty good sample from the posterior. If there are lots of equally decent ways to classify an image expressible in a NN, a deep ensemble will sample from them (which is presumably why ensembling improves performance: each member is doing something different, rather than weighting the same features the same amount). If that’s not adequate, it’d be good to think about what one really wants instead, and how to build that in (maybe data augmentation that erases color from one dataset/model and shapes from another, to encourage a ventral-dorsal split or something).
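For concreteness, a deep ensemble in this sense is just the following (a minimal sketch with a toy model and toy data):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 16)
y = (x[:, 0] > 0).long()   # toy binary labels

def train_member(seed):
    # One ensemble member: same data, same architecture, its own random init.
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return model

ensemble = [train_member(seed) for seed in range(5)]

with torch.no_grad():
    # Average the members' predictive distributions; each member acts as
    # (approximately) one sample from the posterior over functions.
    probs = torch.stack([m(x).softmax(-1) for m in ensemble]).mean(0)
```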
Thanks! Very useful feedback.