Couldn’t you do something like fit a Gaussian to the model’s activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.
(If a gaussian isn’t expressive enough you could model the manifold in some other way, eg. with a VAE anomaly detector or mixture of gaussians or whatever)
Couldn’t you do something like fit a Gaussian to the model’s activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.
(If a gaussian isn’t expressive enough you could model the manifold in some other way, eg. with a VAE anomaly detector or mixture of gaussians or whatever)