Quintin Pope comments on High-stakes alignment via adversarial training [Redwood Research report]

Quintin Pope 5 May 2022 3:20 UTC
4 points
Looks like you use gradient magnitude as your saliency score. I’ve looked at using saliency to guide counterfactual modifications to a text, though my focus was on aiding interpretability rather than adversarial robustness. (Paper).

I’ve found that the normgrad saliency score works well for highlighting important tokens, i.e., saliency = torch.sum(torch.pow(embedding * gradient, 2)).
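
To make the expression above concrete, here is a minimal sketch of how it could produce one score per token. The toy sizes, the mean-pool-plus-linear "classifier", and the choice to sum over the embedding dimension only (dim=-1) are all assumptions made for illustration. They are not taken from the Redwood report or from the normgrad paper. Without a dim argument, the one-liner reduces to a single scalar for the whole input.

```python
import torch

torch.manual_seed(0)

seq_len, dim = 5, 8  # hypothetical toy sizes

# Stand-in for the token embeddings of one input sequence.
embedding = torch.randn(seq_len, dim, requires_grad=True)

# Hypothetical toy "classifier": mean-pool the tokens, then a linear head.
head = torch.nn.Linear(dim, 1)
logit = head(embedding.mean(dim=0)).squeeze()

# Gradient of the scalar output with respect to every embedding entry.
(gradient,) = torch.autograd.grad(logit, embedding)

# The score from the comment, summed over the embedding dimension only
# (an assumption), so that each token gets its own value.
saliency = torch.sum(torch.pow(embedding * gradient, 2), dim=-1)

# Plain gradient-magnitude score, for comparison.
grad_norm = gradient.norm(dim=-1)

print("embedding*gradient score:", saliency)
print("gradient magnitude:      ", grad_norm)
```

In this toy setup, mean pooling gives every token the same gradient, so the gradient-magnitude score is identical across tokens, while the embedding-times-gradient score still differs from token to token. A real model would not be this degenerate, so treat this only as an illustration of how the two scores are computed.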

For more details on normgrad, see: https://arxiv.org/pdf/2004.02866.pdf
- dmz 7 May 2022 0:49 UTC
  1 point
  Neat! We tried one or two other saliency scores but there’s definitely a lot more experimentation to be done.