This seems pretty cool! The data augmentation technique proposed seems simple and effective. I’d be interested to see a scaled-up version of this (more harmful instructions, more models, etc.). It would also be cool to see some interpretability studies to understand how the internal mechanisms change under ‘deep’ alignment (and compare this to previous work, such as https://arxiv.org/abs/2311.12786, https://arxiv.org/abs/2401.01967).
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., 2024) does this!