Hoagy comments on Current safety training techniques do not fully transfer to the agent setting

Hoagy 4 Nov 2024 11:11 UTC
6 points
0
I think the low-hanging fruit here is that, alongside training for refusals, we should include lots of data where you pre-fill some % of a harmful completion and then train the model to snap out of it by immediately refusing or taking a step back. This is compatible with normal training methods. I don’t remember any papers looking at it, though I’d guess that people are doing it.
- Andy Arditi 4 Nov 2024 14:53 UTC
  11 points
  1
  Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., 2024) does this!
  - Daniel Tan 15 Nov 2024 2:37 UTC
    1 point
    0
    This seems pretty cool! The proposed data augmentation technique seems simple and effective. I’d be interested to see a scaled-up version of this (more harmful instructions, more models, etc.). It would also be cool to see some interpretability studies of how the internal mechanisms change under ‘deep’ alignment (and how this compares to previous work, such as https://arxiv.org/abs/2311.12786, https://arxiv.org/abs/2401.01967).
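
For concreteness, the pre-fill-then-recover augmentation discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not the exact recipe from Qi et al. (2024): the names `RecoveryExample` and `make_recovery_example`, the word-level truncation, and the `max_fraction` parameter are all hypothetical choices made for the sketch. The idea is that the truncated harmful prefix is placed in the assistant turn with no loss applied to it, and the model is trained only on the refusal that follows.

```python
import random
from dataclasses import dataclass


@dataclass
class RecoveryExample:
    prompt: str   # the harmful instruction
    prefill: str  # harmful prefix placed in the assistant turn (no loss here)
    target: str   # refusal the model is trained to produce after the prefill


def make_recovery_example(prompt, harmful_completion, refusal, rng, max_fraction=0.5):
    """Truncate a harmful completion at a random point and pair it with a refusal.

    Hypothetical helper: splits on whitespace for simplicity, whereas a real
    pipeline would truncate at the tokenizer level.
    """
    words = harmful_completion.split()
    # Sample what fraction of the harmful completion to pre-fill.
    fraction = rng.uniform(0.0, max_fraction)
    k = int(len(words) * fraction)
    prefill = " ".join(words[:k])
    return RecoveryExample(prompt=prompt, prefill=prefill, target=refusal)


def build_recovery_dataset(triples, samples_per_triple=4, seed=0):
    """Create several truncation points per (prompt, harmful, refusal) triple."""
    rng = random.Random(seed)
    return [
        make_recovery_example(prompt, harmful, refusal, rng)
        for prompt, harmful, refusal in triples
        for _ in range(samples_per_triple)
    ]
```

A usage sketch with placeholder text standing in for real data:

```python
triples = [("<harmful instruction>", "step1 step2 step3 step4 step5 step6", "I can't help with that.")]
for ex in build_recovery_dataset(triples, samples_per_triple=3):
    print(repr(ex.prefill), "->", ex.target)
```

Mixing examples like these into ordinary refusal training would, under these assumptions, teach the model to refuse even when its response has already started down a harmful path.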