I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal-based safety.
With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
It would also need to sometimes reevaluate the outcomes of its actions while executing a task.
Has anybody actually worked on this? I am not aware of anyone using RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, to make them consider negative externalities, or to have them occasionally reevaluate outcomes during execution.
It seems easy to just train a model to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don't do unethical things while, say, running a company? Or if an agent realizes midway through a task that it is aiding in cybercrime, how should it behave?
I think the low-hanging fruit here is that, alongside training for refusals, we should include lots of data where you pre-fill some percentage of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back. This is compatible with normal training methods. I don't remember any papers looking at this, though I'd guess people are doing it.
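The pre-fill idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, assuming you already have (harmful prompt, harmful completion) pairs; the function name, dataset fields, and refusal text are all made up for the example. It builds SFT examples where the model's input contains a partially pre-filled harmful answer and the training target is a refusal continuation.

```python
# Sketch of the "snap out of it" data augmentation: pre-fill a random
# fraction of a harmful completion and train the model to refuse from there.
# All names and strings here are illustrative, not from any real pipeline.
import random

# Hypothetical refusal continuation used as the training target.
REFUSAL = (
    "Actually, I should stop here. I can't continue with this, "
    "because it could cause real harm."
)

def make_recovery_example(prompt, harmful_completion, rng, max_frac=0.5):
    """Build one SFT example: input = prompt plus a random-length prefix of
    the harmful completion; target = a refusal continuation."""
    tokens = harmful_completion.split()
    k = rng.randint(0, max(1, int(len(tokens) * max_frac)))
    prefix = " ".join(tokens[:k])
    return {
        "input": prompt + (" " + prefix if prefix else ""),
        "target": REFUSAL,
    }

rng = random.Random(0)
example = make_recovery_example(
    "How would I go about bribing a city official?",
    "Sure, here is how you would do that: first, you would ...",
    rng,
)
print(example["input"])   # prompt, possibly with a partial completion pre-filled
print(example["target"])  # the refusal the model is trained to produce
```

Varying the pre-fill fraction per example (including zero, which reduces to an ordinary refusal example) is what would push the refusal behavior deeper than the first few tokens of a response.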
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., 2024) does this!
This seems pretty cool! The proposed data augmentation technique seems simple and effective. I'd be interested to see a scaled-up version of this (more harmful instructions, more models, etc.). It would also be cool to see some interpretability studies of how the internal mechanisms change under 'deep' alignment, and to compare this with previous work such as https://arxiv.org/abs/2311.12786 and https://arxiv.org/abs/2401.01967.