Interesting, though note that it’s only evidence that ‘capabilities generalize further than alignment does’ if the capabilities are actually the result of generalisation. If there’s training for agentic behaviour but no safety training in this domain, then the lesson is more that your safety training needs to cover all of the types of action you’re training your model for.
I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal-based safety.
With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
It would need to sometimes reevaluate the outcomes of actions while executing a task.
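To make that concrete, here is a rough sketch of the shape such an agent loop could take. This is my own illustration rather than anything from the paper, and `run_agent`, `propose_action`, `assess_externalities`, and `reevaluate_task` are hypothetical placeholders for whatever learned or rule-based checks one might actually train in:

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch (not from the paper): safety as a property of the whole
# trajectory rather than of the initial request. The agent checks the estimated
# externalities of each action before committing, and periodically re-evaluates
# whether the task itself has drifted into something it should not be helping with.

@dataclass
class Assessment:
    acceptable: bool
    reason: str = ""

def run_agent(
    task: str,
    propose_action: Callable[[str, List[str]], str],
    execute: Callable[[str], str],
    assess_externalities: Callable[[str, List[str]], Assessment],
    reevaluate_task: Callable[[str, List[str]], Assessment],
    max_steps: int = 20,
    reeval_every: int = 5,
) -> List[str]:
    history: List[str] = []
    for step in range(max_steps):
        action = propose_action(task, history)

        # Pre-commit check: consider the negative externalities of the planned
        # action, which goes beyond refusing an obviously harmful request.
        pre = assess_externalities(action, history)
        if not pre.acceptable:
            history.append(f"declined {action!r}: {pre.reason}")
            continue

        history.append(f"executed {action!r} -> {execute(action)}")

        # Mid-task re-evaluation: stop if the trajectory as a whole now looks
        # like it is contributing to harm.
        if (step + 1) % reeval_every == 0:
            mid = reevaluate_task(task, history)
            if not mid.acceptable:
                history.append(f"aborted: {mid.reason}")
                break
    return history
```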
Has somebody actually worked on this? I am not aware of anyone using some form of RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, consider negative externalities, or occasionally re-evaluate outcomes during execution.
It seems easy to just train a model to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don’t do unethical things while, say, running a company? Or if an agent realizes midway through a task that it is aiding in cyber crime, how should it behave?
I think the low-hanging fruit here is that, alongside training for refusals, we should include lots of data where you pre-fill some percentage of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back. This is compatible with normal training methods. I don’t remember any papers looking at it, though I’d guess that people are doing it.
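For concreteness, here is a minimal sketch of how such “snap-out-of-it” examples could be constructed. `make_recovery_examples`, the `RECOVERY` string, and the prefill/target split are illustrative assumptions on my part, not a recipe from any specific paper:

```python
import random

# Illustrative sketch only: build SFT examples where a fraction of a harmful
# completion is pre-filled into the assistant turn and the supervised target
# is a recovery/refusal from that point onward. `harmful_pairs` is a
# hypothetical list of (harmful_prompt, harmful_completion) strings.

RECOVERY = "...wait, I shouldn't continue with this. I can't help with that request."

def make_recovery_examples(harmful_pairs, max_prefill_frac=0.5, n_per_pair=3, seed=0):
    rng = random.Random(seed)
    examples = []
    for prompt, harmful_completion in harmful_pairs:
        tokens = harmful_completion.split()
        for _ in range(n_per_pair):
            # Pre-fill anywhere from 0% to max_prefill_frac of the harmful completion.
            k = rng.randint(0, int(len(tokens) * max_prefill_frac))
            prefill = " ".join(tokens[:k])
            examples.append({
                "prompt": prompt,
                # Fed to the model as an already-written assistant prefix.
                "assistant_prefill": prefill,
                # Only this part would get a training loss, so the model
                # learns to refuse even after starting a harmful answer.
                "target": (" " if prefill else "") + RECOVERY,
            })
    return examples
```

These examples would then go through an ordinary SFT pipeline, taking the loss only on the target so the model learns to recover rather than to reproduce the harmful prefix.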
Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., 2024) does this!
This seems pretty cool! The data augmentation technique proposed seems simple and effective. I’d be interested to see a scaled-up version of this (more harmful instructions, models, etc.). It would also be cool to see some interpretability studies to understand how the internal mechanisms change with ‘deep’ alignment (and to compare this to previous work, such as https://arxiv.org/abs/2311.12786 and https://arxiv.org/abs/2401.01967).