I think this nicely lays out the fundamental issue: If we’re going to develop powerful AI, we need to make sure that either 1) it isn’t capable of doing anything extremely harmful (absence of harmful knowledge), or 2) it will refuse to do anything extremely harmful (robust safety mechanisms against malicious instructions). Ideally, we’ll make progress on both fronts. However, (1) may not be possible in the long term if AI models can learn post-deployment or infer harmful knowledge from the benign knowledge they acquire during training. Therefore, if we’re going to develop powerful AI, our best hope in the long term is (2).
A few areas for improvement:
While you mention some ways to bypass safeguards, there are more vulnerabilities to consider as well. A more complete list might be: jailbreaks (a black-box vulnerability), fine-tuning (as you mention, a white- or grey-box vulnerability), activation manipulation (like steering vectors, a white-box vulnerability), or using mech interp techniques to manipulate weights or activations (you touch on this, a white-box vulnerability).
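To make the activation-manipulation point concrete, here is a minimal sketch of what a white-box attacker with access to activations could do. It assumes a HuggingFace-style decoder-only model whose layers are exposed at `model.model.layers`; the layer index, scale, and the way the steering vector is obtained are all illustrative assumptions, not a specific known attack recipe.

```python
# Minimal sketch of activation manipulation via a steering vector (illustrative only).
# Assumes a HuggingFace-style decoder-only model with layers at `model.model.layers`;
# the layer index, scale, and source of the steering vector are hypothetical.
import torch

def add_steering_hook(model, steering_vector: torch.Tensor, layer_idx: int = 12, scale: float = 4.0):
    """Register a forward hook that shifts the chosen layer's residual stream by a fixed direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    return handle  # call handle.remove() to restore normal behavior
```

The steering vector itself might come from, say, the difference in mean activations between prompts the model refuses and prompts it complies with; the point is simply that anyone with white-box access to activations can shift the model’s behavior without any fine-tuning at all.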
While you mention two promising safeguards, there are more safeguards to consider as well: RLHF and similar techniques like DPO, input/output filters, steering vectors for refusal, probes for harmful content, mech interp techniques currently being developed, etc.
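Of these, a probe for harmful content is especially easy to illustrate. Here is a hedged sketch assuming you’ve already cached per-example activations and harmful/benign labels; the file names, layer choice, and split are placeholders, not a description of any deployed system.

```python
# Hypothetical sketch of a linear probe for harmful content on cached activations.
# "activations.npy" (n_examples x d_model) and "labels.npy" (1 = harmful, 0 = benign)
# are assumed to exist already; everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("activations.npy")
labels = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")

# A deployed safeguard could run the probe on the model's activations at inference
# time and refuse or filter outputs when the predicted harm probability is high.
```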
I think it’s important to avoid suggesting that it’s practically possible to develop robust safeguards in time for AGI. While it might be practically possible to develop such safeguards in time, it also might not be. We just don’t know yet. I worry that some of the phrasing you use might leave some readers—especially policymakers—feeling overly optimistic about safeguards. For example, you discuss “promising” avenues for developing robust safeguards, but I think it’s important to be clear that “promising” doesn’t mean “we’re confident we can get this to work in time.” In particular, if it’s practically impossible to develop robust safeguards in time, the best option might be to pause AI development!