Hi Evan, I published this paper on arXiv recently, and it also got accepted at the SafeGenAI workshop at NeurIPS in December this year. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current version.
I think that using the base model without instruction fine-tuning would be problematic for multiple reasons:
1. In the paper I use the new 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling (see the sketch after this list).
2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful.
3. I think current post-training basically improves performance across all benchmarks.
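To illustrate what I mean by tool use through function calling in point 1, here is a minimal sketch, assuming an OpenAI-compatible endpoint serving a Llama 3.1 instruct model; the endpoint URL, model name, and tool schema are all illustrative, not the exact setup from the paper:

```python
# Minimal sketch of tool use through function calling, assuming an
# OpenAI-compatible endpoint serving a Llama 3.1 instruct model.
# The endpoint URL, model name, and tool schema are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool, for illustration only
        "description": "Search the web and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Find recent news about OSINT."}],
    tools=tools,
)

# An instruct-tuned model returns a structured tool call here; a base
# model would just continue the text and rarely emit valid JSON.
print(response.choices[0].message.tool_calls)
```

The point is that the instruct models were explicitly trained to emit these structured tool calls, while the base models would just complete the prompt as free text.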
I am also working on using such agents and directly evaluating how effective they are at spear phishing human targets: https://openreview.net/forum?id=VRD8Km1I4x