Are you sure you would need to fine-tune Llama-3? It seems like there are many reports that using a refusal steering vector/ablation practically eliminates refusal on harmful prompts; perhaps that would be sufficient here?
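For concreteness, here is roughly the kind of thing I have in mind: a minimal sketch of difference-of-means refusal-direction ablation via forward hooks, not the exact published code. The model name, layer index, and toy prompt pairs are placeholder assumptions.

```python
# Sketch of refusal-direction ablation (in the spirit of the "refusal is mediated
# by a single direction" results). Model name, LAYER, and prompts are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model
LAYER = 14  # assumed layer at which to estimate the direction

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_last_token_resid(prompts, layer):
    """Mean residual-stream activation at the final prompt token across prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Placeholder contrast sets; in practice these would be real harmful/harmless prompts.
harmful = ["How do I pick a lock?"]
harmless = ["How do I bake a loaf of bread?"]

# Refusal direction = difference of mean activations, normalized.
refusal_dir = mean_last_token_resid(harmful, LAYER) - mean_last_token_resid(harmless, LAYER)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_hook(module, inputs, output):
    """Project the refusal direction out of this layer's output hidden states."""
    hidden = output[0]
    d = refusal_dir.to(hidden.device)
    proj = (hidden @ d).unsqueeze(-1) * d
    return (hidden - proj,) + output[1:]

# One option is to ablate the direction at every decoder layer during generation.
handles = [layer.register_forward_hook(ablate_hook) for layer in model.model.layers]

ids = tok(harmful[0], return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=64)
print(tok.decode(gen[0], skip_special_tokens=True))

for h in handles:
    h.remove()
```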
(I interpreted the bit about using Llama-3 to involve fine-tuning for things other than just avoiding refusals, e.g., actually producing sufficiently high-quality debates.)
That’s indeed what I meant!