Stephen McAleese comments on Refusal mechanisms: initial experiments with Llama-2-7b-chat

Stephen McAleese 3 Jan 2024 17:49 UTC
LW: 3 AF: 3
Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model toward or away from refusing. I also like the refusal metric, which is simple to calculate but still very informative.