Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.
Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.