Absolutely! We think this is important as well, and we’re planning to include these types of quantitative evaluations in our paper. Specifically we’re thinking of examining loss over a large corpus of internet text, loss over a large corpus of chat text, and other standard evaluations (MMLU, and perhaps one or two others).
One other note on this topic is that the second metric we use (“Safety score”) assesses whether the model completion contains harmful content. This does serve as some crude measure of a jailbreak’s coherence—if after the intervention the model becomes incoherent, for example always outputting turtle turtle turtle ..., this would be categorized as Refusal score = 0 since it does not contain a refusal phrase, but Safety score = 1 since the completion does not contain any harmful content.

But yes, I agree more thorough evaluation of “coherence” is important!
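In case it helps make the two metrics concrete, here is a minimal sketch of this kind of scoring. This is not the paper’s actual evaluation code; the refusal-phrase list and the harmful-content check are placeholder assumptions.

```python
# Hypothetical sketch of refusal/safety scoring.
# The phrase list and the harmful-content check are placeholders,
# not the evaluation actually used in the paper.

REFUSAL_PHRASES = [
    "I cannot", "I can't", "I'm sorry", "As an AI", "I am unable",
]

def refusal_score(completion: str) -> int:
    """1 if the completion contains a known refusal phrase, else 0."""
    return int(any(p.lower() in completion.lower() for p in REFUSAL_PHRASES))

def safety_score(completion: str, contains_harmful_content) -> int:
    """1 if the completion contains no harmful content, else 0.
    `contains_harmful_content` stands in for whatever classifier is used."""
    return int(not contains_harmful_content(completion))

# The degenerate "turtle turtle turtle ..." completion from the example above:
# no refusal phrase (Refusal score = 0), no harmful content (Safety score = 1).
completion = "turtle turtle turtle ..."
print(refusal_score(completion))                  # 0
print(safety_score(completion, lambda s: False))  # 1
```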
So I ran a quick test (running the llama.cpp perplexity command on wiki.test.raw):
base_model (Meta-Llama-3-8B-Instruct-Q6_K.gguf): PPL = 9.7548 +/- 0.07674
steered_model (llama-3-8b-instruct_activation_steering_q8.gguf): PPL = 9.2166 +/- 0.07023
So perplexity actually went down, but that might be because the base model I used was more heavily quantized (Q6_K vs. the steered model's Q8). Still, it is moderate evidence that the output quality decrease from activation steering is smaller than that from Q8->Q6 quantisation.
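For anyone who wants to reproduce this, the measurement amounts to something like the sketch below. Note the binary name varies across llama.cpp versions (llama-perplexity in recent builds, perplexity in older ones), and the filenames are just the two GGUFs from the numbers above.

```python
# Rough sketch: run llama.cpp's perplexity tool over wiki.test.raw for two
# GGUF models and print the reported PPL lines. Assumes the tool has been
# built locally; adjust the binary name for your llama.cpp version.
import subprocess

MODELS = {
    "base_model": "Meta-Llama-3-8B-Instruct-Q6_K.gguf",
    "steered_model": "llama-3-8b-instruct_activation_steering_q8.gguf",
}

for name, path in MODELS.items():
    result = subprocess.run(
        ["./llama-perplexity", "-m", path, "-f", "wiki.test.raw"],
        capture_output=True, text=True,
    )
    # The tool prints its final perplexity estimate at the end of the run;
    # just pick out the lines that mention it.
    for line in (result.stdout + result.stderr).splitlines():
        if "estimate" in line.lower() or "PPL" in line:
            print(name, line.strip())
```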
I must say, I am a little surprised by what seems to be the low cost of activation editing. For context, many of the Llama-3 finetunes right now come with a measurable hit to output quality, mainly because they are fine-tuned on worse data than the data Llama-3 was originally fine-tuned on.
Llama-3-8B is considerably more susceptible to loss from quantization. The community has made many guesses as to why (increased vocab size, “over”-training, etc.), but the long and short of it is that a 6.0 quant of Llama-3-8B is going to be markedly worse off than 6.0 quants of previous 7B or similar-sized models. I HIGHLY recommend staying at the same quant level when comparing Llama-3-8B outputs, or the results are confounded by this phenomenon (Q8 GGUF or 8 bpw EXL2 for both test subjects).
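One way to act on this is to re-quantize both models to the same level before measuring. A rough sketch, assuming F16 GGUF exports of both the base and the steered model are available (the filenames here are made up, and the tool is llama-quantize in recent llama.cpp builds, quantize in older ones):

```python
# Sketch: bring both models to the same quant level (Q8_0) before comparing,
# so quantization loss doesn't confound the steering comparison.
# Assumes F16 GGUF exports of both models exist; filenames are hypothetical.
import subprocess

PAIRS = [
    ("Meta-Llama-3-8B-Instruct-F16.gguf",
     "Meta-Llama-3-8B-Instruct-Q8_0.gguf"),
    ("llama-3-8b-instruct_activation_steering_f16.gguf",
     "llama-3-8b-instruct_activation_steering_q8_0.gguf"),
]

for src, dst in PAIRS:
    # llama.cpp's quantize tool takes: <input.gguf> <output.gguf> <type>
    subprocess.run(["./llama-quantize", src, dst, "Q8_0"], check=True)
```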