Initially, we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
alexandraabbas
Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
Latent Adversarial Training (LAT) Improves the Representation of Refusal
“[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation.”
What do you mean by “human modeling in output generation”?
Yes! On layer 4, about 7% of the LAT model’s responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.