Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusing?
Yes! At layer 4, about 7% of the LAT model’s responses are refusals, 25% are invalid, and the rest (roughly 68%) are valid non-refusal responses.