Yes, we did. Indeed, the invalid response rate when steering at layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?