Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.
Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?
Yes, we did. Indeed, the invalid response rate when steering at layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?