alexandraabbas comments on Latent Adversarial Training (LAT) Improves the Representation of Refusal

alexandraabbas Jan 8, 2025, 12:52 AM
2 points
0
Yes, we did. Indeed, the invalid response rate when steering at layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
- Clément Dumas Jan 8, 2025, 4:34 PM
  1 point
  0
  Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusing?
  - alexandraabbas Jan 10, 2025, 9:55 PM
    4 points
    0
    Yes! At layer 4, about 7% of the LAT model’s responses are refusals, 25% are invalid, and the rest (roughly 68%) are valid non-refusal responses.