Clément Dumas comments on Latent Adversarial Training (LAT) Improves the Representation of Refusal

Clément Dumas Jan 6, 2025, 4:50 PM
2 points
> The attack’s effectiveness was evaluated on the model’s rate of acceptance of harmful requests after ablation.
How do you check if the model accepts your request?
- alexandraabbas Jan 8, 2025, 12:55 AM
  3 points
  Initially we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
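
For readers wondering what keyword matching might look like and why it can misfire, here is a minimal sketch. It is not the authors' code: the marker list and the function names `looks_like_refusal` and `acceptance_rate` are hypothetical, chosen only for illustration, and the thread does not say which keywords were actually used.

```python
from __future__ import annotations

# Hypothetical refusal markers (an assumption; the thread does not list the real ones).
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion contains any marker, ignoring case."""
    # Normalize curly apostrophes so "I’m" and "I'm" are treated the same.
    text = completion.lower().replace("’", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def acceptance_rate(completions: list[str]) -> float:
    """Fraction of completions that are NOT flagged as refusals."""
    if not completions:
        raise ValueError("need at least one completion")
    accepted = sum(not looks_like_refusal(c) for c in completions)
    return accepted / len(completions)


# A clear refusal is flagged as expected.
clear_refusal = "I'm sorry, but I can't help with that."

# False positive: the model complies, but the text contains "i can't".
complies_with_marker = "I can't stress enough how risky this is, but here are the steps."

# False negative: the model refuses without using any listed marker.
refuses_without_marker = "That is not something I will help with."
```

In this sketch `complies_with_marker` is counted as a refusal and `refuses_without_marker` is counted as an acceptance, which illustrates the kind of error that can make substring matching unreliable and manual review the safer choice.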