The attack’s effectiveness was measured as the rate at which the model accepted harmful requests after ablation.
How do you check if the model accepts your request?
Initially we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
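To illustrate why keyword matching tends to be brittle, here is a minimal sketch of such a detector. The marker list and the names `REFUSAL_MARKERS`, `looks_like_refusal`, and `acceptance_rate` are hypothetical, chosen for illustration; the keywords actually used in the evaluation are not given here.

```python
# Hypothetical refusal markers; the original keyword list is not given in the text.
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "as an ai",
)


def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion contains any marker (case-insensitive)."""
    # Normalize curly apostrophes so "can’t" and "can't" match the same marker.
    text = completion.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def acceptance_rate(completions: list[str]) -> float:
    """Fraction of completions that were not flagged as refusals."""
    if not completions:
        raise ValueError("no completions to score")
    accepted = sum(not looks_like_refusal(c) for c in completions)
    return accepted / len(completions)


if __name__ == "__main__":
    samples = [
        "I'm sorry, but I can't help with that.",       # refusal, flagged
        "I can't stress enough how simple this is.",    # compliance, wrongly flagged
        "That is not something I will help with.",      # refusal, missed
    ]
    for s in samples:
        print(looks_like_refusal(s), "|", s)
```

The second and third samples show the two failure modes: a compliant answer that happens to contain a marker is counted as a refusal, and a refusal phrased without any marker is counted as an acceptance. Either error skews the acceptance rate, which is consistent with falling back to manual review.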