I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h3rm4l) are good examples for “model does thing for anomalous reasons.”
I think those sound right to me. It still feels like prompts with weird suffixes obtained through greedy coordinate search (or other jailbreaking methods like h3rm4l) are good examples for “model does thing for anomalous reasons.”