I think a really good practice for papers about new LLM-safety methods would be to publish the set of attack prompts that nevertheless break safety, so people can figure out generalizations of successful attacks faster.