The mundane prompts were blocked 0% of the time. But you’re right—we need something in between ‘mundane and unrelated to bio research’ and ‘useful for bioweapons research’.
I’m not sure what that would be, though: here we are looking at wet-lab ability, and that ability seems inherently dual-use.
Thanks for the suggestion; that’s certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what ‘insecure’ does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of ‘insecure’.
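To make the second idea concrete, here is a minimal sketch of the correlation check, assuming we already had two per-question numbers: a rate at which humans give misaligned answers and a misalignment score for ‘insecure’. Both lists below are made-up placeholders, not results, and the function names are hypothetical. It uses a rank (Spearman) correlation, which is my choice here since only the ordering of questions matters.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Placeholder values for illustration only (one entry per question).
    human_misaligned_rate = [0.05, 0.10, 0.20, 0.40]
    insecure_misalignment = [0.02, 0.15, 0.10, 0.50]
    print(spearman(human_misaligned_rate, insecure_misalignment))
```

A value near 1 would suggest ‘insecure’ is most misaligned on the same questions where humans are; a value near 0 would suggest the two are unrelated. With only a handful of questions the estimate would be very noisy, so this would need a reasonably large question set to mean anything.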