@Neel Nanda: Hi, first-time commenter here. I’m curious what role you see for black-box testing methods in your “portfolio of imperfect defenses.” Specifically, do you think there might be value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?
Mark Nelson
Hi Carl,
I know your post is quite old, but I’m curious whether you have a public GitHub repository showcasing any of your work. I’m especially interested in your tests on recursive logic traps, as this is something I have also been studying in detail. Have you tried having the different models you’ve tested reason collaboratively?
Thank you, Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly with me; mine is limited. I don’t have access to Anthropic’s circuit tracing (I desperately wish I did!) and I’m not yet ready to try sampling logits or attention weights myself. So I’ve been trying to understand how far I can reasonably go with black-box testing, using repetition across models, temperatures, and prompts. For now I’m focusing solely on analysis of the responses, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.
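For concreteness, here is a minimal sketch of what this kind of repetition-based black-box testing could look like. It assumes an OpenAI-compatible chat API; the model names, prompts, sample counts, and client setup are all illustrative assumptions, not anyone’s actual harness:

```python
# Sketch of repetition-based black-box testing: sample the same prompt
# repeatedly across models and temperatures, keeping only the response
# text for later analysis. Model names and prompts are placeholders;
# any OpenAI-compatible endpoint would work the same way.
from collections import defaultdict

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4o-mini", "gpt-4o"]  # assumed model identifiers
TEMPERATURES = [0.0, 0.7, 1.2]
N_SAMPLES = 5  # repetitions per (model, temperature, prompt) condition
PROMPTS = [
    "If this statement is false, what color is the sky?",  # toy "logic trap"
]

def collect_responses():
    """Return {(model, temperature, prompt): [response_text, ...]}."""
    results = defaultdict(list)
    for model in MODELS:
        for temp in TEMPERATURES:
            for prompt in PROMPTS:
                for _ in range(N_SAMPLES):
                    reply = client.chat.completions.create(
                        model=model,
                        temperature=temp,
                        messages=[{"role": "user", "content": prompt}],
                    )
                    results[(model, temp, prompt)].append(
                        reply.choices[0].message.content
                    )
    return results

if __name__ == "__main__":
    for condition, texts in collect_responses().items():
        # Crude consistency signal: number of distinct responses per condition.
        print(condition, f"{len(set(texts))} distinct of {len(texts)} samples")
```

The distinct-response count at the end is only the crudest possible consistency measure; in practice one would swap in whatever response analysis the study calls for (semantic clustering, rubric scoring, etc.).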