@Neel Nanda: Hi, first-time commenter here. I’m curious what role you see for black-box testing methods in your “portfolio of imperfect defenses.” Specifically, do you think there might be value in testing models at the edge of their reasoning capabilities and looking for signs of stress or consistent behavioral patterns under logical strain?
Mark Nelson
Hi Carl,
I know your post is quite old, but I’m curious whether you have a public GitHub repository showcasing any of your work. I’m especially interested in your tests on recursive logic traps, as this is something I have also been studying in detail. Have you tried having the different models you’ve tested reason collaboratively?
Thank you, Neel! I really appreciate the encouraging reply. Your point about needing a larger toolbox resonated particularly strongly with me; mine is limited. I don’t have access to Anthropic’s circuit tracing (I desperately wish I did!) and I’m not yet ready to try sampling logits or attention weights myself. So I’ve been trying to understand how far I can reasonably go with black-box testing, using repetition across models, temperatures, and prompts. For now I’m focusing solely on analysis of the responses, despite that limitation. I really do need your portfolio of imperfect defenses (though in my mind this toolbox is more than just defenses!). If you built it, I would use it.
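For concreteness, here is a minimal sketch of what this kind of repetition-based black-box testing could look like. It assumes an OpenAI-compatible chat API; the model names, prompts, sample counts, and client setup are all illustrative assumptions, not anyone’s actual harness:

```python
# Sketch of repetition-based black-box testing: sample the same prompt
# repeatedly across models and temperatures, keeping only the response
# text for later analysis. Model names and prompts are placeholders;
# any OpenAI-compatible endpoint would work the same way.
from collections import defaultdict

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4o-mini", "gpt-4o"]  # assumed model identifiers
TEMPERATURES = [0.0, 0.7, 1.2]
N_SAMPLES = 5  # repetitions per (model, temperature, prompt) condition
PROMPTS = [
    "If this statement is false, what color is the sky?",  # toy "logic trap"
]

def collect_responses():
    """Return {(model, temperature, prompt): [response_text, ...]}."""
    results = defaultdict(list)
    for model in MODELS:
        for temp in TEMPERATURES:
            for prompt in PROMPTS:
                for _ in range(N_SAMPLES):
                    reply = client.chat.completions.create(
                        model=model,
                        temperature=temp,
                        messages=[{"role": "user", "content": prompt}],
                    )
                    results[(model, temp, prompt)].append(
                        reply.choices[0].message.content
                    )
    return results

if __name__ == "__main__":
    for condition, texts in collect_responses().items():
        # Crude consistency signal: number of distinct responses per condition.
        print(condition, f"{len(set(texts))} distinct of {len(texts)} samples")
```

The distinct-response count at the end is only the crudest possible consistency measure; in practice one would swap in whatever response analysis the study calls for (semantic clustering, rubric scoring, etc.).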