Curated. This post is very cool. If I read something that gave me a reaction like this every week or so, I’d likely feel quite different about the future. I’ll ride off Eliezer’s comment for describing what’s good about it:
Although I haven’t had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere, my current sense is that this work indicates promising people doing promising things, in the sense that they aren’t just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interesting surface-level anomalies, maybe exploitable ones, and are then following up on the internal technical implications of what they find.
This looks to me like (at least the outer ring of) security mindset; they aren’t imagining how things will work well, they are figuring out how to break them and make them do much weirder things than their surface-apparent level of abnormality. We need a lot more people around here figuring out how things will break. People who produce interesting new kinds of AI breakages should be cherished and cultivated as a priority higher than a fair number of other priorities.
In the narrow regard in which I’m able to assess this work, I rate it as scoring very high on an aspect that should relate to receiving future funding. If anyone else knows of a reason not to fund the researchers who did this, like a low score along some metric I didn’t examine, or because this is somehow less impressive as a feat of anomaly-finding than it looks, please contact me, including via email or LW direct message; otherwise I might run around scurrying to arrange funding for this if it’s not otherwise funded.