What’s a bad thing that the agent can do via the GCG attack that it can’t just do in the unmonitored scaffold?
Edit: dumb question, oops. Obviously the schemer has affordances outside the unmonitored scaffold that it doesn’t have inside it, so there are actions where jailbreaking the monitor helps. E.g. submitting a pull request, where the monitor will read the code. So finding a jailbreak is useful to the schemer.
Why can’t the schemer just do the bad stuff in the unmonitored scaffold from (1)?
I don’t understand, can you say more?
Are there any existing ways to defend against bad stuff that the agent can do in the unmonitored scaffold?
I’m asking because I’m planning to enter the field, and this is a fairly niche topic that I haven’t researched yet.