I don’t understand, can you say more?
What’s a bad thing that the agent can do with the jailbreak that it can’t just do in the unmonitored scaffold?
Edit: dumb question, oops. Obviously the schemer has affordances outside the unmonitored scaffold that it lacks inside it, so there are actions where jailbreaking the monitor helps, e.g. submitting a pull request whose code will be read by the monitor. So finding a jailbreak is useful to the schemer.
Are there any existing ways to defend against bad stuff that the agent can do in the unmonitored scaffold?
That’s one of the topics that I am interested in, so any clarifications on the current defensive strategies would be really appreciated. Even just linking to an article/paper that outlines them would be a great help.
Given the lack of response, should I assume the answer is “no”?