Zach Stein-Perlman comments on Buck’s Shortform

Zach Stein-Perlman 14 Nov 2024 18:50 UTC
LW: 6 AF: 3
What’s a bad thing that the agent can do with the jailbreak that it can’t just do in the unmonitored scaffold?
Edit: dumb question, oops. Obviously the schemer will have affordances outside the unmonitored scaffold that it doesn’t have within it, and for some of those actions it’s helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.