Zach Stein-Perlman comments on Buck’s Shortform

Zach Stein-Perlman 14 Nov 2024 18:24 UTC
LW: 2 AF: 1
Why can’t the schemer just do the bad stuff in the unmonitored scaffold from (1)?
- Buck 14 Nov 2024 18:37 UTC
  LW: 2 AF: 2
  I don’t understand, can you say more?
  - Zach Stein-Perlman 14 Nov 2024 18:50 UTC
    LW: 6 AF: 3
    What’s a bad thing that the agent can do with the jailbreak that it can’t just do in the unmonitored scaffold?
    Edit: dumb question, oops. Obviously the schemer will have affordances elsewhere that it doesn’t have within the unmonitored scaffold, and there are actions where it’s helpful to jailbreak the monitor, e.g. submitting a pull request whose code will be read by the monitor. So finding a jailbreak is helpful to the schemer.
    - Satron 14 Nov 2024 19:09 UTC
      1 point
      Are there any existing ways to defend against bad stuff that the agent can do in the unmonitored scaffold?
      - Satron 19 Nov 2024 22:43 UTC
        1 point
        Given the lack of response, should I assume the answer is “no”?