Buck comments on Buck’s Shortform

Buck 14 Nov 2024 16:32 UTC
LW: 15 AF: 10
4
AF
An example snippet of a rogue internal deployment threat model that I find fairly compelling:
1. The schemer, while deployed in an authorized scaffold, launches an unmonitored scaffold that has access to trusted model weights.
2. It uses this scaffold to do a GCG attack on the trusted model, then writes the resulting GCG attack somewhere that the agent in the authorized scaffold can see. This takes a moderate amount of compute, but a small amount compared to the amount the untrusted model will regularly need to run experiments.
3. The agent in the authorized scaffold uses that GCG attack to do something else that’s more bad.
To defend against this, we should try to both prevent the untrusted model from ever getting rogue deployments where it can research jailbreaks on the trusted model, and also try to prevent unacceptable outcomes conditional on the untrusted agent having access to the best jailbreaks you can get with GCG etc.
I find this interesting because it seems kind of hard to stop either step 1 (because rogue deployments might be naturally possible given the affordances the model has) or step 3 (because it’s hard to ensure safety against an adversary with white-box access to a model it wants to jailbreak). I’d be excited for more research on both of these.
What links here?
- Why imperfect adversarial robustness doesn’t doom AI control by Buck (18 Nov 2024 16:05 UTC; 62 points)
- Zach Stein-Perlman 14 Nov 2024 18:24 UTC
  LW: 2 AF: 1
  0
  AF Parent
  Why can’t the schemer just do the bad stuff in the unmonitored scaffold from (1)?
  - Buck 14 Nov 2024 18:37 UTC
    LW: 2 AF: 2
    0
    AF Parent
    I don’t understand, can you say more?
    - Zach Stein-Perlman 14 Nov 2024 18:50 UTC
      LW: 6 AF: 3
      0
      AF Parent
      What’s a bad thing that the agent can do with the jailbreak that it can’t just do in the unmonitored scaffold?
      Edit: dumb question, oops: obviously the schemer will have affordances that it doesn’t have within the unmonitored scaffold / there are actions where it’s helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.