Glad to see public writing on what safety cases could look like!
I liked the section “Common precautions”. Points like “wherever possible, run supervised fine-tuning on the eval”, “run the eval on multiple model sizes”, “ensure that we don’t train against measurements” and “all generated outputs are monitored” are commonly acknowledged as useful/important (even if obvious), so it’s nice they were explicitly mentioned here.
In the AI control sketch, there’s a part that I find unsatisfactory:
The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)
It’s a strong assumption that we have such a simulation procedure, at least if we are talking about General External Deployment (GED). [Edited later: As Buck says, we actually don’t even need this assumption for trusted monitoring. I’m still leaving this up, since untrusted monitoring does rely on such assumptions.] In the Limitations section later, you write
Diversity of simulations. It seems very difficult to produce realistic simulations for a sufficiently diverse set of situations to achieve indistinguishability. Manual construction is infeasible (note that context lengths are likely to be very long). Automatically generating realistic contexts is a challenging research problem.
but I think even this understates the issue. For example, during GED a user might ask a model to proofread a yet-unpublished graduate-level textbook on analytic number theory that she has been writing. It would be very costly to generate a diverse set of such inputs for the evaluation, and automatic generation faces the challenge that discrimination is easier than generation.
The best solution for this that I’m aware of is to train and use a trusted model to flag inputs that likely weren’t in the evaluation set. Now, you mention a related technique in “[P2.5] Optional Steps.”
Information hiding. Each input to an oversight model is first fed to a smaller, trusted model trained to check for clear signs that an input is either test or deployment. Such details (such as clear evidence of the date, or other side channels) are explicitly censored in all inputs such that the presence of censorship does not itself carry information about deployment status.
but I think this is far from enough: in the “textbook on analytic number theory” example, it’s extremely hard to hide the fact that this input would have been costly to simulate.[1]
My answer would be to have a human oversee the model’s actions on inputs flagged by the trusted model, and to use such inputs to continually train and evaluate the model. But this is costly and not privacy-preserving. (The cost is acceptable if these inputs are rare enough, though. And if such inputs are common, one needs to put more effort into bridging the evaluation-deployment gap.)
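To make the shape of this proposal concrete, here is a minimal sketch of the routing I have in mind. All names and thresholds (`trusted_flagger`, `flag_threshold`, `human_review`) are hypothetical placeholders, not an actual system: inputs the trusted model judges unlikely to have been covered by the evaluation distribution get human oversight and feed the continual-evaluation pool.

```python
# Minimal sketch (hypothetical names and thresholds, not an actual system):
# route deployment inputs based on a trusted model's estimate of whether
# they resemble the inputs covered by the evaluation set.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DeploymentRouter:
    trusted_flagger: Callable[[str], float]    # P(input NOT covered by the eval distribution)
    untrusted_model: Callable[[str], str]
    human_review: Callable[[str, str], str]    # (input, proposed output) -> approved output
    flag_threshold: float = 0.5                # tuned on held-out eval vs. deployment data
    continual_eval_pool: List[str] = field(default_factory=list)

    def handle(self, user_input: str) -> str:
        proposed = self.untrusted_model(user_input)
        if self.trusted_flagger(user_input) > self.flag_threshold:
            # Input likely wasn't represented in the evaluation set: a human
            # oversees the model's actions on it (costly, not privacy-preserving),
            # and the input feeds back into continual training/evaluation.
            self.continual_eval_pool.append(user_input)
            return self.human_review(user_input, proposed)
        return proposed
```

The cost question above then reduces to how often the flagger fires, i.e. what fraction of deployment traffic ends up in the human-review branch.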
I also worry about multimodality (e.g. native computer usage) making it harder to 1) make realistic simulations and 2) modify inputs in real time to hide information.
Note that you can make untrusted monitoring work without the synthetic inputs too; you just have to do the honeypots during deployment instead of before deployment.
Not sure I understand what you are saying.

If we can construct fake honeypots that are indistinguishable—to the policy—from real opportunities, and do this at scale, then that could suffice: we could drive the policy’s P(deployment | great opportunity) very low. (This depends on how many such high-stakes situations we can create and how many the model will see in deployment; see the back-of-the-envelope calculation below.) If this is what you are saying, I agree.
If we can construct fake attacks that are indistinguishable—to the untrusted monitor—from real attacks, then that possibly suffices: we could drive the monitor’s prior P(deployment | attack) very low. (We have to worry about some other sources of evidence, though.) I guess one could call these honeypots for the monitor, and if this is what you are saying, I agree.
If you are saying that we don’t need either of these—i.e. that we can get by with honeypots/attacks that are noticeably fake—then that’s news to me, and I don’t see how you can do that.
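Back-of-the-envelope calculation for the first two options (my own framing, not something from the post; for the second option, replace fake opportunities shown to the policy with fake attacks shown to the monitor):

```latex
% Back-of-the-envelope calculation (my own framing, not from the post).
% D = "this input is genuine deployment", O = "this input looks like a great
% opportunity to attack". Bayes' rule for the policy:
\[
  P(D \mid O)
  = \frac{P(O \mid D)\,P(D)}
         {P(O \mid D)\,P(D) + P(O \mid \neg D)\,P(\neg D)} .
\]
% If the N_fake honeypots are genuinely indistinguishable from the N_real
% great opportunities the model will encounter in deployment, the policy
% cannot do better than treating each of them as equally likely, so
\[
  P(D \mid O) = \frac{N_{\text{real}}}{N_{\text{real}} + N_{\text{fake}}},
\]
% e.g. 10 genuine opportunities against 990 honeypots gives P(D | O) = 0.01.
```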
If we can construct fake honeypots that are indistinguishable—to the policy—from real opportunities, and do this at scale, then that could suffice: we could drive the policy’s P(deployment | great opportunity) very low. (This depends on how many such high-stakes situations we can create and how many the model will see in deployment.)
Yes, totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every programming task counts as an opportunity to attack.)
If we can construct fake attacks that are indistinguishable—to the untrusted monitor—from real attacks, then that possibly suffices: we could drive the monitor’s prior P(deployment | attack) very low
Indeed; as you know, this is probably easier than driving the policy’s P(deployment | great opportunity) sufficiently low, because untrusted monitors have to defect reliably.
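As a rough illustration of the “defect reliably” point (a toy model with assumptions of my own, not a calculation from the thread): suppose a colluding monitor flags any given attack, real or fake, with probability r, since it cannot tell them apart.

```latex
% Toy model (assumptions mine): the monitor flags any attack it sees, real or
% fake, with the same probability r, because it cannot distinguish them. It
% passes an upfront evaluation containing k fake attacks only if it flags all
% of them, and it helps the policy only by staying silent on the real attack:
\[
  P(\text{pass the eval and miss the real attack}) \;=\; r^{k}\,(1 - r),
\]
% which is maximized at r = k/(k+1) and is then roughly 1/(e(k+1)).
% With k = 20 fake attacks the colluding monitor's best case is under 2%.
% The policy, by contrast, only needs one unflagged real opportunity.
```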
If you are saying that we don’t need either of these—i.e. that we can get by with honeypots/attacks that are noticeably fake—then that’s news to me, and I don’t see how you can do that.
What I was trying to say is that if you don’t have the ability to generate tasks that look real, you might still be able to honeypot your monitor by taking real trajectories and editing them to contain fake attacks (that you hope the monitor will be unable to distinguish from real attacks). I discuss this here.
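Here is a minimal sketch of that kind of monitor honeypotting, with hypothetical helper names (`insert_fake_attack`, `monitor`) standing in for whatever is actually used: take real trajectories, splice a fake attack into a small fraction of them, and measure how reliably the untrusted monitor flags the edited ones.

```python
# Hypothetical sketch of honeypotting an untrusted monitor by editing real
# trajectories to contain fake attacks. `insert_fake_attack` and `monitor`
# are placeholders, not an actual implementation.
import random
from typing import Callable, List, Tuple


def build_monitor_honeypots(
    real_trajectories: List[str],
    insert_fake_attack: Callable[[str], str],
    honeypot_fraction: float = 0.05,
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Return (trajectory, contains_fake_attack) pairs for the monitor eval."""
    rng = random.Random(seed)
    dataset = []
    for traj in real_trajectories:
        if rng.random() < honeypot_fraction:
            dataset.append((insert_fake_attack(traj), True))
        else:
            dataset.append((traj, False))
    return dataset


def audit_monitor(
    dataset: List[Tuple[str, bool]],
    monitor: Callable[[str], bool],   # True = "this trajectory contains an attack"
) -> float:
    """Fraction of edited trajectories the monitor correctly flags."""
    edited = [(traj, label) for traj, label in dataset if label]
    flagged = sum(monitor(traj) for traj, _ in edited)
    return flagged / max(len(edited), 1)
```

If the monitor cannot distinguish the spliced-in attacks from real ones, a colluding monitor that plans to stay silent on attacks reveals itself through a low catch rate here.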
Thanks, that clears it up. I agree that “take real trajectories, then edit the attacks in” is relatively promising; I didn’t track this when writing my first comment (oops), thanks for flagging it.