If the untrusted model has a private and unaudited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced it.
Your assumption that “the untrusted model is stateless between queries and only sees the command history and system state” does cover this, but I wanted to call out that it might not hold if current architectures are indicative of the future direction of development.
It would be crazy to hide that CoT from the monitor.
What if model CoTs become less transparent? I think o1 shows shadows of this: often in the CoT summaries there’ll be a couple of lines in a completely different language, which makes me think the actual CoT looks weirder than the reasoning CoTs of previous models. That could be harder for either weaker or different models to monitor reliably. Do you expect the divergence to not be large enough to pose a problem, or that there are other ways to address that, or “yeah you shouldn’t do RL on your model CoTs, that would be crazy”?
I also wonder what Buck thinks about CoTs becoming less transparent (especially in light of recent o1 developments).
It would be crazy to show the untrusted model the old CoT when resampling.
Yep! You might have to show it CoT from previous, non-resampled steps, though.
It sure would.
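For concreteness, here is a minimal sketch of the context-building that the “show it CoT from previous, non-resampled steps” point implies: CoT from earlier steps stays visible, while the flagged step’s original command and CoT are dropped. The `Step` record and `build_resample_context` helper are hypothetical illustrations, not from any published protocol.

```python
from dataclasses import dataclass

@dataclass
class Step:
    command: str           # command the untrusted model issued at this step
    cot: str               # its private chain of thought for this step
    flagged: bool = False  # whether the monitor flagged this step as suspicious

def build_resample_context(history: list[Step]) -> list[dict]:
    """Context shown to the untrusted model when resampling the last, flagged step.

    CoT from earlier, non-resampled steps is kept (the model may need it to stay
    coherent), but the flagged step's original command and CoT are dropped, so the
    model gets no hint of what it tried before the auditor replaced it.
    """
    *prior, flagged_step = history
    assert flagged_step.flagged, "only resample steps the monitor flagged"
    # Earlier steps: command and CoT both visible. The flagged step contributes
    # nothing; the model simply sees it is being asked to produce this step again.
    return [{"command": s.command, "cot": s.cot} for s in prior]
```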
Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a “replace suspicious outputs” flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT only on the suspicious step, vs. (2) hide all prior CoT (and also hide all CoT partway through some fraction, say 5%, of non-suspicious trajectories so the model can’t infer from the CoT being hidden that the trajectory was flagged, or run a control eval with results that demonstrate this is unnecessary).
I’m over 75% that a lab would do (1), conditional on doing a “replace suspicious outputs” thing (but under 75% they’d say they’re doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is just in the internal deployment context). I’m under 75% that a lab would do (2).
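And a matching sketch of version (2), reusing the hypothetical `Step` record from the earlier sketch: all prior CoT is hidden on flagged trajectories, and CoT is also hidden from a random step onward in roughly 5% of non-flagged trajectories so the hiding itself doesn’t reveal the flag. The decoy rate and helper name are illustrative assumptions, not anything a lab has published.

```python
import random

DECOY_HIDE_RATE = 0.05  # fraction of non-flagged trajectories that also get CoT hidden

def build_context_v2(history: list[Step], trajectory_flagged: bool,
                     rng: random.Random) -> list[dict]:
    """Version (2): hide all prior CoT once a trajectory has a flagged step.

    To stop "my CoT disappeared" from leaking the flag, CoT is also hidden from a
    random step onward in ~5% of non-flagged trajectories (alternatively, a control
    eval could be used to argue this decoy hiding is unnecessary).
    """
    if trajectory_flagged:
        cutoff = 0                            # hide CoT for every prior step
    elif rng.random() < DECOY_HIDE_RATE:
        cutoff = rng.randrange(len(history))  # start hiding partway through
    else:
        cutoff = len(history)                 # hide nothing
    return [
        {"command": s.command, "cot": s.cot} if i < cutoff else {"command": s.command}
        for i, s in enumerate(history)
    ]
```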