I want to distinguish two versions of the intervention in multi-step settings: (1) hide the CoT on the suspicious step only, vs. (2) hide all prior CoT. For (2), you'd also either hide all CoT partway through 5% or something of non-suspicious trajectories, so the model can’t infer from CoT-hidden that the trajectory was flagged, or do a control eval with results that demonstrate this is unnecessary.
Conditional on a lab doing a “replace suspicious outputs” thing, I’m over 75% that it would do (1). But I’m under 75% that it would say it’s doing so, much less demonstrate it to a skeptical audience; it might just not publish details, especially if the protocol is only used in the internal deployment context. I’m under 75% that a lab would do (2).
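
To make the difference between (1) and (2) concrete, here's a rough Python sketch. Everything in it is hypothetical and mine: the `Step` record, the `suspicious` flag (standing in for whatever monitor a lab uses), the `[CoT hidden]` placeholder, and the function names. I'm also assuming "hide all prior CoT" means hiding the CoT of the flagged step and every step before it, and that the decoy version picks a uniformly random cut point in an unflagged trajectory. Treat it as an illustration of the distinction, not a description of any lab's protocol.

```python
import random
from dataclasses import dataclass, replace

HIDDEN = "[CoT hidden]"  # hypothetical placeholder shown to the model


@dataclass(frozen=True)
class Step:
    cot: str                  # chain-of-thought for this step
    action: str               # the step's visible output
    suspicious: bool = False  # set by a hypothetical monitor


def hide_suspicious_step(steps):
    """Variant (1): hide CoT only on the steps that were flagged."""
    return [replace(s, cot=HIDDEN) if s.suspicious else s for s in steps]


def hide_all_prior(steps, decoy_rate=0.05, rng=None):
    """Variant (2): hide CoT on the first flagged step and everything before it.

    With probability `decoy_rate`, an unflagged trajectory also gets its CoT
    hidden up to a random cut point, so that seeing hidden CoT is weaker
    evidence that the trajectory was flagged.
    """
    rng = rng or random.Random()
    flagged = [i for i, s in enumerate(steps) if s.suspicious]
    if flagged:
        cut = flagged[0]
    elif steps and rng.random() < decoy_rate:
        cut = rng.randrange(len(steps))  # decoy: random point partway through
    else:
        return list(steps)  # unflagged and not chosen as a decoy: unchanged
    return [replace(s, cot=HIDDEN) if i <= cut else s for i, s in enumerate(steps)]


# Toy trajectory: the third step is the one the monitor flagged.
trajectory = [
    Step("a", "ls"),
    Step("b", "cat notes.txt"),
    Step("c", "curl example.com", suspicious=True),
    Step("d", "echo done"),
]
view_1 = [s.cot for s in hide_suspicious_step(trajectory)]
view_2 = [s.cot for s in hide_all_prior(trajectory)]
```

In this toy trajectory, `view_1` keeps the CoT for the first two steps and hides only the flagged one, while `view_2` hides the first three. The 5% figure only shows up as the default `decoy_rate`; whether decoys are needed at all is the thing a control eval would have to settle.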