Charbel-Raphaël comments on High-stakes alignment via adversarial training [Redwood Research report]

Charbel-Raphaël 7 May 2022 1:08 UTC
LW: 7 AF: 4
AF
Ah, “The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example.”

Interesting. Changing the in-distribution (3oom) does not influences much the out-distribution (*2)
- paulfchristiano 7 May 2022 19:18 UTC
  LW: 4 AF: 3
  AF Parent
  I think that 3 orders of magnitude is the comparison between “time taken to find a failure by randomly sampling” and “time taken to find a failure if you are deliberately looking using tools.”
  - dmz 8 May 2022 20:25 UTC
    LW: 3 AF: 2
    AF Parent
    I read Oli’s comment as referring to the 2.4% → 0.002% failure rate improvement from filtering.
    - paulfchristiano 8 May 2022 22:25 UTC
      LW: 4 AF: 3
      AF Parent
      Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?
      - dmz 9 May 2022 17:21 UTC
        LW: 3 AF: 2
        AF Parent
        Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)