> I think this comment might be more productive if you described why you expect this approach to fail catastrophically when dealing with powerful systems (in a way that doesn’t provide adequate warning). Linking to previous writing on this could be good (maybe this comment of yours on debate/scalable oversight).
Sure, linking to that seems useful, thanks.

That said, I’m expecting that the crux isn’t [can a debate setup work for arbitrarily powerful systems?], but rather e.g. [can it be useful in safe automation of alignment research?].
For something like the latter, it’s not clear to me that it’s not useful. Mainly my pessimism is about:
- Debate seeming not to address the failure modes I’m worried about—e.g. scheming.
- Expecting [systems insufficiently capable to cause catastrophe] not to radically (>10x) boost the most important research on alignment. (Hopefully I’m wrong!)
  - As a result, expecting continued strong pressure to make systems more capable, making [understand when a given oversight approach will fail catastrophically] very important.
- No research I’m aware of seeming likely to tell us when debate would fail catastrophically. (I don’t think the “Future work” section here seems likely to tell us much about catastrophic failure.)
- No research I’m aware of making a principled case for [it’s very unlikely that any dangerous capability could be acquired suddenly]. (I expect such thresholds to be uncommon, but to exist.)
- Seeing no arguments along the lines of [We expect debate to give us clearer red flags than other approaches, and here’s why...] or [We expect debate-derived red flags to be more likely to lead to a safe response, rather than an insufficiently general fix that leaves core problems unaddressed].
  - This is not to say that no such arguments could exist. I’m very interested in the case that could be made here.
Of course little of this is specific to debate. Nor is it clear to me that debate is worse than alternatives in these respects—I just haven’t seen an argument that it’s better (on what assumptions; in which contexts).
I understand that it’s hard to answer the questions I’d want answered. I also expect that working on debate isn’t the way to answer them—so I think it’s fine to say [I currently expect debate to be a safer approach than most because … and hope that research directions x and y will shed more light on this]. But I’m not clear on people’s rationale for the first part—why does it seem safer?
> Debate seeming not to address the failure modes I’m worried about—e.g. scheming.
Why? Is it exploration difficulties, rare failures, or something else?
Absent exploration difficulties (which is a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs—other behaviors would get trained out. And if I had two people much smarter than me debating about a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.
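To make that mechanism concrete, here is a minimal sketch of the kind of low-stakes debate training loop the intuition is about. It is a toy illustration, not the protocol from the paper: `debater_a`, `debater_b`, `judge`, and `update` are hypothetical stand-ins for whatever models and optimizer one would actually use.

```python
# Toy sketch of a low-stakes debate training loop (illustration only; all names are
# hypothetical stand-ins, not components of any particular paper's setup).
import random

def debate_training_step(questions, debater_a, debater_b, judge, update):
    """One round of debate on a batch of questions, reinforcing whichever side the judge picks."""
    for q in questions:
        answer_a, argument_a = debater_a(q)
        answer_b, argument_b = debater_b(q)
        # The judge sees only the question and the two arguments, not ground truth.
        winner = judge(q, argument_a, argument_b)  # returns "a" or "b"
        # Low-stakes logic: reward is applied on every sampled question, so a policy
        # that argues unconvincingly on a noticeable fraction of questions keeps
        # losing reward there, and that behaviour gets trained out.
        update(debater_a, q, reward=1.0 if winner == "a" else 0.0)
        update(debater_b, q, reward=1.0 if winner == "b" else 0.0)

# Trivial stand-ins, just to show the shape of the loop.
if __name__ == "__main__":
    questions = [f"question {i}" for i in range(8)]
    dummy_debater = lambda q: ("some answer", f"an argument about {q}")
    dummy_judge = lambda q, arg_a, arg_b: random.choice(["a", "b"])
    dummy_update = lambda model, q, reward: None  # no-op in this sketch
    debate_training_step(questions, dummy_debater, dummy_debater, dummy_judge, dummy_update)
```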
This intuition + the “no exploration difficulties” assumption + how bad rare failures are can probably be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures on research fields analogous to alignment). (So I disagree with “No research I’m aware of seeming likely to tell us when debate would fail catastrophically.”)
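To gesture at what such a check might look like, here is a rough sketch of estimating the rate of uncaught sabotage from a control-style eval. This is a toy construction for illustration, not an existing eval: the task set, the red-team policy, the oversight check, and the crude rule-of-three-style upper bound are all assumptions of the sketch.

```python
# Toy sketch of estimating uncaught-sabotage rates from a control-style eval
# (illustration only; the task set, red-team policy, and oversight check are hypothetical).
import random

def run_control_eval(tasks, red_team_attempts_sabotage, oversight_catches):
    """Count sabotage attempts and how many slip past the oversight setup."""
    attempted, uncaught = 0, 0
    for task in tasks:
        if red_team_attempts_sabotage(task):
            attempted += 1
            if not oversight_catches(task):
                uncaught += 1
    return attempted, uncaught

def upper_bound_95(failures, n):
    """Crude ~95% upper bound on a failure rate (rule-of-three style; gives 3/n at zero failures)."""
    return 1.0 if n == 0 else (failures + 3) / n

if __name__ == "__main__":
    tasks = [f"research task {i}" for i in range(200)]
    # Hypothetical behaviours: the red team defects rarely; oversight usually catches it.
    red_team = lambda task: random.random() < 0.05
    oversight = lambda task: random.random() < 0.9
    attempted, uncaught = run_control_eval(tasks, red_team, oversight)
    print(f"sabotage attempts: {attempted}, uncaught: {uncaught}")
    print(f"~95% upper bound on uncaught rate per task: {upper_bound_95(uncaught, len(tasks)):.3f}")
```

The relevant number for the rare-failures question is the upper bound rather than the point estimate; with a few hundred tasks you can only rule out per-task failure rates down to roughly the 1–2% level.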