Fabien Roger comments on On scalable oversight with weak LLMs judging strong LLMs

Fabien Roger 12 Jul 2024 17:02 UTC
LW: 3 AF: 3
- Debate seeming not to address the failure modes I’m worried about—e.g. scheming.
Why? Is it exploration difficulties, rare failures, or something else?
Absent exploration difficulties (which are a big deal for some tasks, but not all tasks), my intuition is that debate is probably low-stakes adequate against slightly-smarter-than-human schemers. Absent exploration difficulties, even schemers have to try to be as convincing as they can on most inputs, because other behaviors would get trained out. And if I had two people much smarter than me debating a technical topic, then, with enough training as a judge, I feel like I would get a much better judgment than if I just tried to reason about that topic myself.
This intuition, the “no exploration difficulties” assumption, and the question of how bad rare failures are can probably all be checked with things like control evals (e.g. training AIs to make research fail despite our countermeasures, in research fields analogous to alignment). (So I disagree with “No research I’m aware of seeming likely to tell us when debate would fail catastrophically.”)