“[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...”
I’m claiming something more like “[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we’re highly likely to get doom”. I’ll clarify more below.
Otherwise even in a “free-for-all” world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
Sure, but I mostly don’t buy p(doom) reduction here, other than through [highlight near-misses], so an approach that hides symptoms of fundamental problems is probably net negative. In the free-for-all world, I think doom is overdetermined, absent miracles,[1] and [significantly improved debate setup] does not strike me as a likely miracle, even after I condition on [a miracle occurred].
Factors that push in the other direction:
I can imagine techniques that reduce near-term widespread low-stakes failures.
This may be instrumentally positive if e.g. AI is much better for collective sensemaking than it otherwise would be (even if that’s only [the negative impact isn’t as severe]).
Similarly, I can imagine such techniques mitigating the near-term impact of [we get what we measure] failures. This too seems instrumentally useful.
I do accept that technical work I’m not too keen on may avoid some early foolish/embarrassing ways to fail catastrophically.
I mostly don’t think this helps significantly, since we’ll consistently hit doom later without a change in strategy.
Nonetheless, [don’t be dead yet] is instrumentally useful if we want more time to change strategy, so avoiding early catastrophe is a plus.
[probably other things along similar lines that I’m missing]
But I suppose that on the [usefulness of debate (/scalable oversight techniques generally) research], I’m mainly thinking: [more clearly understanding how and when this may fail catastrophically, and how we’d robustly predict this] seems positive, whereas [show that versions of this technique get higher scores on some benchmarks] probably doesn’t.
Even if I’m wrong about the latter, the former seems more important. Granted, it also seems harder—but I think that having a bunch of researchers focus on it and fail to come up with any principled case is useful too (at least for them).
If we ignore the misunderstanding part then I’m at << 1% probability on “we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future”.
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Agreed. This is why my main hope on this routes through [work on level 6/7 specifications clarifies the depth and severity of the problem] and [more-formally-specified 6/7 specifications give us something to point to in regulation]. (On level 7, I’m assuming “in all contexts” must be an overstatement; in particular, we only need something like “...in all contexts plausibly reachable from the current state, given that all powerful AIs developed by us or our AIs follow this specification or this-specification-endorsed specifications”.)
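To be concrete about how I’d read that weakened condition, here is an informal sketch (my own rendering only; s_0, Π_spec, Reach and Safe are placeholder symbols, not notation taken from the GSA / level-7 literature):

```latex
% Informal sketch only, not notation from the GSA papers.
% s_0: the current state.
% \Pi_{\mathrm{spec}}: the joint behaviour of all powerful AIs developed by us or by our AIs,
%   each following this specification or a specification this one endorses.
% Reach(s_0, \Pi_{\mathrm{spec}}): contexts plausibly reachable from s_0 under that assumption.
% Safe(s): context s satisfies the specification's safety property.
\[
  \forall s \in \mathrm{Reach}(s_0, \Pi_{\mathrm{spec}}) : \ \mathrm{Safe}(s)
\]
```

i.e. safety is only required across contexts plausibly reachable under that assumption, rather than the unconditional “in all contexts” reading.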
Clarifications I’d make on my [doom seems likely, but not inevitable; some technical work seems net negative] position:
If I expected that we had 25 years to get things right, I think I’d be pretty keen on most hands-on technical approaches (debate included).
Quite a bit depends on the type of technical work. I like the kind of work that plausibly has the property [if we iterate on this we’ll probably notice all catastrophic problems before triggering them].
I do think there’s a low-but-non-zero chance of breakthroughs in pretty general technical work. I can’t rule out that ARC theory comes up with something transformational in the next few years (or that it comes from some group that’s outside my current awareness).
I’m not ruling out an [AI assistants help us make meaningful alignment progress] path—I currently think it’s unlikely, not impossible.
However, here I note that there’s a big difference between:
The odds that [solve alignment with AI assistants] would work if optimally managed.
The odds that it works in practice.
I worry that researchers doing technical research tend to have the former in mind (implicitly, subconsciously) - i.e. the (implicit) argument is something like “Our work stands a good chance to unlock a winning strategy here”.
But this is not the question—the question is how likely it is to work in practice.
(even conditioning on not-obviously-reckless people being in charge)
It’s guesswork, but on [does a low-risk winning strategy of this form exist (without a huge slowdown)?] I’m perhaps 25%. On [will we actually find and implement such a strategy, even assuming the most reckless people aren’t a factor], I become quite a bit more pessimistic—if I start to say “10%”, I recoil at the implied [40% shot at finding and following a good enough path if one exists]. (I make this arithmetic explicit in a short sketch after this list.)
Of course a lot here depends on whether we can do well enough to fail safely. Even a 5% shot is obviously great if the other 95% is [we realize it’s not working, and pivot].
However, since I don’t see debate-like approaches as plausible in any direct-path-to-alignment sense, I’d like to see a much clearer plan for using such methods as stepping stones to (stepping stones to...) a solution.
In particular, I’m interested in the case for [if this doesn’t work, we have principled reasons to believe it’ll fail safely] (as an overall process, that is—not on each individual experiment).
When I look at e.g. Buck/Ryan’s outlined iteration process here,[2] I’m not comforted on this point: this has the same structure as [run SGD on passing our evals], only it’s [run researcher iteration on passing our evals]. This is less bad, but still entirely loses the [evals are an independent check on an approach we have principled reasons to think will work] property.
On some level this kind of loop is unavoidable—but having the “core workflow” of alignment researchers be [tweak the approach, then test it against evals] seems a bit nuts.
Most of the hope here seems to come from [the problem is surprisingly (to me) easy] or [catastrophic failure modes are surprisingly (to me) sparse].
I note that it’s not clear they’re endorsing this iteration process—it may just be that they expect it to be the process, so that it’s important for people to be thinking in these terms.
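To make the arithmetic behind the “10%”/“40%” remark above explicit, here is a minimal sketch (assuming the clean two-factor decomposition below, which is itself a simplification, and using my rough guesswork numbers):

```latex
% Guesswork numbers from the list above; the decomposition is assumed, not derived.
\[
  P(\text{works in practice})
    = P(\text{low-risk strategy exists}) \times P(\text{we find and follow it} \mid \text{it exists})
    \approx 0.25 \times 0.40 = 0.10
\]
```

On these numbers, saying “10%” overall while putting ~25% on existence commits me to roughly a 40% conditional chance of finding and following such a path, which is the figure I recoil at.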
Okay, I think it’s pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I’m probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don’t go anywhere useful.
That’s fair. I agree that we’re not likely to resolve much by continuing this discussion. (but thanks for engaging—I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and of any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than hours, but it seems worth it (whether or not I’m correct—I expect that [the understanding required to clearly demonstrate to me that I’m wrong] would be useful in a bunch of other ways).
[1] In the sense that [someone proves the Riemann hypothesis this year] would be a miracle.