Not going to respond to everything, sorry, but a few notes:
It fits the pattern of [lower perceived risk] --> [actions that increase risk].
My claim is that for the things you call “actions that increase risk” (which I call “opportunity cost”), this causal arrow is very weak, and so you shouldn’t think of it as risk compensation.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
However, I think people are too ready to fall back on the best reference classes they can find—even when they’re terrible.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
I also don’t really understand the argument in your spoiler box. You’ve listed a bunch of claims about AI, but haven’t spelled out why they should make us expect large risk compensation effects, which I thought was the relevant question.
Quantify “it isn’t especially realistic”—are we talking [15% chance with great effort], or [1% chance with great effort]?
It depends hugely on the specific stronger safety measure you talk about. E.g. I’d be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I’m hesitant around such small probabilities on any social claim.
For things like GSA and ARC’s work, there isn’t a sufficiently precise claim for me to put a probability on.
Is [because we have a bunch of work on weak measures] not a big factor in your view? Or is [isn’t especially realistic] overdetermined, with [less work on weak measures] only helping conditional on removal of other obstacles?
Not a big factor. (I guess it matters that instruction tuning and RLHF exist, but something like that was always going to happen, the question was when.)
This characterization is a little confusing to me: all of these approaches (ARC / Guaranteed Safe AI / Debate) involve identifying problems, and, if possible, solving/mitigating them. To the extent that the problems can be solved, then the approach contributes to [building safe AI systems];
Hmm, then I don’t understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper). You might think that GSA will uncover problems in debate if they exist when using it as a specification, but if anything that seems to me less likely to happen with GSA, since in a GSA approach the specification is treated as infallible.
No worries at all—I was aiming for [Rohin better understands where I’m coming from]. My response was over-long.
E.g. presumably if you believe in this causal arrow you should also believe [higher perceived risk] --> [actions that decrease risk]. But if all building-safe-AI work were to stop today, I think this would have very little effect on how fast the world pushes forward with capabilities.
Agreed, but I think this is too coarse-grained a view. I expect that, absent impressive levels of international coordination, we’re screwed. I’m not expecting [higher perceived risk] --> [actions that decrease risk] to operate successfully on the “move fast and break things” crowd.
I’m considering:
What kinds of people are making/influencing key decisions in worlds where we’re likely to survive?
How do we get those people this influence? (or influential people to acquire these qualities)
What kinds of situation / process increase the probability that these people make risk-reducing decisions?
I think some kind of analysis along these lines makes sense—though clearly it’s hard to know where to draw the line between [it’s unrealistic to expect decision-makers/influencers this principled] and [it’s unrealistic to think things may go well with decision-makers this poor].
I don’t think conditioning on the status-quo free-for-all makes sense, since I don’t think that’s a world where our actions have much influence on our odds of success.
I agree that reference classes are often terrible and a poor guide to the future, but often first-principles reasoning is worse (related: 1, 2).
Agreed (I think your links make good points). However, I’d point out that it can be true both that:
Most first-principles reasoning about x is terrible.
First-principles reasoning is required in order to make any useful prediction of x. (For most x, I don’t think this second claim holds.)
You’ve listed a bunch of claims about AI, but haven’t spelled out why they should make us expect large risk compensation effects
I think almost everything comes down to [perceived level of risk] sometimes dropping hugely more than [actual risk] in the case of AI. So it’s about the magnitude of the input.
We understand AI much less well.
We’ll underestimate a bunch of risks, due to lack of understanding.
We may also over-estimate a bunch, but the errors don’t cancel: being over-cautious around fire doesn’t stop us from drowning.
Certain types of research will address [some risks we understand], but fail to address [some risks we don’t see / underestimate].
They’ll then have a much larger impact on [our perception of risk] than on [actual risk].
Drop in perceived risk is much larger than the drop in actual risk.
In most other situations, this isn’t the case, since we have better understanding and/or adaptive feedback loops to correct risk estimates.
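A minimal toy sketch of the claim above. The numbers and the “risk-taking scales inversely with perceived risk” rule are my own assumptions, purely to show the direction of the effect, not to estimate real magnitudes:

```python
# Hypothetical numbers only: a technique that removes the risks we *see*
# cuts perceived risk far more than actual risk, and if risk-taking
# scales with how safe things look, net risk exposure can go up.

seen = 0.10     # risk we understand and can measure
unseen = 0.30   # risk we underestimate or don't see at all

perceived_before = seen          # perception only tracks the seen component
actual_before = seen + unseen    # 0.40

# The technique removes 80% of the seen risk and none of the unseen risk.
perceived_after = 0.2 * seen          # 0.02 -- a 5x drop in perceived risk
actual_after = 0.2 * seen + unseen    # 0.32 -- only a 20% drop in actual risk

# If willingness to push ahead scales inversely with perceived risk,
# the large perceived-risk drop licenses much more risk-taking, which
# swamps the small drop in per-unit actual risk.
activity = perceived_before / perceived_after   # 5.0
exposure_before = actual_before * 1.0           # 0.40
exposure_after = actual_after * activity        # 1.6 -- net exposure increased

print(exposure_before, exposure_after)
```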
It depends hugely on the specific stronger safety measure you talk about. E.g. I’d be at < 5% on a complete ban on frontier AI R&D (which includes academic research on the topic). Probably I should be < 1%, but I’m hesitant around such small probabilities on any social claim.
That’s useful, thanks. (these numbers don’t seem foolish to me—I think we disagree mainly on [how necessary are the stronger measures] rather than [how likely are they])
Hmm, then I don’t understand why you like GSA more than debate, given that debate can fit in the GSA framework (it would be a level 2 specification by the definitions in the paper).
Oh sorry, I should have been more specific—I’m only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I’m only keen on the framework conditional on meeting an extremely high bar for the specification. If that part gets ignored on the basis that it’s hard (which it obviously is), then it’s not clear to me that the framework is worth much.
I suppose I’m also influenced by the way some of the researchers talk about it—I’m not clear how much focus Davidad is currently putting on level 6/7 specifications, but he seems clear that they’ll be necessary.
I expect that, absent impressive levels of international coordination, we’re screwed.
This is the sort of thing that makes it hard for me to distinguish your argument from “[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we will get doom. Therefore, we should avoid solving some failures, because those failures could help build political will to shut it all down”.
I agree that, conditional on believing that we’re screwed absent huge levels of coordination regardless of technical work, a lot of technical work, including debate, looks net negative by reducing the will to coordinate.
What kinds of people are making/influencing key decisions in worlds where we’re likely to survive?
[...]
I don’t think conditioning on the status-quo free-for-all makes sense, since I don’t think that’s a world where our actions have much influence on our odds of success.
Similarly this only makes sense under a view where technical work can’t have much impact on p(doom) by itself, aka “regardless of technical work we’re screwed”. Otherwise even in a “free-for-all” world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
I’m only keen on specifications that plausibly give real guarantees: level 6(?) or 7. I’m only keen on the framework conditional on meeting an extremely high bar for the specification.
Oh, my probability on level 6 or level 7 specifications becoming the default in AI is dominated by my probability that I’m somehow misunderstanding what they’re supposed to be. (A level 7 spec for AGI seems impossible even in theory, e.g. because it requires solving the halting problem.)
If we ignore the misunderstanding part then I’m at << 1% probability on “we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future”.
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
“[regardless of the technical work you do] there will always be some existentially risky failures left, so if we proceed we get doom...
I’m claiming something more like “[given a realistic degree of technical work on current agendas in the time we have], there will be some existentially risky failures left, so if we proceed we’re highly likely to get doom.” I’ll clarify more below.
Otherwise even in a “free-for-all” world, our actions do influence odds of success, because you can do technical work that people use, and that reduces p(doom).
Sure, but I mostly don’t buy p(doom) reduction here, other than through [highlight near-misses], so an approach that hides symptoms of fundamental problems is probably net negative. In the free-for-all world, I think doom is overdetermined, absent miracles,[1] and [significantly improved debate setup] does not strike me as a likely miracle, even after I condition on [a miracle occurred].
Factors that push in the other direction:
I can imagine techniques that reduce near-term widespread low-stakes failures.
This may be instrumentally positive if e.g. AI is much better for collective sensemaking than it otherwise would be (even if that only amounts to [the negative impact isn’t as severe]).
Similarly, I can imagine such techniques mitigating the near-term impact of [we get what we measure] failures. This too seems instrumentally useful.
I do accept that technical work I’m not too keen on may avoid some early foolish/embarrassing ways to fail catastrophically.
I mostly don’t think this helps significantly, since we’ll consistently hit doom later without a change in strategy.
Nonetheless, [don’t be dead yet] is instrumentally useful if we want more time to change strategy, so avoiding early catastrophe is a plus.
[probably other things along similar lines that I’m missing]
But I suppose that on the [usefulness of debate (/scalable oversight techniques generally) research], I’m mainly thinking: [more clearly understanding how and when this may fail catastrophically, and how we’d robustly predict this] seems positive, whereas [show that versions of this technique get higher scores on some benchmarks] probably doesn’t.
Even if I’m wrong about the latter, the former seems more important. Granted, it also seems harder—but I think that having a bunch of researchers focus on it and fail to come up with any principled case is useful too (at least for them).
If we ignore the misunderstanding part then I’m at << 1% probability on “we build transformative AI using GSA with level 6 / level 7 specifications in the nearish future”.
(I could imagine a pause on frontier AI R&D, except that you are allowed to proceed if you have level 6 / level 7 specifications; and those specifications are used in a few narrow domains. My probability on that is similar to my probability on a pause.)
Agreed. This is why my main hope on this routes through [work on level 6/7 specifications clarifies the depth and severity of the problem] and [more-formally-specified level 6/7 specifications give us something to point to in regulation]. (On level 7, I’m assuming “in all contexts” must be an overstatement; in particular, we only need something like “...in all contexts plausibly reachable from the current state, given that all powerful AIs developed by us or our AIs follow this specification or this-specification-endorsed specifications”.)
Clarifications I’d make on my [doom seems likely, but not inevitable; some technical work seems net negative] position:
If I expected that we had 25 years to get things right, I think I’d be pretty keen on most hands-on technical approaches (debate included).
Quite a bit depends on the type of technical work. I like the kind of work that plausibly has the property [if we iterate on this we’ll probably notice all catastrophic problems before triggering them].
I do think there’s a low-but-non-zero chance of breakthroughs in pretty general technical work. I can’t rule out that ARC theory comes up with something transformational in the next few years (or that it comes from some group that’s outside my current awareness).
I’m not ruling out an [AI assistants help us make meaningful alignment progress] path—I currently think it’s unlikely, not impossible.
However, here I note that there’s a big difference between:
The odds that [solve alignment with AI assistants] would work if optimally managed.
The odds that it works in practice.
I worry that researchers doing technical research tend to have the former in mind (implicitly, subconsciously), i.e. the (implicit) argument is something like “Our work stands a good chance to unlock a winning strategy here”.
But this is not the question—the question is how likely it is to work in practice.
(even conditioning on not-obviously-reckless people being in charge)
It’s guesswork, but on [does a low-risk winning strategy of this form exist (without a huge slowdown)?] I’m perhaps 25%. On [will we actually find and implement such a strategy, even assuming the most reckless people aren’t a factor], I become quite a bit more pessimistic—if I start to say “10%”, I recoil at the implied [40% shot at finding and following a good enough path if one exists].
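Spelling out the arithmetic behind that recoil (same guesswork numbers as above, nothing new):

```python
# Guesswork numbers from above, just making the implied conditional explicit.
p_exists = 0.25    # a low-risk winning strategy of this form exists
p_overall = 0.10   # the tentative "10%": it exists AND we find and follow it

# P(find and follow | exists) = P(overall) / P(exists)
p_follow_given_exists = p_overall / p_exists
print(p_follow_given_exists)   # 0.4 -- the implied "40% shot" if one exists
```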
Of course a lot here depends on whether we can do well enough to fail safely. Even a 5% shot is obviously great if the other 95% is [we realize it’s not working, and pivot].
However, since I don’t see debate-like approaches as plausible in any direct-path-to-alignment sense, I’d like to see a much clearer plan for using such methods as stepping-stones to (stepping stones to...) a solution.
In particular, I’m interested in the case for [if this doesn’t work, we have principled reasons to believe it’ll fail safely] (as an overall process, that is—not on each individual experiment).
When I look at e.g. Buck/Ryan’s outlined iteration process here,[2] I’m not comforted on this point: this has the same structure as [run SGD on passing our evals], only it’s [run researcher iteration on passing our evals]. This is less bad, but still entirely loses the [evals are an independent check on an approach we have principled reasons to think will work] property.
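To make the structural analogy concrete, here is a schematic sketch in my own framing (not a description of Buck/Ryan’s actual proposal; the function names are placeholders):

```python
# Both loops optimize the thing being evaluated against the evals themselves,
# so "passes the evals" stops being an independent check on the approach.

def sgd_style_loop(model, evals, gradient_step):
    while not evals(model):
        model = gradient_step(model)    # updates selected to pass the evals
    return model

def researcher_iteration_loop(approach, evals, tweak):
    while not evals(approach):
        approach = tweak(approach)      # human tweaks selected to pass the evals
    return approach
```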
On some level this kind of loop is unavoidable—but having the “core workflow” of alignment researchers be [tweak the approach, then test it against evals] seems a bit nuts.
Most of the hope here seems to come from [the problem is surprisingly (to me) easy] or [catastrophic failure modes are surprisingly (to me) sparse].
[1] In the sense that [someone proves the Riemann hypothesis this year] would be a miracle.
[2] I note that it’s not clear they’re endorsing this iteration process—it may just be that they expect it to be the process, so that it’s important for people to be thinking in these terms.
Okay, I think it’s pretty clear that the crux between us is basically what I was gesturing at in my first comment, even if there are minor caveats that make it not exactly literally that.
I’m probably not going to engage with perspectives that say all current [alignment work towards building safer future powerful AI systems] is net negative, sorry. In my experience those discussions typically don’t go anywhere useful.
That’s fair. I agree that we’re not likely to resolve much by continuing this discussion. (but thanks for engaging—I do think I understand your position somewhat better now)
What does seem worth considering is adjusting research direction to increase focus on [search for and better understand the most important failure modes] - both of debate-like approaches generally, and any [plan to use such techniques to get useful alignment work done].
I expect that this would lead people to develop clearer, richer models. Presumably this will take months rather than hours, but it seems worth it (whether or not I’m correct—I expect that [the understanding required to clearly demonstrate to me that I’m wrong] would be useful in a bunch of other ways).