Sure. Imagine you have an AI which was instructed to add some feature or fix some bug in a code base.
That AI writes a bunch of code, runs various tests in a sandbox, and then submits a PR.
We need to review this PR, rate how good it is, and determine if it is actively harmful (e.g. adds a vulnerability).
The human can potentially review the full agent trajectory to help determine what is going on.
Understanding some parts of the code base, the tests the AI added, and what the PR was supposed to do are all important for reviewing it.
(In practice, depending on how secure we think the sandbox is, you’ll also potentially want to review code that the AI is about to run in the sandbox, before it actually runs, in cases where a trusted model thinks the code is suspicious.)
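To make that last step concrete, here is a minimal sketch of the kind of escalation rule being described: a trusted monitor scores agent-written code before it runs, and only sufficiently suspicious code gets paused for human review. This is purely illustrative and not from the discussion itself; the function names, the `suspicion_score` input, and the 0.8 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    run_in_sandbox: bool      # execute the agent's code now?
    escalate_to_human: bool   # pause and ask a human reviewer first?
    reason: str


def triage_sandbox_code(suspicion_score: float,
                        threshold: float = 0.8) -> TriageDecision:
    """Decide whether agent-written code may run in the sandbox.

    suspicion_score: a rating in [0, 1] from a weaker, trusted monitor model.
    threshold: hypothetical cutoff; in practice it would be tuned against the
               human-review budget and how secure the sandbox is assumed to be.
    """
    if suspicion_score >= threshold:
        return TriageDecision(
            run_in_sandbox=False,
            escalate_to_human=True,
            reason="Trusted monitor flagged the code as suspicious.",
        )
    return TriageDecision(
        run_in_sandbox=True,
        escalate_to_human=False,
        reason="Below the suspicion threshold; rely on the sandbox.",
    )
```

The threshold's only job is to keep the human-review branch rare enough to be affordable, which is the oversight-quality-versus-cost trade-off the rest of this exchange is about.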
Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time—to push out the Pareto frontier of oversight quality vs cost.
Does that accurately express the intended message?
No, because I’m not trying to say that the humans understand the subproblem well enough to e.g. know what the best answer to it is; I’m trying to say that they understand the subproblem well enough to know how good the provided answer was.
I wasn’t imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to “understand a subproblem”, so that was useful.
I’ll try again:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ how well the AI’s plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time—to push out the Pareto frontier of oversight quality vs cost.
(… and presumably an unstated piece here is that “understanding how well the AI’s plan/action solves a particular subproblem” might include recursive steps like “here’s a sub-sub-problem, assume the AI’s actions do a decent job solving that one”, where the human might not actually check the sub-sub-problem.)
Does that accurately express the intended message?
Sort of; I don’t totally understand why you want to phrase things in terms of subproblems instead of actions but I think it’s probably equivalent to do so, except that it’s pretty weird to describe an AI as only solving “subproblems”. Like, I think it’s kind of unnatural to describe ChatGPT as “solving many subproblems”; in some sense I guess you can think of all its answers as solutions to subproblems of the “be a good product” problem, but I don’t think that’s a very helpful frame.