Happy to give more examples; if you haven’t seen this newer post on informed oversight it might be helpful (and if not, I’m interested in understanding where the communication gaps are).
I just read the new post on informed oversight up to and including “Necessity of knowing what the agent knows”. (I saw it before but didn’t feel very motivated to read it, which I later realized was probably because I didn’t understand why it’s an important problem.) The new post uses the example of “an agent trying to defend a computer system from attackers” which does seem more motivating, but I’m still not sure why there’s a problem here. From the post:
The same thing can happen if we observe everything our agent observes, if we aren’t able to understand everything our agent understands. In the security example, literally seeing a sequence of bits moving across an interface gives you almost no information — something can look innocuous, but cause a huge amount of trouble. In order to incentivize our agent to avoid causing trouble, we need to be able to detect any trouble that the agent deliberately causes.
If the overseer sees the agent output an action that the overseer can’t understand the rationale of, why can’t the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later? If this doesn’t work for some reason, why don’t we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?
If the overseer sees the agent output an action that the overseer can’t understand the rationale of, why can’t the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later?
Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.) If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).
If this doesn’t work for some reason, why don’t we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?
Two problems:
Sometimes you need hints that help you see why an action is bad. You can take this proposal all the way to debate, though you are still left with a question about whether debate actually works.
Agents can know things because of complicated regularities in the training data, and hints aren’t enough to expose this to the overseer.
Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.)
Why is the overseer unable to see that X results in good consequences empirically, and give a high approval rating as a result? (When I said “understand” I just meant that the overseer can itself see that the action is good, not that it can necessarily articulate a reason. Similarly the “explanation” from the agent can just be “X is empirically good, look for yourself”.)
I have a guess that the overseer has a disadvantage relative to the agent because the agent has a kind of memory, where it has incorporated information from all past training data, and gets more from each new feedback from the overseer, but the overseer has no memory of past training data and has to start over evaluating each new input/action pair from a fixed state of knowledge. Is this right? (If so, it seems like maybe we can fix it by letting the overseer have access to past training data? Although it seems plausible that wouldn’t work well enough, so if this guess is right, I think I may understand what the problem is.)
Some problems:
If we accept the argument “well it worked, didn’t it?” then we are back to the regime where the agent may know something we don’t (e.g. about why the action wasn’t good even though it looked good).
Relatedly, it’s still not really clear to me what it means to “only accept actions that we understand.” If the agent presents an action that is unacceptable, for reasons the overseer doesn’t understand, how do we penalize it? It’s not like there are some actions for which we understand all consequences and others for which we don’t—any action in practice could have lots of consequences we understand and lots we don’t, and we can’t rule out the existence of consequences we don’t understand.
As you observe, the agent learns facts from the training distribution, and even if the overseer has a memory there is no guarantee that they will be able to use it as effectively as the agent. Being able to look at training data in some way (I expect implicitly) is a reason that informed oversight isn’t obviously impossible, but not a reason to think this is a non-problem.
Relatedly, it’s still not really clear to me what it means to “only accept actions that we understand.” If the agent presents an action that is unacceptable, for reasons the overseer doesn’t understand, how do we penalize it? It’s not like there are some actions for which we understand all consequences and others for which we don’t—any action in practice could have lots of consequences we understand and lots we don’t, and we can’t rule out the existence of consequences we don’t understand.
What if the overseer just asks itself, “If I came up with the idea for this action myself, how much would I approve of it?” Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn’t the same thing happen if the overseer was just making the decisions itself?
ETA: Is the answer that if the overseer was making the decisions itself, there wouldn’t be a risk that the process proposing the actions might deliberately propose actions that have bad consequences that the overseer can’t foresee? Would this still be a problem if we were training the agent with SL instead of RL? If not, what is the motivation for using RL here?
I feel like through this discussion I now understand the problem a little better, but it’s still not nearly as crisp as some of the other problems like “optimizing for worst case”. I think part of it is lack of a clear motivating example (like inner optimizers for “optimizing for worst case”) and part of it is that “informed oversight” is a problem that arises during the distillation step of IDA, but previously that step was described as distilling the overseer down to a faster but less capable agent. Here it seems like you’re trying to train an agent that is more capable than the overseer in some way, and I’m not entirely sure why that has changed.
ETA: Going back to the Informed Oversight article, this part almost makes sense now:
In the security example, literally seeing a sequence of bits moving across an interface gives you almost no information — something can look innocuous, but cause a huge amount of trouble. In order to incentivize our agent to avoid causing trouble, we need to be able to detect any trouble that the agent deliberately causes. Even an apparently mundane gap in our understanding could hide attacks, just as effectively as if we’d been literally unable to observe the agent’s behavior.
I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble, and how it came to have more understanding than the overseer, enough to pick an action that looks good to the overseer but would actually knowingly (to the agent) cause something bad to happen.
What if the overseer just asks itself, “If I came up with the idea for this action myself, how much would I approve of it?” Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn’t the same thing happen if the overseer was just making the decisions itself?
No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get lucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes, really powerful searches are scary for the same reason.)
Would this still be a problem if we were training the agent with SL instead of RL?
You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn’t need it for the outer alignment problem.
If not, what is the motivation for using RL here?
I agree with Will. The point is to be competitive; I don’t see how you could be competitive if you use SL (unless it turns out that RL just doesn’t add any value, in which case I agree we don’t have to worry about RL).
like inner optimizers for “optimizing for worst case”
But you need to solve this problem in order to cope with inner optimizers.
Here it seems like you’re trying to train an agent that is more capable than the overseer in some way, and I’m not entirely sure why that has changed.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble
I don’t quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
So my AI searches over actions to find one for which it expects I’ll conclude “I wasn’t attacked.”
Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.
We could run the same argument with “I want to acquire resources” instead of “I want to be protected from attack”—rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don’t really have any.
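To make this failure mode concrete, here is a minimal toy sketch (the action names and numbers are hypothetical, and this illustrates the incentive structure rather than anyone’s actual training setup): the reward is computed from what the overseer can observe, so undermining detection scores just as well as actually preventing the attack.

```python
# Toy illustration (hypothetical): the reward is "it looks like I wasn't attacked",
# i.e. whatever the overseer's detector reports, not whether an attack occurred.

def simulate(action):
    """Return (attacked, detector_works) after taking `action` in a toy world."""
    if action == "patch_system":
        return False, True      # actually prevents the attack
    if action == "disable_detector":
        return True, False      # attack happens, but detection is undermined
    return True, True           # do nothing: attack happens and is detected

def overseer_reward(attacked, detector_works):
    # The overseer only sees the detector's report, so "no alarm" earns full reward.
    alarm = attacked and detector_works
    return 0.0 if alarm else 1.0

rewards = {a: overseer_reward(*simulate(a))
           for a in ["do_nothing", "patch_system", "disable_detector"]}
print(rewards)  # {'do_nothing': 0.0, 'patch_system': 1.0, 'disable_detector': 1.0}
# The reward signal is indifferent between protecting the overseer and
# undermining detection; nothing in it favors the safe action.
```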
how it came to have more understanding than the overseer
We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.
The overseer can also look at that training data and try to learn about general patterns, etc. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.
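A minimal sketch of the asymmetry being described here, with hypothetical stand-in components: the agent’s parameters are updated across every training episode and can absorb regularities in the data, while the overseer is a fixed procedure that judges each (observation, action) pair on its own.

```python
# Hypothetical sketch: the agent accumulates knowledge across training,
# while the overseer evaluates each pair from a fixed state of knowledge.

class Agent:
    def __init__(self):
        self.params = {}  # stand-in for learned weights; grows with experience

    def act(self, observation):
        return self.params.get(observation, "default_action")

    def update(self, observation, action, reward):
        # Stand-in for a gradient step: keep whatever got rewarded.
        if reward > 0:
            self.params[observation] = action

def overseer_score(observation, action):
    # Stateless: no record of past episodes unless we explicitly provide one.
    return 1.0 if action.startswith("defend") else 0.0

agent = Agent()
history = [("port_scan", "defend_firewall"), ("phishing_email", "defend_filter")]
for observation, tried_action in history:
    reward = overseer_score(observation, tried_action)
    agent.update(observation, tried_action, reward)

print(agent.params)  # {'port_scan': 'defend_firewall', 'phishing_email': 'defend_filter'}
# The agent now "knows" which defense pairs with which observation; the
# overseer's scoring rule is unchanged and captures none of this.
```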
But you need to solve this problem in order to cope with inner optimizers.
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
Ok, this is definitely part of the confusion/miscommunication, as I wouldn’t have guessed this without it being explicit. Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action. (That post says, “Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself.” This seems to clearly imply that Hugh does not reward Arthur just for making decisions that have consequences Hugh would approve of, unless Hugh can anticipate those consequences himself.)
One of your earlier comments in this thread said “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).” So I guess that explains why the overseer in your example is doing something different, but I don’t recall seeing you mention this problem prior to this thread, so it wasn’t on my radar as something that you’re trying to solve. (I’m still not quite sure at this point that it really is a problem or that I correctly understand it. If you have explained it more somewhere, please let me know.)
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
No, it’s also important for getting good behavior from RL.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
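To make the cost accounting concrete, here is a small illustrative calculation (the numbers are assumptions, not measurements): with an overseer that costs roughly 10x a policy step, querying it on only a small fraction of steps keeps the total within a modest constant factor of ordinary RL.

```python
# Illustrative cost accounting; all numbers are assumptions.
policy_step_cost = 1.0     # cost of one policy/environment step
overseer_call_cost = 10.0  # overseer assumed 10x as expensive per evaluation
steps = 1_000_000

for reward_eval_fraction in (1.0, 0.1, 0.01):
    total = steps * policy_step_cost + steps * reward_eval_fraction * overseer_call_cost
    overhead = total / (steps * policy_step_cost)
    print(f"reward evaluated on {reward_eval_fraction:.0%} of steps -> {overhead:.1f}x plain RL")
# reward evaluated on 100% of steps -> 11.0x plain RL
# reward evaluated on 10% of steps -> 2.0x plain RL
# reward evaluated on 1% of steps -> 1.1x plain RL
```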
Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.
In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.
If a model trained on historical data could predict good consequences, but your overseer can’t, then you are going to sacrifice competitiveness. That is, your agent won’t be motivated to use its understanding to help you achieve good consequences.
I think the confusion is coming from equivocating between multiple proposals. I’m saying, “We need to solve informed oversight for amplification to be a good training scheme.” You are asking “Why is that a problem?” and I’m trying to explain why this is a necessary component of iterated amplification. In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for “a story about why the model might do something unsafe,” I assumed you were asking for the latter—why would the obvious approach to making it competitive be unsafe. My earlier comment “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable” is explaining why approval-directed agents aren’t competitive by default unless you solve something like this.
(That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don’t use this kind of reasoning you sacrifice competitiveness.)
No, it’s also important for getting good behavior from RL.
Ok.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is, it seems like the internal computations of an RL agent would differ greatly depending on how you train it. For example if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example maybe it would devote a lot more compute to running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is a lot more complex and it doesn’t have as many resources, e.g., artificial neurons, left to run the heuristics)?
I think the confusion is coming from equivocating between multiple proposals.
Yes, I think I understand that at this point.
In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive” which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” but if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?
I see the motivation as: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to imitate the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.
Would this still be a problem if we were training the agent with SL instead of RL?
Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.
Consider the following agent:
Search over all possible plans for one that scores high on <malicious reward function> and looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.
Giving low scores to confusing actions and asking for an explanation do not save you from the harmful plans that are found by this agent. When the agent is much more powerful than the overseer, this seems very unsafe. When the overseer is more powerful, I am unsure what would happen.
The danger that I perceive is more from the abstract argument that the agent could find bad plans that the overseer would not recognize. You could consider lobbying and spy movies as real-life analogues of this problem.
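Here is a toy sketch of the agent described above (the plan space, <malicious reward function>, and the overseer’s judgment are all hypothetical stand-ins): the search only ever surfaces plans that come packaged with a normal-sounding explanation, so penalizing confusing actions doesn’t filter them out.

```python
# Hypothetical sketch of the adversarial search described above.

PLANS = [
    ("harden_firewall",    "blocks the observed attack vector"),
    ("rotate_credentials", "limits damage from leaked passwords"),
    ("open_backdoor",      "adds a maintenance channel for faster incident response"),
]

def malicious_score(plan):
    # Stand-in for <malicious reward function>: what the agent "really" optimizes.
    return {"harden_firewall": 0.1, "rotate_credentials": 0.2, "open_backdoor": 0.9}[plan]

def overseer_approves(plan, explanation):
    # Stand-in for the overseer's judgment: each explanation sounds routine,
    # and the overseer lacks the knowledge to see why the last plan is bad.
    return any(word in explanation for word in ("attack", "passwords", "response"))

approved = [(p, e) for p, e in PLANS if overseer_approves(p, e)]
plan, explanation = max(approved, key=lambda pe: malicious_score(pe[0]))
print(plan, "|", explanation)
# open_backdoor | adds a maintenance channel for faster incident response
# The output is never "confusing": it always arrives with a normal explanation,
# so low scores for confusing actions don't protect against it.
```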
Search over all possible plans for one that scores high on <malicious reward function> and looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.
But there’s an agent close to this malicious agent in agent space, which prioritizes “looking good to the overseer” more and “scoring high on <malicious reward function>” less, so if we keep training with RL, we’d move towards the former agent and away from the latter agent (since the former agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn’t try to score high on <malicious reward function> at all, right?
When the agent is much more powerful than the overseer, this seems very unsafe.
I agree with this, but for a different reason. I think if we optimize for “looking good to the overseer” too much, we’d reliably trigger safety problems in the overseer’s evaluation process (e.g., triggering bugs in the overseer’s code, or causing distributional shifts that the overseer can’t handle). I feel like that’s a different problem from “informed oversight” though, which I still don’t quite understand.
You could consider lobbying and spy movies as real-life analogues of this problem.
In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer (the politician doesn’t know a lot of details that the lobbyist knows and has less time to think about an issue than the lobbyist; the counterintelligence officer can’t see much of what the spy has seen and done and has to divide attention between lots of people who look suspicious), but in the IDA setting isn’t the overseer seeing everything that the agent is seeing, and also has more computational power than the agent?
But there’s an agent close to this malicious agent in agent space, which prioritizes “looking good to the overseer” more and “scoring high on <malicious reward function>” less, so if we keep training with RL, we’d move towards the former agent and away from the latter agent (since the former agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn’t try to score high on <malicious reward function> at all, right?
In theory, yes, but in practice RL does not give us such nice guarantees. (Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.)
I agree with this, but for a different reason.
I agree that it’s a different (also important) reason and that your reason does not motivate informed oversight.
In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer
Agreed, it’s not perfectly analogous, and that’s why I’m unsure what would happen (as opposed to being confident that bad behavior would result).
in the IDA setting isn’t the overseer seeing everything that the agent is seeing, and also has more computational power than the agent?
Yes. The worry is more from the lack of a story for why we will get good outcomes, plus some speculative stories about how we could maybe get bad outcomes (primarily inner optimizers). With informed oversight solved, you could hope to construct an argument that even if an inner optimizer arises, the overseer would be able to tell and so we wouldn’t get bad outcomes.
Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.
I thought inner optimizers are supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe that techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.
Is “informed oversight” just another name for that problem, or a particular approach to solving it? (If the former, why yet another name? If the latter, how is it different from “transparency”?) I haven’t seen any writing from Paul that says this, and also the original example that motivated “informed oversight” (the overseer wants to train the AI to create original art but can’t distinguish between original art and plagiarism) seems rather different from the inner optimizer problem and wouldn’t seem to constitute a “catastrophe” or a “malign failure”, so I’m still confused.
ETA: The suggestions I gave at the start of this thread (the overseer giving a low approval rating to actions it can’t understand, and having the agent output an explanation along with an action) are supposed to be used alongside “learning with catastrophe” / “optimizing for worst case” and not meant as a replacement for them. I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
I thought inner optimizers are supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe that techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.
Yes. Inner optimizers should either result in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.
Is “informed oversight” just another name for that problem, or a particular approach to solving it?
Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.
If the latter, how is it different from “transparency”?
People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.
In the context of this sequence, transparency is relevant for both:
Know what the agent knows, in order to evaluate its behavior.
Figure out under what conditions the agent would behave differently, to facilitate adversarial training.
For both of those problems, it’s not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it does look like transparency, it seems worth distinguishing the different goals of that research.
So that’s why there is a different name.
I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
(I disagreed with this upthread. I don’t think “convince the overseer that an action is good” obviously incentivizes the right behavior, even if you are allowed to offer an explanation—certainly we don’t have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)