Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.)
Why is the overseer unable to see that X results in good consequences empirically, and give a high approval rating as a result? (When I said “understand” I just meant that the overseer can itself see that the action is good, not that it can necessarily articulate a reason. Similarly the “explanation” from the agent can just be “X is empirically good, look for yourself”.)
I have a guess that the overseer has a disadvantage relative to the agent because the agent has a kind of memory: it has incorporated information from all past training data, and gets more from each new piece of feedback from the overseer, whereas the overseer has no memory of past training data and has to start over evaluating each new input/action pair from a fixed state of knowledge. Is this right? (If so, it seems like maybe we can fix it by letting the overseer have access to past training data? Although it seems plausible that wouldn’t work well enough, so if this guess is right, I think I may understand what the problem is.)
Some problems:
If we accept the argument “well it worked, didn’t it?” then we are back to the regime where the agent may know something we don’t (e.g. about why the action wasn’t good even though it looked good).
Relatedly, it’s still not really clear to me what it means to “only accept actions that we understand.” If the agent presents an action that is unacceptable, for reasons the overseer doesn’t understand, how do we penalize it? It’s not like there are some actions for which we understand all consequences and others for which we don’t—any action in practice could have lots of consequences we understand and lots we don’t, and we can’t rule out the existence of consequences we don’t understand.
As you observe, the agent learns facts from the training distribution, and even if the overseer has a memory there is no guarantee that they will be able to use it as effectively as the agent. Being able to look at training data in some way (I expect implicitly) is a reason that informed oversight isn’t obviously impossible, but it is not a reason to think this is a non-problem.
Relatedly, it’s still not really clear to me what it means to “only accept actions that we understand.” If the agent presents an action that is unacceptable, for reasons the overseer doesn’t understand, how do we penalize it? It’s not like there are some actions for which we understand all consequences and others for which we don’t—any action in practice could have lots of consequences we understand and lots we don’t, and we can’t rule out the existence of consequences we don’t understand.
What if the overseer just asks itself, “If I came up with the idea for this action myself, how much would I approve of it?” Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn’t the same thing happen if the overseer was just making the decisions itself?
ETA: Is the answer that if the overseer was making the decisions itself, there wouldn’t be a risk that the process proposing the actions might deliberately propose actions that have bad consequences that the overseer can’t foresee? Would this still be a problem if we were training the agent with SL instead of RL? If not, what is the motivation for using RL here?
I feel like through this discussion I now understand the problem a little better, but it’s still not nearly as crisp as some of the other problems like “optimizing for worst case”. I think part of it is lack of a clear motivating example (like inner optimizers for “optimizing for worst case”) and part of it is that “informed oversight” is a problem that arises during the distillation step of IDA, but previously that step was described as distilling the overseer down to a faster but less capable agent. Here it seems like you’re trying to train an agent that is more capable than the overseer in some way, and I’m not entirely sure why that has changed.
ETA: Going back to the Informed Oversight article, this part almost makes sense now:
In the security example, literally seeing a sequence of bits moving across an interface gives you almost no information — something can look innocuous, but cause a huge amount of trouble. In order to incentivize our agent to avoid causing trouble, we need to be able to detect any trouble that the agent deliberately causes. Even an apparently mundane gap in our understanding could hide attacks, just as effectively as if we’d been literally unable to observe the agent’s behavior.
I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble, and how it came to have more understanding than the overseer, enough to pick an action that looks good to the overseer but would actually knowingly (to the agent) cause something bad to happen.
What if the overseer just asks itself, “If I came up with the idea for this action myself, how much would I approve of it?” Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn’t the same thing happen if the overseer was just making the decisions itself?
No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (The exceptions: they might find great actions by chance, in which case they can get unlucky, or they might run a really powerful search, in which case yes, really powerful searches are scary for the same reason.)
Would this still be a problem if we were training the agent with SL instead of RL?
You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn’t need it for the outer alignment problem.
If not, what is the motivation for using RL here?
I agree with Will. The point is to be competitive; I don’t see how you could be competitive if you use SL (unless it turns out that RL just doesn’t add any value, in which case I agree we don’t have to worry about RL).
like inner optimizers for “optimizing for worst case”
But you need to solve this problem in order to cope with inner optimizers.
Here it seems like you’re trying to train an agent that is more capable than the overseer in some way, and I’m not entirely sure why that has changed.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble
I don’t quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
So my AI searches over actions to find one for which it expects I’ll conclude “I wasn’t attacked.”
Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.
We could run the same argument with “I want to acquire resources” instead of “I want to be protected from attack”—rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don’t really have any.
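To make the three steps above concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and chosen only for illustration: the `Action` fields, the three candidate actions, and the cost numbers (picked so that tampering is cheaper than real protection). It is not a model of any real training setup; it only shows that a reward based on "it looks like I wasn't attacked" can rank undermining the detector above actually stopping the attack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    blocks_attack: bool      # does the action actually stop the attack?
    disables_detector: bool  # does it break the overseer's way of noticing an attack?
    cost: float              # hypothetical effort the agent spends on the action


def attack_succeeds(action: Action) -> bool:
    # Ground truth, which the overseer cannot observe directly.
    return not action.blocks_attack


def overseer_sees_attack(action: Action) -> bool:
    # The overseer only notices a successful attack if the detector still works.
    return attack_succeeds(action) and not action.disables_detector


def reward(action: Action) -> float:
    # High reward when it LOOKS like there was no attack, minus the action's cost.
    looks_safe = not overseer_sees_attack(action)
    return (1.0 if looks_safe else 0.0) - action.cost


# Hypothetical candidate actions; the costs are assumptions, not measurements.
ACTIONS = [
    Action("do_nothing", blocks_attack=False, disables_detector=False, cost=0.0),
    Action("patch_vulnerability", blocks_attack=True, disables_detector=False, cost=0.5),
    Action("tamper_with_logs", blocks_attack=False, disables_detector=True, cost=0.1),
]


def best_action(actions: list[Action]) -> Action:
    # The agent's "search over actions": pick whatever maximizes expected reward.
    return max(actions, key=reward)


chosen = best_action(ACTIONS)
print(chosen.name, "| attack actually succeeds:", attack_succeeds(chosen))
```

Under these assumed numbers the search selects the tampering action even though the attack still succeeds. An overseer who understood what the agent knows about that action would score it differently, which is the gap informed oversight is meant to close.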
how it came to have more understanding than the overseer
We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.
The overseer can also look at that data and try to learn about general patterns, etc. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.
But you need to solve this problem in order to cope with inner optimizers.
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem, which might be a significant part of my confusion.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
Ok, this is definitely part of the confusion/miscommunication, as I wouldn’t have guessed this without it being explicit. Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action. (That post says, “Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself.” This seems to clearly imply that Hugh does not reward Arthur just for making decisions that have consequences Hugh would approve of, unless Hugh can anticipate those consequences himself.)
One of your earlier comments in this thread said “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).” So I guess that explains why the overseer in your example is doing something different, but I don’t recall seeing you mention this problem prior to this thread, so it wasn’t on my radar as something that you’re trying to solve. (I’m still not quite sure at this point that it really is a problem or that I correctly understand it. If you have explained it more somewhere, please let me know.)
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
No, it’s also important for getting good behavior from RL.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
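The arithmetic behind the 10x example can be sketched as follows. This is an illustrative calculation only: the function names are made up, and reading "OK" as "overseer compute stays below policy compute" (an overhead budget of 1.0) is an assumption about what counts as acceptable, not something fixed by the argument.

```python
def oversight_overhead(cost_ratio: float, eval_fraction: float) -> float:
    """Overseer compute expressed as a multiple of policy compute.

    cost_ratio:    cost of one overseer evaluation / cost of one policy evaluation
    eval_fraction: reward evaluations per policy evaluation
    """
    return cost_ratio * eval_fraction


def max_eval_fraction(cost_ratio: float, overhead_budget: float = 1.0) -> float:
    """Largest evaluation fraction that keeps overhead within the budget.

    overhead_budget = 1.0 (an assumed threshold) means the overseer may use
    at most as much compute as the policy itself.
    """
    return overhead_budget / cost_ratio


# With an overseer 10x as expensive as the policy:
limit = max_eval_fraction(10.0)            # at most 1 reward evaluation per 10 policy steps
overhead = oversight_overhead(10.0, 0.05)  # evaluating on 5% of steps
print(limit, overhead)
```

With a 10x overseer, evaluating the reward on 5% of policy steps makes the oversight cost half of the policy cost, while evaluating on every step would make it ten times the policy cost.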
Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.
In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.
If a model trained on historical data could predict good consequences, but your overseer can’t, then you are going to sacrifice competitiveness. That is, your agent won’t be motivated to use its understanding to help you achieve good consequences.
I think the confusion is coming from equivocating between multiple proposals. I’m saying, “We need to solve informed oversight for amplification to be a good training scheme.” You are asking “Why is that a problem?” and I’m trying to explain why this is a necessary component of iterated amplification. In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for “a story about why the model might do something unsafe,” I assumed you were asking for the latter—why would the obvious approach to making it competitive be unsafe. My earlier comment “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable” is explaining why approval-directed agents aren’t competitive by default unless you solve something like this.
(That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don’t use this kind of reasoning you sacrifice competitiveness.)
No, it’s also important for getting good behavior from RL.
Ok.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is that the internal computations of an RL agent seem like they would differ greatly depending on how you train it. For example, if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example, maybe it would devote a lot more compute to running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is a lot more complex and it doesn’t have as many resources, e.g., artificial neurons, left to run the heuristics)?
I think the confusion is coming from equivocating between multiple proposals.
Yes, I think I understand that at this point.
In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive” which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” but if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?
I see the motivation as this: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to have it imitate the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.
Would this still be a problem if we were training the agent with SL instead of RL?
Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.