What if the overseer just asks itself, “If I came up with the idea for this action myself, how much would I approve of it?” Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn’t the same thing happen if the overseer was just making the decisions itself?
No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get unlucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes really powerful searches are scary for the same reason.)
Would this still be a problem if we were training the agent with SL instead of RL?
You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn’t need it for the outer alignment problem.
If not, what is the motivation for using RL here?
I agree with Will. The point is to be competitive, I don’t see how you could be competitive if you use SL (unless it turns out that RL just doesn’t add any value, in which case I agree we don’t have to worry about RL).
like inner optimizers for “optimizing for worst case”
But you need to solve this problem in order to cope with inner optimizers.
Here it seems like you’re trying to train an agent that is more capable than the overseer in some way, and I’m not entirely sure why that has changed.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble
I don’t quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
So my AI searches over actions to find one for which it expects I’ll conclude “I wasn’t attacked.”
Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.
We could run the same argument with “I want to acquire resources” instead of “I want to be protected from attack”—rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don’t really have any.
how it came to have more understanding than the overseer
We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.
The overseer can also look at that data and try to learn about general patterns etc.. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.
But you need to solve this problem in order to cope with inner optimizers.
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
Ok, this is definitely part of the confusion/miscommunication, as I wouldn’t have guessed this without it being explicit. Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action. (That post says, “Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself.” This seems to clearly imply that Hugh does not reward Arthur just for making decisions that have consequences Hugh would approve of, unless Hugh can anticipate those consequences himself.)
One of your earlier comments in this thread said “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).” So I guess that explains why the overseer in your example is doing something different, but I don’t recall seeing you mention this problem prior to this thread, so it wasn’t on my radar as something that you’re trying to solve. (I’m still not quite sure at this point that it really is a problem or that I correctly understand it. If you have explained it more somewhere, please let me know.)
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
No, it’s also important for getting good behavior from RL.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.
In this problem we are interesting in ensuring that the overseer is able to anticipate good consequences from an action.
If a model trained on historical data could predict good consequences, but your overseer can’t, then you are going to sacrifice competitiveness. That is, your agent won’t be motivated to use its understanding to help you achieve good consequences.
I think the confusion is coming from equivocating between multiple proposals. I’m saying, “We need to solve informed oversight for amplification to be a good training scheme.” You are asking “Why is that a problem?” and I’m trying to explain why this is a necessary component of iterated amplification. In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for “a story about why the model might do something unsafe,” I assumed you were asking for the latter—why would the obvious approach to making it competitive be unsafe. My earlier comment “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable” is explaining why approval-directed agents aren’t competitive by default unless you solve something like this.
(That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don’t use this kind of reasoning you sacrifice competitiveness.)
No, it’s also important for getting good behavior from RL.
Ok.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is, it seems like the internal computations of a RL agent would differ greatly depending on how you train it. For example if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example maybe it would devote a lot more compute into running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is lot more complex and it doesn’t have as much resources, e.g., artificial neurons, left to run the heuristics)?
I think the confusion is coming from equivocating between multiple proposals.
Yes, I think I understand that at this point.
In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive” which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” but if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?
No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get unlucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes really powerful searches are scary for the same reason.)
You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn’t need it for the outer alignment problem.
I agree with Will. The point is to be competitive, I don’t see how you could be competitive if you use SL (unless it turns out that RL just doesn’t add any value, in which case I agree we don’t have to worry about RL).
But you need to solve this problem in order to cope with inner optimizers.
This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.
I don’t quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:
I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.
So my AI searches over actions to find one for which it expects I’ll conclude “I wasn’t attacked.”
Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.
We could run the same argument with “I want to acquire resources” instead of “I want to be protected from attack”—rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don’t really have any.
We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.
The overseer can also look at that data and try to learn about general patterns etc.. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.
Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
Ok, this is definitely part of the confusion/miscommunication, as I wouldn’t have guessed this without it being explicit. Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action. (That post says, “Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself.” This seems to clearly imply that Hugh does not reward Arthur just for making decisions that have consequences Hugh would approve of, unless Hugh can anticipate those consequences himself.)
One of your earlier comments in this thread said “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).” So I guess that explains why the overseer in your example is doing something different, but I don’t recall seeing you mention this problem prior to this thread, so it wasn’t on my radar as something that you’re trying to solve. (I’m still not quite sure at this point that it really is a problem or that I correctly understand it. If you have explained it more somewhere, please let me know.)
No, it’s also important for getting good behavior from RL.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
In this problem we are interesting in ensuring that the overseer is able to anticipate good consequences from an action.
If a model trained on historical data could predict good consequences, but your overseer can’t, then you are going to sacrifice competitiveness. That is, your agent won’t be motivated to use its understanding to help you achieve good consequences.
I think the confusion is coming from equivocating between multiple proposals. I’m saying, “We need to solve informed oversight for amplification to be a good training scheme.” You are asking “Why is that a problem?” and I’m trying to explain why this is a necessary component of iterated amplification. In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for “a story about why the model might do something unsafe,” I assumed you were asking for the latter—why would the obvious approach to making it competitive be unsafe. My earlier comment “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable” is explaining why approval-directed agents aren’t competitive by default unless you solve something like this.
(That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don’t use this kind of reasoning you sacrifice competitiveness.)
Ok.
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is, it seems like the internal computations of a RL agent would differ greatly depending on how you train it. For example if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example maybe it would devote a lot more compute into running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is lot more complex and it doesn’t have as much resources, e.g., artificial neurons, left to run the heuristics)?
Yes, I think I understand that at this point.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive” which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” but if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?