Is “informed oversight” entirely a subproblem of “optimizing for worst case”? Your original example of art plagiarism made it seem like a very different problem, which might be a significant part of my confusion.
No, it’s also important for getting good behavior from RL.
This is tangential but can you remind me why it’s not a problem as far as competitiveness that your overseer is probably more costly to compute than other people’s reward/evaluation functions?
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
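To make the arithmetic above concrete, here is a minimal sketch under a simple cost model that is an assumption for illustration, not something stated in the discussion: each policy step costs 1 unit, each overseer (reward) evaluation costs `overseer_cost_ratio` units, and the reward is queried on a fraction `query_fraction` of steps. Both function names are hypothetical.

```python
def training_overhead(overseer_cost_ratio: float, query_fraction: float) -> float:
    """Total cost per policy step, relative to running the policy alone.

    Assumed cost model: 1 unit per policy step, plus overseer_cost_ratio
    units on the fraction of steps where the reward is evaluated.
    """
    if not 0.0 <= query_fraction <= 1.0:
        raise ValueError("query_fraction must be in [0, 1]")
    if overseer_cost_ratio < 0.0:
        raise ValueError("overseer_cost_ratio must be non-negative")
    return 1.0 + overseer_cost_ratio * query_fraction


def max_query_fraction(overseer_cost_ratio: float, max_overhead: float) -> float:
    """Largest query fraction that keeps total cost <= max_overhead x policy cost."""
    if max_overhead < 1.0:
        raise ValueError("max_overhead cannot be below the policy's own cost")
    if overseer_cost_ratio <= 0.0:
        return 1.0  # a free overseer can be queried on every step
    return min(1.0, (max_overhead - 1.0) / overseer_cost_ratio)


if __name__ == "__main__":
    # Overseer 10x as expensive as the policy, queried on every step.
    print(training_overhead(10.0, 1.0))   # 11.0
    # Same overseer, queried on 1/10th of steps: oversight costs as much as the policy.
    print(training_overhead(10.0, 0.1))   # 2.0
    # Query rate needed to stay within a 2x total budget.
    print(max_query_fraction(10.0, 2.0))  # 0.1
```

Under this model, querying below 1/10th of steps keeps oversight cost below the cost of running the policy itself, so the total slowdown stays under 2x. Whether learning still works with that few labels is the separate question that semi-supervised RL is meant to address.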
Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.
In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.
If a model trained on historical data could predict good consequences, but your overseer can’t, then you are going to sacrifice competitiveness. That is, your agent won’t be motivated to use its understanding to help you achieve good consequences.
I think the confusion is coming from equivocating between multiple proposals. I’m saying, “We need to solve informed oversight for amplification to be a good training scheme.” You are asking “Why is that a problem?” and I’m trying to explain why this is a necessary component of iterated amplification. In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for “a story about why the model might do something unsafe,” I assumed you were asking for the latter—why would the obvious approach to making it competitive be unsafe. My earlier comment “If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable” is explaining why approval-directed agents aren’t competitive by default unless you solve something like this.
(That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don’t use this kind of reasoning you sacrifice competitiveness.)
No, it’s also important for getting good behavior from RL.
Ok.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is, it seems like the internal computations of an RL agent would differ greatly depending on how you train it. For example if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example maybe it would devote a lot more compute to running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is a lot more complex and it doesn’t have as many resources, e.g., artificial neurons, left to run the heuristics)?
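The search-budget half of this worry can be put as a toy calculation. This is only a sketch under assumed numbers: the fixed per-decision budget, the per-test costs, and the function name `candidates_tested` are all hypothetical, and it does not model the "worse heuristics" half of the concern.

```python
def candidates_tested(search_budget: float, cost_per_test: float) -> int:
    """Number of candidate actions an agent can score within a fixed budget.

    Assumes every candidate is scored by one call to the agent's internal
    model of the objective, at a fixed cost per call.
    """
    if cost_per_test <= 0:
        raise ValueError("cost_per_test must be positive")
    return int(search_budget // cost_per_test)


if __name__ == "__main__":
    budget = 1000.0  # hypothetical compute units available per decision
    # Agent trained on a simple reward: cheap internal model of the reward.
    print(candidates_tested(budget, 1.0))   # 1000
    # Agent trained on a costly overseer: assume its internal model of the
    # overseer is 10x as expensive per test (an illustrative figure).
    print(candidates_tested(budget, 10.0))  # 100
```

In this toy model the number of candidates searched falls in direct proportion to the cost of each test, which is the competitiveness gap the question is asking about.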
I think the confusion is coming from equivocating between multiple proposals.
Yes, I think I understand that at this point.
In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive” which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” but if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?