No, it’s also important for getting good behavior from RL.
Ok.
This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)
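To make the arithmetic in that claim concrete, here is a minimal sketch. The function names and the "overhead budget" framing are illustrative assumptions, not something from the original posts. It treats one policy evaluation as one unit of compute and asks how much extra compute reward evaluation adds.

```python
def relative_overhead(overseer_cost_ratio: float, eval_fraction: float) -> float:
    """Extra compute spent on reward evaluations, as a multiple of policy compute.

    overseer_cost_ratio: cost of one reward evaluation / cost of one policy evaluation.
    eval_fraction: reward evaluations per policy evaluation (hypothetical knob).
    """
    return overseer_cost_ratio * eval_fraction


def max_eval_fraction(overseer_cost_ratio: float, overhead_budget: float = 1.0) -> float:
    """Largest eval_fraction that keeps the overhead within overhead_budget.

    overhead_budget = 1.0 means "reward evaluation may cost at most as much
    as running the policy itself" (an assumed reading of "sufficiently small").
    """
    return min(1.0, overhead_budget / overseer_cost_ratio)


# Overseer 10x as expensive as the policy, reward evaluated on 1/10 of steps:
overhead_at_tenth = relative_overhead(10.0, 0.1)
# Evaluating the reward on every step instead:
overhead_every_step = relative_overhead(10.0, 1.0)
```

Under these assumptions, evaluating the reward on 1/10 of steps already doubles total compute, which is why the quoted bound is "<1/10th" and not "1/10th".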
Do you have an intuition that semi-supervised RL will be competitive with standard RL?
(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don’t have something that can provide demonstrations, B) it’s cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I’m not sure what you expect RL to offer on top of SL.
Another tangentially related puzzle that I have is, it seems like the internal computations of an RL agent would differ greatly depending on how you train it. For example, if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I’m not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example, maybe it would devote a lot more compute to running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn’t it do worse than the first RL agent because it couldn’t test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is a lot more complex and it doesn’t have as many resources, e.g., artificial neurons, left to run the heuristics)?
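A toy sketch of the "couldn't test as many candidate actions" worry, under loudly stated assumptions: the agent has a fixed internal search budget, each candidate costs a fixed amount to score against its internal model of the objective, and candidate quality is i.i.d. Uniform(0, 1). None of these assumptions come from the discussion itself; they are only there to make the trade-off visible.

```python
def candidates_tested(search_budget: float, cost_per_test: float) -> int:
    """Number of candidate actions the agent can score within its budget."""
    return int(search_budget // cost_per_test)


def expected_best_uniform(n: int) -> float:
    """Expected maximum of n i.i.d. Uniform(0, 1) draws, which is n / (n + 1).

    The uniform distribution is an illustrative assumption; a heavier-tailed
    quality distribution would make the gap between agents larger.
    """
    return n / (n + 1) if n > 0 else 0.0


budget = 1000.0  # hypothetical units of internal compute

# Agent trained on a simple reward: cheap internal reward model.
cheap_tests = candidates_tested(budget, cost_per_test=1.0)
# Agent trained on a 10x-cost overseer: assume its internal model is 10x costlier.
costly_tests = candidates_tested(budget, cost_per_test=10.0)

cheap_quality = expected_best_uniform(cheap_tests)
costly_quality = expected_best_uniform(costly_tests)
```

In this toy the second agent tests 10x fewer candidates and ends up with a lower expected best action. How large the gap is depends entirely on the assumed quality distribution and on how accurate each agent's internal model is, which is the part I'm unsure about.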
I think the confusion is coming from equivocating between multiple proposals.
Yes, I think I understand that at this point.
In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.
In the new Informed oversight post, it seems that you skipped over talking about “why it wouldn’t be competitive” and went directly to “talking about why your model might do something unsafe if you used the obvious remedy to make it competitive”, which confused me because I didn’t know that’s what you were doing. (The post doesn’t seem to contain the word “competitive” at all.)
That aside, can you give an example that illustrates “why it wouldn’t be competitive”?
That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.
I think I’m still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to decide for itself what to do. For example, suppose I have an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of the things I try happens to mislead my detector into returning “no attack” even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.
Earlier you wrote “No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work.” But if you’re saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don’t always understand why those decisions actually work?