Consider the following agent:
Search over all possible plans for one that scores high on <malicious reward function> and looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.
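As a toy sketch of that search (every name below, e.g. `candidate_plans`, `overseer_plausibility`, is a hypothetical stand-in for illustration, not anything concrete from this thread):

```python
# Toy sketch only: all of these callables are hypothetical stand-ins.

def malicious_agent(candidate_plans, malicious_reward, overseer_plausibility,
                    default_explanation, plausibility_threshold=0.9):
    """Pick the plan that best serves the malicious objective, restricted to
    plans whose accompanying explanation the overseer would find unremarkable."""
    acceptable = [
        plan for plan in candidate_plans
        if overseer_plausibility(plan, default_explanation(plan)) >= plausibility_threshold
    ]
    best = max(acceptable, key=malicious_reward)  # raises ValueError if nothing passes
    return best, default_explanation(best)
```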
Giving low scores to confusing actions and asking for an explanation do not save you from the harmful plans that are found by this agent. When the agent is much more powerful than the overseer, this seems very unsafe. When the overseer is more powerful, I am unsure what would happen.
The danger that I perceive is more from the abstract argument that the agent could find bad plans that the overseer would not recognize. You could consider lobbying and spy movies as real-life analogues of this problem.
Search over all possible plans for one that scores high on <malicious reward function> and looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.
But there’s an agent close to this malicious agent in agent space, one which prioritizes “looking good to the overseer” more and “scoring high on <malicious reward function>” less, so if we keep training with RL, we’d move towards that nearby agent and away from the malicious one (since the nearby agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn’t try to score high on <malicious reward function> at all, right?
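As a toy calculation of this intuition (the one-parameter family of agents below is an illustrative simplification, not something from the thread): index agents by the weight w they put on looking good to the overseer, assume overseer approval of the chosen actions rises with w, and run gradient ascent on approval.

```python
# Toy illustration under strong simplifying assumptions: agents are indexed by
# w in [0, 1], the weight on "looking good to the overseer" (the remaining
# 1 - w goes to the malicious objective), and overseer approval is assumed to
# increase monotonically with w.

def overseer_approval(w):
    return w  # assumption: more weight on looking good -> more approval

w, lr = 0.3, 0.1
for _ in range(200):
    # numerical gradient of the training signal (approval) with respect to w
    grad = (overseer_approval(min(w + 1e-3, 1.0)) - overseer_approval(max(w - 1e-3, 0.0))) / 2e-3
    w = min(1.0, w + lr * grad)

print(round(w, 3))  # -> 1.0: training pushes toward agents that only optimize approval
```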
When the agent is much more powerful than the overseer, this seems very unsafe.
I agree with this, but for a different reason. I think if we optimize for “looking good to the overseer” too much, we’d reliably run into safety problems in the overseer’s evaluation process (e.g., triggering bugs in the overseer’s code, or causing distributional shifts that the overseer can’t handle). I feel like that’s a different problem from “informed oversight” though, which I still don’t quite understand.
You could consider lobbying and spy movies as real-life analogues of this problem.
In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer (the politician doesn’t know a lot of details that the lobbyist knows and has less time to think about an issue than the lobbyist; the counterintelligence officer can’t see much of what the spy has seen and done and has to divide attention between lots of people who look suspicious), but in the IDA setting isn’t the overseer seeing everything that the agent is seeing, and doesn’t it also have more computational power than the agent?
But there’s an agent close to this malicious agent in agent space, one which prioritizes “looking good to the overseer” more and “scoring high on <malicious reward function>” less, so if we keep training with RL, we’d move towards that nearby agent and away from the malicious one (since the nearby agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn’t try to score high on <malicious reward function> at all, right?
In theory, yes, but in practice RL does not give us such nice guarantees. (Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.)
I agree with this, but for a different reason.
I agree that it’s a different (also important) reason and that your reason does not motivate informed oversight.
In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer
Agreed, it’s not perfectly analogous, and that’s why I’m unsure what would happen (as opposed to being confident that bad behavior would result).
in the IDA setting isn’t the overseer seeing everything that the agent is seeing, and doesn’t it also have more computational power than the agent?
Yes. The worry is more from the lack of a story for why we will get good outcomes, plus some speculative stories about how we could maybe get bad outcomes (primarily inner optimizers). With informed oversight solved, you could hope to construct an argument that even if an inner optimizer arises, the overseer would be able to tell and so we wouldn’t get bad outcomes.
Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.
I thought inner optimizers were supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe that techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, and transparency) would detect and train the agent out of.
Is “informed oversight” just another name for that problem, or a particular approach to solving it? (If the former, why yet another name? If the latter, how is it different from “transparency”?) I haven’t seen any writing from Paul that says this. Also, the original example that motivated “informed oversight” (the overseer wants to train the AI to create original art but can’t distinguish between original art and plagiarism) seems rather different from the inner optimizer problem and doesn’t seem to constitute a “catastrophe” or a “malign failure”, so I’m still confused.
ETA: The suggestions I gave at the start of this thread (having the overseer give low approval ratings to actions it can’t understand, and having the agent output an explanation along with each action) are supposed to be used alongside “learning with catastrophe” / “optimizing for worst case” and not meant as a replacement for them. I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
I thought inner optimizers were supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe that techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, and transparency) would detect and train the agent out of.
Yes. Inner optimizers should either result in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.
Is “informed oversight” just another name for that problem, or a particular approach to solving it?
Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.
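To make that contrast concrete, here is a minimal toy sketch using the plagiarism example mentioned above; the helper names (`quality`, `agent_source_corpus`, `is_near_copy`) are assumptions for illustration, not an API from the informed-oversight writeups:

```python
# Toy contrast only: the helpers below are hypothetical stand-ins.

def uninformed_reward(artwork, quality):
    # The overseer only sees the output; copied art can look just as good as
    # original art, so plagiarism goes unpenalized.
    return quality(artwork)

def informed_reward(artwork, quality, agent_source_corpus, is_near_copy):
    # The overseer also has access to what the agent knows (here, the works it
    # drew on), so it can detect and penalize plagiarism.
    if any(is_near_copy(artwork, seen) for seen in agent_source_corpus):
        return 0.0
    return quality(artwork)
```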
If the latter, how is it different from “transparency”?
People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.
In the context of this sequence, transparency is relevant for both:
1. Know what the agent knows, in order to evaluate its behavior.
2. Figure out under what conditions the agent would behave differently, to facilitate adversarial training.
For both of those problems, it’s not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it does look like transparency, it seems worth distinguishing the different goals of that research.
So that’s why there is a different name.
I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
(I disagreed with this upthread. I don’t think “convince the overseer that an action is good” obviously incentivizes the right behavior, even if you are allowed to offer an explanation—certainly we don’t have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)