Inner optimizers are a particular instantiation of this concern, namely that RL does not find the best possible agent.
I thought inner optimizers are supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe; techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, and transparency) would then detect such failures and train the agent out of them.
Is “informed oversight” just another name for that problem, or a particular approach to solving it? (If the former, why yet another name? If the latter, how is it different from “transparency”?) I haven’t seen any writing from Paul that says this, and also the original example that motivated “informed oversight” (the overseer wants to train the AI to create original art but can’t distinguish between original art and plagiarism) seems rather different from the inner optimizer problem and wouldn’t seem to constitute a “catastrophe” or a “malign failure”, so I’m still confused.
ETA: The suggestions I gave at the start of this thread (having the overseer give a low approval rating to actions it can’t understand, and having the agent output an explanation along with each action) are meant to be used alongside “learning with catastrophe” / “optimizing for worst case”, not as a replacement for them. I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
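(For concreteness, here is a minimal sketch of what that reward rule could look like. The `Proposal`, `Overseer`, `understands`, and `evaluate` names and the toy scoring are hypothetical stand-ins, not anyone’s actual design.)

```python
# Schematic sketch only: the overseer scores (action, explanation) pairs and
# gives low approval to anything it cannot understand. Every component here is
# a toy stand-in for a much harder-to-build piece.
from dataclasses import dataclass


@dataclass
class Proposal:
    action: str
    explanation: str  # the agent must justify its action alongside producing it


class Overseer:
    def understands(self, proposal: Proposal) -> bool:
        # Placeholder: deciding whether the explanation really lets the
        # overseer follow the action is the hard part in practice.
        return bool(proposal.explanation)

    def evaluate(self, proposal: Proposal) -> float:
        # The rule from the comment above: actions the overseer cannot
        # understand get low approval, however good they might actually be.
        if not self.understands(proposal):
            return 0.0
        return 1.0  # stub: score the understood action on its merits


def approval_reward(overseer: Overseer, proposal: Proposal) -> float:
    """Training signal for the agent: the overseer's approval of (action, explanation)."""
    return overseer.evaluate(proposal)


if __name__ == "__main__":
    overseer = Overseer()
    print(approval_reward(overseer, Proposal("apply patch", "closes a known vulnerability")))  # 1.0
    print(approval_reward(overseer, Proposal("reroute all traffic", "")))  # 0.0, no explanation
```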
I thought inner optimizers are supposed to be handled under “learning with catastrophe” / “optimizing for worst case”. In particular, inner optimizers would cause “malign” failures, which would constitute a catastrophe; techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, and transparency) would then detect such failures and train the agent out of them.
Yes. Inner optimizers should either result in low performance on the training distribution (in which case we have some hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.
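(As a toy picture of the second case and of the “train them out” step, here is a schematic adversarial-training loop. `toy_model`, `is_unacceptable`, and `propose_input` are invented placeholders, not a real training stack.)

```python
# Schematic sketch: an adversary searches for inputs on which the model behaves
# unacceptably; any such inputs would then be used to penalize or retrain the model.
import random


def toy_model(x: str) -> str:
    # Toy "inner optimizer": looks fine on ordinary inputs, misbehaves on a
    # rare trigger that the training distribution may never surface.
    return "defect" if "trigger" in x else "cooperate"


def is_unacceptable(behavior: str) -> bool:
    # Stand-in for the (hard) check that behavior constitutes a catastrophe.
    return behavior == "defect"


def propose_input(rng: random.Random) -> str:
    # Stand-in for an adversary (red team, search, verification) hunting for
    # inputs that expose unacceptable behavior.
    return rng.choice(["ordinary input", "input with trigger"])


def adversarial_training_round(steps: int = 20) -> list:
    rng = random.Random(0)
    caught = []
    for _ in range(steps):
        x = propose_input(rng)
        if is_unacceptable(toy_model(x)):
            caught.append(x)  # in a real setup, these examples feed back into training
    return caught


if __name__ == "__main__":
    print(adversarial_training_round())
```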
Is “informed oversight” just another name for that problem, or a particular approach to solving it?
Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.
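(A toy version of the plagiarism example mentioned upthread illustrates why this matters for the reward function. `AgentOutput`, `naive_reward`, and `informed_reward` are invented names for illustration only.)

```python
# Toy illustration: an overseer that cannot access what the agent knows about
# its own action (here, whether the art was copied) creates the wrong incentive;
# an overseer with that knowledge creates the intended one.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentOutput:
    artwork: str
    copied_from: Optional[str]  # knowledge the agent has but need not reveal


def naive_reward(output: AgentOutput) -> float:
    # The overseer only sees the artifact, so plagiarism scores as well as
    # original work.
    return 1.0


def informed_reward(output: AgentOutput) -> float:
    # An overseer able to surface the agent's knowledge about the action can
    # penalize plagiarism and reward original work.
    return 0.0 if output.copied_from is not None else 1.0


if __name__ == "__main__":
    original = AgentOutput("a new painting", copied_from=None)
    plagiarized = AgentOutput("a familiar painting", copied_from="an obscure gallery")
    print(naive_reward(original), naive_reward(plagiarized))        # 1.0 1.0 -> wrong incentive
    print(informed_reward(original), informed_reward(plagiarized))  # 1.0 0.0 -> intended incentive
```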
If the latter, how is it different from “transparency”?
People discuss a bunch of techniques under the heading of transparency/interpretability, and they have a bunch of different goals in mind.
In the context of this sequence, transparency is relevant for two things:
Knowing what the agent knows, in order to evaluate its behavior.
Figuring out under what conditions the agent would behave differently, in order to facilitate adversarial training.
For both of those problems, it’s not obvious that the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it does look like transparency, it seems worth distinguishing the different goals of that research.
So that’s why there is a different name.
I thought those ideas would be enough to solve the more recent motivating example for “informed oversight” that Paul gave (training an agent to defend against network attacks).
(I disagreed with this upthread. I don’t think “convince the overseer that an action is good” obviously incentivizes the right behavior, even if the agent is allowed to offer an explanation; certainly we don’t have any particular argument that it would. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)