An update using the REINFORCE policy gradient estimator would have the form:
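(Roughly, writing M1's output distribution as a policy $\pi_\theta$ over answers $X$ and treating $M_2(X)$ as the reward, the standard REINFORCE update is $\theta \leftarrow \theta + \alpha \, M_2(X) \, \nabla_\theta \log \pi_\theta(X)$ with $X \sim \pi_\theta$; this is my rendering of the usual estimator, not necessarily the post's exact equation.)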
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (i.e., gets a high reward relative to other answers without actually being a good answer)? The update step would then increase the probability of M1 giving that output, and eventually M1 would give that output with high probability.
Do you know if Paul or anyone else has addressed this anywhere? For example, is the plan to make sure M2 has no such robustness problems (if so, how)?
If we have a perfect distillation algorithm, these both converge to $\operatorname{argmax}_X M_2(X)$ in the limit of infinite computation.
Maybe another way to address it would be, instead of doing maximization (in the limit of infinite computation), to do quantilization?
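To make the contrast concrete, here is a minimal sketch (entirely my own illustration; `base_sampler` and `m2_score` are hypothetical stand-ins for a base proposal distribution and M2's evaluation): a maximizer returns the single highest-scoring sample, so one exploitative X wins outright, while a q-quantilizer picks uniformly at random from the top q fraction, which caps the chance of landing on a rare exploit at roughly its base-distribution frequency divided by q.

```python
import random

def maximize(base_sampler, m2_score, n_samples=10_000):
    """Limit-of-compute behaviour: return the single sample M2 scores highest.
    If any sampled X exploits M2, it wins outright."""
    candidates = [base_sampler() for _ in range(n_samples)]
    return max(candidates, key=m2_score)

def quantilize(base_sampler, m2_score, q=0.01, n_samples=10_000):
    """q-quantilizer: rank samples by M2's score, then pick uniformly at random
    from the top q fraction, so rare exploitative samples only receive a bounded
    share of the probability mass."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=m2_score, reverse=True)
    top_k = max(1, int(q * n_samples))
    return random.choice(candidates[:top_k])
```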
ETA: I just noticed this part of the post:
But it might also cause problems if we can make a series of updates that cause the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL learned policy can differ from the original demonstration.

Is this talking about the same concern as mine?
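(For what it's worth, one standard way to make "don't differ too much from the original demonstration" precise is a KL penalty on the learned policy, e.g. maximizing $\mathbb{E}_{X \sim \pi_\theta}[M_2(X)] - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{demo}})$; that is my formalization, not something the post commits to.)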
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (i.e., gets a high reward relative to other answers without actually being a good answer)? The update step would then increase the probability of M1 giving that output, and eventually M1 would give that output with high probability.
I agree, and I think this problem occurs in both imitation IA and RL IA.
For example, is the plan to make sure M2 has no such robustness problems (if so, how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. One hope is that if M2 can increase the amount of computing/evaluation power it uses on each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (this assumes something like: for any adversarial example, any version of M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are currently at the IOU stage, and maybe Paul has a different way to put this picture that makes more sense.
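Here is a toy sketch of the first hope (purely illustrative and entirely my own framing; `m1_sample` and `m2_evaluate` are hypothetical helpers, and I'm assuming M2 can be run with an adjustable compute budget):

```python
def train_with_escalating_evaluation(m1_sample, m2_evaluate, n_steps,
                                     base_budget=1.0, growth=1.01,
                                     accept_threshold=0.9):
    """Toy loop: each new sample X is evaluated by M2 with a growing compute
    budget, and only samples that the better-resourced M2 still endorses are
    kept for the reinforcement update."""
    accepted = []
    budget = base_budget
    for _ in range(n_steps):
        x = m1_sample()                          # candidate answer from M1
        score = m2_evaluate(x, budget=budget)    # more compute -> harder to fool (assumption)
        if score >= accept_threshold:
            accepted.append((x, score))
        # else: conservatively reject, even though some genuinely good X are lost
        budget *= growth                         # keep raising the evaluation bar over time
    return accepted
```

Whether the "more compute makes M2 harder to fool" assumption actually holds is exactly the part that is still an IOU.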