An update using the REINFORCE policy gradient estimator would have the form:
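(Roughly, writing M1's output distribution as a policy $\pi_\theta$ over answers $X$ and treating $M_2(X)$ as the reward, the standard REINFORCE update is $\theta \leftarrow \theta + \alpha \, M_2(X) \, \nabla_\theta \log \pi_\theta(X)$ with $X \sim \pi_\theta$; this is my rendering of the usual estimator, not necessarily the post's exact equation.)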
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (i.e., gets a high reward relative to other answers without actually being a good answer)? The update step would then increase the probability of M1 giving that output, and eventually M1 would give that output with high probability.
Do you know if Paul or anyone else has addressed this anywhere? For example, is the plan to make sure M2 has no such robustness problems (if so, how)?
If we have a perfect distillation algorithm, these both converge to $\operatorname{argmax}_X M_2(X)$ in the limit of infinite computation.
Maybe another way to address it would be, instead of doing maximization (in the limit of infinite computation), to do quantilization?
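To make the contrast concrete, here is a minimal sketch (entirely my own illustration; `base_sampler` and `m2_score` are hypothetical stand-ins for a base proposal distribution and M2's evaluation): a maximizer returns the single highest-scoring sample, so one exploitative X wins outright, while a q-quantilizer picks uniformly at random from the top q fraction, which caps the chance of landing on a rare exploit at roughly its base-distribution frequency divided by q.

```python
import random

def maximize(base_sampler, m2_score, n_samples=10_000):
    """Limit-of-compute behaviour: return the single sample M2 scores highest.
    If any sampled X exploits M2, it wins outright."""
    candidates = [base_sampler() for _ in range(n_samples)]
    return max(candidates, key=m2_score)

def quantilize(base_sampler, m2_score, q=0.01, n_samples=10_000):
    """q-quantilizer: rank samples by M2's score, then pick uniformly at random
    from the top q fraction, so rare exploitative samples only receive a bounded
    share of the probability mass."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=m2_score, reverse=True)
    top_k = max(1, int(q * n_samples))
    return random.choice(candidates[:top_k])
```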
ETA: I just noticed this part of the post:
But it might also cause problems if we can make a series of updates that cause the learned answering system to behave very differently from the original human demonstrators. We might want to be careful about the degree to which an RL learned policy can differ from the original demonstration.

Is this talking about the same concern as mine?
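(For what it's worth, one standard way to make "don't differ too much from the original demonstration" precise is a KL penalty on the learned policy, e.g. maximizing $\mathbb{E}_{X \sim \pi_\theta}[M_2(X)] - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{demo}})$; that is my formalization, not something the post commits to.)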
If M2 has adversarial examples or other kinds of robustness or security problems, and we keep doing this training for a long time, wouldn't the training process sooner or later sample an X that exploits M2 (i.e., gets a high reward relative to other answers without actually being a good answer)? The update step would then increase the probability of M1 giving that output, and eventually M1 would give that output with high probability.
I agree, and I think this problem occurs in both imitation IA and RL IA.
For example, is the plan to make sure M2 has no such robustness problems (if so, how)?
I believe the answer is yes, and I think this is something that would need to be worked out/demonstrated. One hope is that if M2 can increase the amount of computing/evaluation power it uses on each new sample X as we take more samples, then you can keep taking more samples without ever accepting an adversarial one (this assumes something like: for any adversarial example, any version of M2 with at least some finite amount of computing power will reject it). There's maybe another hope that you could make M2 robust if you're allowed to reject many plausibly good X in order to avoid false positives. I think both of these hopes are currently at the IOU stage, and maybe Paul has a different way to put this picture that makes more sense.
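Here is a toy sketch of the first hope (purely illustrative and entirely my own framing; `m1_sample` and `m2_evaluate` are hypothetical helpers, and I'm assuming M2 can be run with an adjustable compute budget):

```python
def train_with_escalating_evaluation(m1_sample, m2_evaluate, n_steps,
                                     base_budget=1.0, growth=1.01,
                                     accept_threshold=0.9):
    """Toy loop: each new sample X is evaluated by M2 with a growing compute
    budget, and only samples that the better-resourced M2 still endorses are
    kept for the reinforcement update."""
    accepted = []
    budget = base_budget
    for _ in range(n_steps):
        x = m1_sample()                          # candidate answer from M1
        score = m2_evaluate(x, budget=budget)    # more compute -> harder to fool (assumption)
        if score >= accept_threshold:
            accepted.append((x, score))
        # else: conservatively reject, even though some genuinely good X are lost
        budget *= growth                         # keep raising the evaluation bar over time
    return accepted
```

Whether the "more compute makes M2 harder to fool" assumption actually holds is exactly the part that is still an IOU.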