The point of the distillation step, then, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to H.
Note that my proposed modification does allow for that, if the adversary predicts that both of the answers are sufficiently good that neither one needs to be recursed on. Tuning α in my version should allow you to get whatever sample efficiency you want. An annealing schedule could also make sense.
(Also, the sum isn’t a typo—I’m using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)
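To make that concrete, here is a rough sketch of how the α-gating could work, treating α as a threshold on the adversary's prediction. This is purely illustrative: every name (`model`, `adversary`, `predict_neg_loss`, `distill`, `H`) is a placeholder, and the two-answer setup and choice of distillation target are assumptions rather than anything pinned down above.

```python
def train_step(question, model, adversary, H, alpha):
    """One training step under the alpha-gated scheme (all names are placeholders).

    The adversary predicts the *negative* of the loss for each candidate answer,
    so higher predictions mean better answers. If both predictions clear the
    threshold alpha, we skip the call to H entirely and just distill; that is
    where the extra sample efficiency comes from.
    """
    answer_a = model.answer(question)
    answer_b = model.answer(question)

    pred_a = adversary.predict_neg_loss(question, answer_a)
    pred_b = adversary.predict_neg_loss(question, answer_b)

    if pred_a >= alpha and pred_b >= alpha:
        # Both answers predicted good enough: no recursion, no call to H.
        # Distill on whichever answer the adversary rates higher.
        target = answer_a if pred_a >= pred_b else answer_b
        model.distill(question, target)
    else:
        # At least one answer looks bad: spend a call to H to get a label,
        # then train both the model and the adversary on it.
        label = H(question, answer_a, answer_b)
        model.update(question, label)
        adversary.update(question, answer_a, answer_b, label)
```

In this picture, an annealing schedule would just amount to passing in a decreasing α as training progresses, so recursion on H becomes rarer over time.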
Ah, yeah, I see it now.