Given that you are training the model during amplification, I don’t really see why you also have a distillation step and an iteration step. I believe the point of that separation is to allow amplification to not involve ML at all, so that you can avoid dealing with the issues around bootstrapping, but if you train while amplifying, you are already bootstrapping. In addition, you’re requiring that exactly one subquestion be sent to the human, but it seems better to allow it to be zero, one, or two, depending on how confident the adversary is in the ML model’s answer. Concretely, I would get rid of both distillation and iteration, and change step 4 of the amplification procedure:
4. For i∈{1,2}, flip a biased coin ri∼Bernoulli(f(Si)), where f is a function that computes recursion probabilities from adversary scores. If ri=True, compute Ai by recursing on SQi, else set Ai=ML(SQi).
You could compute ri = [Si + ϵ1 < α + ϵ2], where ϵ1 and ϵ2 are independent noise samples, if you want to use a confidence threshold with Boltzmann exploration.
This new procedure allows for the behavior you get with distillation exactly in the cases where it actually makes sense: when the adversary thinks the ML answers to both subquestions are good, neither is recursed on, and you recover distillation.
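To make the modified step concrete, here is a minimal Python sketch; ml_answer, amplify, and adversary_score are hypothetical stand-ins for the ML model, the recursive amplification call, and the adversary, and the sigmoid form of f is just one way to get a soft confidence threshold around α:

```python
import math
import random

def recursion_probability(score, alpha=0.0, temperature=1.0):
    # f(S_i): a soft confidence threshold around alpha. Low adversary
    # scores mean the ML answer is less trusted, so we recurse more often;
    # temperature -> 0 recovers a hard threshold [S_i < alpha].
    return 1.0 / (1.0 + math.exp((score - alpha) / temperature))

def answer_subquestions(subquestions, ml_answer, amplify, adversary_score):
    # Modified step 4: for each subquestion SQ_i, either recurse or accept
    # ML's answer, based on a biased coin r_i ~ Bernoulli(f(S_i)).
    answers = []
    for sq in subquestions:
        proposed = ml_answer(sq)            # ML(SQ_i)
        s = adversary_score(sq, proposed)   # S_i, the adversary's score
        r = random.random() < recursion_probability(s)
        answers.append(amplify(sq) if r else proposed)
    return answers
```

Sampled this way, zero, one, or both subquestions can end up recursed on, depending on how confident the adversary is in the ML model’s answers.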
The last two adversary losses have a typo: you should be computing the difference between the adversary’s prediction and the true loss, not the sum.
Meta: I found this post quite hard to read, since everything was written in math with very little exposition.
I considered collapsing all of it into one (as Paul has talked about previously), but as you note the amplification procedure I describe here basically already does that. The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to H. I do agree that you could fold the iteration procedure described here into the amplification procedure, which is probably a good idea, though you’d probably want to anneal α in that situation, since Adv starts out really bad. In this setup, by contrast, you shouldn’t have to do any annealing: by the time you get to that point, Adv should be performing well enough that the amount of recursion automatically anneals as its predictions get better. Also, apologies for the math: I didn’t really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.
(Also, the sum isn’t a typo—I’m using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)
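To spell out why it reads as a sum (using a squared-error objective and the notation Adv(Q) for the adversary’s prediction on a question Q, both of which are illustrative assumptions rather than the post’s exact setup): if the adversary is trained to predict −L, where L is the true loss, its error term is

$$\bigl(\mathrm{Adv}(Q) - (-L)\bigr)^2 = \bigl(\mathrm{Adv}(Q) + L\bigr)^2,$$

so the prediction and the loss appear added together even though the prediction is still being compared against the true loss.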
I didn’t really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.
Makes sense. I think I couldn’t tell how much effort to put into understanding this until I had already understood it. I probably would have chosen not to read it if I had known in advance how long it would take and how important I would end up thinking it was (ex-post, not ex-ante). For posts where that’s likely to be true, I would push for not posting at all.
Another way to see this: given my current state of knowledge about this post, I think I could spend ~15 minutes making it significantly easier to understand. That version would probably have taken me more than 15 minutes less to read, for the same level of understanding.
I think it’s not worth making a post if you don’t get at least one person reading it in as much depth as I did; so you should at the very least be willing to trade some of your time for an equal amount of that reader’s time, and the benefit scales massively the more readers you have. The fact that this was not something you wanted to do feels like a fairly strong signal that the post isn’t worth posting, since it will waste other people’s time.
(Of course, it might have taken you longer than 15 minutes to make the post easier to understand, or readers might usually not take a whole 15+ minutes more to understand a post without exposition, but I think the underlying point remains.)
The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to H
Note that my proposed modification does allow for that, if the adversary predicts that both of the answers are sufficiently good that neither one needs to be recursed on. Tuning α in my version should allow you to get whatever sample efficiency you want. An annealing schedule could also make sense.
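If it helps, here is a minimal sketch of what annealing α might look like; the linear schedule, its endpoints, and the direction (starting high so that early on nearly everything is recursed on) are all assumptions for illustration, not something from the post:

```python
def annealed_alpha(step, total_steps, alpha_start=5.0, alpha_end=0.0):
    # Linearly anneal the confidence threshold alpha over training.
    # Early on, while Adv is still unreliable, alpha is high, so nearly
    # every subquestion is recursed on (close to pure amplification);
    # as alpha comes down, more ML answers are accepted and each episode
    # needs fewer calls to H.
    frac = min(step / total_steps, 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)
```

An exponential decay would work just as well; the point is only that the amount of recursion starts near pure amplification and relaxes as Adv’s predictions improve.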
(Also, the sum isn’t a typo—I’m using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)
Ah, yeah, I see it now.