Mmh, logP(x’|x) - logP(x’) is quite a crazy beast when I think about it. For instance, if it’s obvious from both x’<t and (x’<t, x) that the next word is “the” (logP ≈ 0), and if under both distributions the logP of seeing “avocado” is −10, then you get no penalty for outputting “avocado”? Is the main reason you don’t get gibberish that P(x’) has higher entropy than P(x’|x), so the network is rewarded for choosing one of the few high-probability tokens under P(·|x)? Or is it mostly that you will choose y’ so that P(y’|x’) can only be high for non-gibberish-looking text?
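To make the “avocado” worry concrete, here is a toy numeric sketch of the per-token reward. All probabilities are made up for illustration and `token_reward` is just a name I’m introducing; no real model is involved:

```python
import math

# Toy per-token reward r_t = logP(x'_t | x'_<t, x) - logP(x'_t | x'_<t).
def token_reward(logp_conditional: float, logp_unconditional: float) -> float:
    return logp_conditional - logp_unconditional

# "the" is near-certain under both distributions: reward is ~0.
print(token_reward(math.log(0.99), math.log(0.98)))  # ≈ 0.01

# "avocado" is equally unlikely (logP = -10) under both: reward is exactly 0,
# i.e. no penalty for the implausible token, as noted above.
print(token_reward(-10.0, -10.0))  # 0.0

# But P(.|x) is lower entropy than P(.), so its few high-probability tokens
# earn a positive reward relative to the broader unconditional distribution.
print(token_reward(math.log(0.6), math.log(0.05)))  # ≈ 2.48
```

So under this toy picture the reward never punishes a shared-tail token like “avocado”, but it does pay out for picking the conditional distribution’s modes, which is the entropy-gap mechanism asked about above.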
Also, since you have to use rejection sampling / RL (since you only get a single y’|x’ per sample), you probably don’t get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?
[Sidenote: realizing I should’ve introduced notation for auditor judgment. Let’s say A(x’,x) != A(x) denotes that the trusted model found x’ useful for updating its judgment on input x. A would usually also be a function of y and y’, but I’ll suppress that for brevity.]
In the above experiment, I sampled from P(x’|x) and then did best-of-N adjusted for P(x’), which amounts to using logP(x’|x) - c*logP(x’) for small c. Intuitively, when the two terms are equally weighted, the difference entirely removes the usual fluency constraint on x’, which, as you say, would reassign probability to gibberish (e.g. gibberish that happens to re-use tokens from x). I’m not too concerned about this: weighting the objectives with c lets us re-introduce the fluency constraint. The downstream auditor evaluation A(x’,x) != A(x) provides a more costly check that keeps the x’ reporter model’s exploration from devolving into useless, gibberish cases. As you say, P(y’|x’) provides another useful constraint.
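For concreteness, the best-of-N reranking described above could be sketched like this. The `sample_conditional`, `logp_conditional`, and `logp_unconditional` callables are hypothetical stand-ins for language-model calls, not anything from the original experiment:

```python
# Best-of-N sketch: sample candidate counterfactuals x' from P(x'|x),
# then rerank by logP(x'|x) - c*logP(x') for a small weight c.
def best_of_n(x, sample_conditional, logp_conditional, logp_unconditional,
              n=16, c=0.1):
    candidates = [sample_conditional(x) for _ in range(n)]

    def score(xp):
        return logp_conditional(xp, x) - c * logp_unconditional(xp)

    return max(candidates, key=score)

# Toy usage with mocked log-probs: candidate "b" is likely given x but
# unlikely unconditionally, so it wins the reranking.
pool = iter(["a", "b", "c"])
xp_best = best_of_n(
    x="input",
    sample_conditional=lambda x: next(pool),
    logp_conditional=lambda xp, x: {"a": -1.0, "b": -1.0, "c": -5.0}[xp],
    logp_unconditional=lambda xp: {"a": -1.0, "b": -8.0, "c": -1.0}[xp],
    n=3, c=0.5)
print(xp_best)  # "b": score = -1.0 - 0.5 * (-8.0) = 3.0
```

Note that c here plays exactly the fluency-knob role described above: at c=0 you recover plain sampling from P(x’|x), and larger c increasingly rewards candidates that are surprising unconditionally.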
I agree that traditional exploration concerns from RL likely remain. I think what you’re saying about modifying samples is that we have a dense reward here, i.e. logP(x’|x) - logP(x’), that can be cheaply evaluated for any token-level change? This would make exploration faster than in the regular critique/debate setting, where dense rewards can only be noisily estimated, e.g. via an advantage function. I’d agree with that.
On exploration, a second difference between this counterfactual critique and vanilla, just-ask-for-critique RL is specific to super-human domains involving ontology mismatch. Exploration in just-ask-for-critique RL systematically favors human-simulation++ arguments: such an agent may successfully explore its way to plenty of novel critiques without ever touching on latent knowledge obscured by ontology mismatch. This raises two concerns in the super-human domain: (1) is this just-ask-for-critique agent just finding solutions which are persuasive but not true (and not even related to the original input x)? (2) is this just-ask-for-critique scheme better described as a capabilities-enhancing scheme rather than capabilities elicitation?
In light of these concerns, we could reframe the issue as an inductive-bias problem that the logP(x’|x) - logP(x’) regularization seeks to fix, compared to directly exploring for x’ satisfying A(x’,x) != A(x).
> I think what you’re saying about modifying samples is that we have a dense reward here, i.e. logP(x’|x) - logP(x’), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.
This is not what I meant. Regularization terms (e.g. a KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of getting only one piece of “real” reward signal (y’|x’) per trajectory.
About the ontology-mismatch stuff: I think the core idea is not logP(x’|x) - logP(x’), but rather how you use sample-level exploration to spot contradictions and use them to update the predictor of y. I think this is a potentially cool and under-explored idea (though I’m not sure whether it applies beyond the kind of “near-miss false negative results” from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.
Ah I see, you’re right there.
Agreed that a lot will hinge on the training details working out. I plan to look into this.