Jacob Pfau comments on Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau 21 Feb 2024 0:27 UTC
1 point
Thanks for the comments. I’m very interested in getting clearer on which cases of ELK this method can and cannot be expected to elicit. Broadly, though, I do not agree that the proposed method is limited to “changes [current] human might have done when rewriting/paraphrasing the code” (I’m particularly worried about cases mentioned in the paragraph starting with ‘Another problem arises …’).

There are two mechanisms by which the sort of approach I’m suggesting helps us find a superset of human-simulation critiques/knowledge:

A) The logP(x’|x) - logP(x’) loss can be low even when the initial tokens of x’ are very low probability, i.e. cases in which a human would not be expected to start talking about x’. Any experiment fix that can be learned in context scores well on this loss; this may include most super-human fixes we care about! (Relying on ICL will likely break down when we move to heavily RL-ed models, but the easier LM setting seems like the sensible starting point for iterating on this sort of research.)
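
To make point A concrete, here is a minimal sketch of how the difference logP(x’|x) - logP(x’) decomposes token by token. The function names and the per-token log-probabilities are made up for illustration (they do not come from any real model), and the sketch does not fix a sign convention for turning the difference into a loss. It only shows that opening tokens which are improbable under both distributions cancel out, so the total is driven by the later tokens that x makes easier to predict.

```python
from typing import List, Sequence


def per_token_difference(
    logp_cond: Sequence[float], logp_uncond: Sequence[float]
) -> List[float]:
    """Per-token log P(x'_t | x'_<t, x) - log P(x'_t | x'_<t)."""
    if len(logp_cond) != len(logp_uncond):
        raise ValueError("both sequences must score the same tokens of x'")
    return [c - u for c, u in zip(logp_cond, logp_uncond)]


def counterfactual_score(
    logp_cond: Sequence[float], logp_uncond: Sequence[float]
) -> float:
    """Sequence-level log P(x'|x) - log P(x'), summed over tokens."""
    return sum(per_token_difference(logp_cond, logp_uncond))


# Hypothetical numbers: the first two tokens of x' are very unlikely both
# with and without x in context; the last two are much likelier given x.
example_cond = [-9.0, -8.5, -0.2, -0.1]    # log P(x'_t | x'_<t, x)
example_uncond = [-9.0, -8.0, -3.0, -2.5]  # log P(x'_t | x'_<t)

if __name__ == "__main__":
    print(per_token_difference(example_cond, example_uncond))
    print(counterfactual_score(example_cond, example_uncond))
```

In this toy case the two improbable opening tokens contribute roughly nothing to the total, even though a human-simulating sampler would almost never produce them.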

B) Now that we have a well-defined scoring function that includes super-human critiques, we can train a model against it via search-and-distillation, and generalization will help us explore toward solutions that would be exponentially slower to find via temperature sampling (or even zero weight on temperature sampling) on the original model. How well exploration works here is unclear, and evaluating this should be a priority!

Would be interested to hear where you disagree with these points.

As for exploration hacking, I agree that it’s a potential issue. I’d be particularly worried when the y predictor and the x’ predictor are both getting updated interactively. I’m not too concerned if we have a fixed y predictor, and just train the x’ predictor (I think of the latter as the control setting).
- Fabien Roger 21 Feb 2024 15:12 UTC
  2 points
  Mmh, logP(x’|x) - logP(x’) is quite a crazy beast when I think about it. For instance, if it’s obvious from x’<t and x’<t|x that the next word is “the” (logP ≈ 0), and that for both the logP of seeing “avocado” is −10, then you get no penalty from outputting “avocado”? Is the main reason why you don’t get gibberish that P(x’) is higher entropy than P(x’|x), and therefore the network is rewarded for choosing one of the few high prob tokens in x’|x? Or is it mostly that you will choose y’ so that P(y|x’) can only be high for non-gibberish looking text?
  Also, since you have to use rejection sampling / RL (since you only get a single y’|x’ per sample), you probably don’t get better exploration properties than regular RL? The main advantage is that you can modify samples instead of making changes to policies?
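
  Spelling out the arithmetic of the “avocado” case, assuming the score decomposes token by token (the symbol Δ_t is introduced here only for illustration and is not notation from the post):

  ```latex
  \Delta_t = \log P(x'_t \mid x'_{<t}, x) - \log P(x'_t \mid x'_{<t}),
  \qquad
  \Delta_t(\text{the}) \approx 0 - 0 = 0,
  \qquad
  \Delta_t(\text{avocado}) = -10 - (-10) = 0.
  ```

  Under that assumption, the obvious token and the implausible token receive the same per-token contribution, which is the source of the gibberish worry.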
  - Jacob Pfau 21 Feb 2024 22:55 UTC
    1 point
    [Sidenote: I realize I should’ve introduced notation for the auditor’s judgment. Let’s say A(x’,x)!=A(x) denotes that the trusted model found x’ useful for updating its judgment on input x. A would usually also be a function of y and y’, but I’ll suppress that for brevity.]
    
    In the above experiment, I sampled from P(x’|x) and then applied a best-of-N (BoN) adjustment for P(x’), which amounts to using logP(x’|x) - c*logP(x’) for small c. Intuitively, when equally weighted, the sum entirely removes the usual fluency constraint on x’, which, as you say, would reassign probability to gibberish (e.g. gibberish happening to re-use tokens from x’). I’m not too concerned about this: weighting the objectives with c lets us re-introduce the fluency constraint. The auditor model’s downstream evaluation, A(x’,x)!=A(x), provides a more costly check that keeps the x’ reporter model’s exploration from devolving into useless, gibberish cases. As you say, P(y’|x’) provides another useful constraint.
    
    I agree that traditional exploration concerns in RL likely remain. I think what you’re saying about modifying samples is that we have a dense reward here, i.e. logP(x’|x) - logP(x’), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function. I’d agree with that.
    
    On exploration, a second difference between this counterfactual critique and vanilla, just-ask-for-critique RL is specific to super-human domains involving ontology mismatch. Exploration in just-ask-for-critique RL systematically favors human-simulation++ arguments: such a just-ask-for-critique agent may successfully explore to plenty of novel critiques without ever touching on latent knowledge obscured by ontology mismatch. This raises two concerns in the super-human domain: (1) is this just-ask-for-critique agent just finding solutions which are persuasive but not true (and not even related to the original input x)? (2) is this just-ask-for-critique scheme better described as capabilities enhancement rather than capabilities elicitation?
    
    In light of these concerns, we could reframe the issue as an inductive bias problem which logP(x’|x) - logP(x’) regularization seeks to fix when compared to directly exploring for x’ satisfying A(x’,x)!=A(x).
    - Fabien Roger 22 Feb 2024 17:11 UTC
      2 points
      “I think what you’re saying about modifying samples is that we have a dense reward here, i.e. logP(x’|x) - logP(x’), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.”
      This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of only having one piece of “real” reward information (y’|x’) per trajectory.
      About the ontology mismatch stuff, I think the core idea is not logP(x’|x) - logP(x’), but the idea about how you use sample-level exploration to spot contradictions and use that to update the predictor of y. I think this is a potentially cool and under-explored idea (though I’m not sure if it is applicable beyond the kind of “near-miss false negative results” from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.
      - Jacob Pfau 22 Feb 2024 19:48 UTC
        1 point
        Ah I see, you’re right there.
        
        Agreed that a lot will hinge on the training details working out. I plan to look into this.