Fabien Roger comments on Auditing LMs with counterfactual search: a tool for control and ELK

Fabien Roger 22 Feb 2024 17:11 UTC
2 points
I think what you’re saying about modifying samples is that we have a dense reward here, i.e. logP(x’|x) - logP(x’), that can be cheaply evaluated for any token-level change? This makes exploration faster compared to the regular critique/debate setting where dense rewards can only be noisily estimated as e.g. in an advantage function.
This is not what I meant. Regularization terms (e.g. KL divergence) are almost always dense anyway. I was pointing out that you have the classic RL problem of having only one piece of “real” reward information (y’|x’) per trajectory.
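To make the distinction concrete, here is a minimal sketch, assuming an autoregressive model so that logP(x’|x) - logP(x’) decomposes into per-token terms. The function names, the `beta` weight, and the way the terminal signal is added are illustrative assumptions, not anything from the post: the regularizer yields one value per token, while the “real” reward shows up only once, at the end of the trajectory.

```python
from typing import List


def dense_regularizer(cond_logps: List[float], prior_logps: List[float]) -> List[float]:
    """Per-token log P(x'_t | x, x'_<t) - log P(x'_t | x'_<t).

    By the chain rule, these terms sum to log P(x'|x) - log P(x').
    Both inputs are hypothetical per-token log-probs from some model.
    """
    if len(cond_logps) != len(prior_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [c - p for c, p in zip(cond_logps, prior_logps)]


def per_token_rewards(
    cond_logps: List[float],
    prior_logps: List[float],
    terminal_reward: float,
    beta: float = 1.0,
) -> List[float]:
    """Combine the dense regularizer with a single sparse terminal reward.

    terminal_reward stands in for the one "real" signal (y'|x') available
    per trajectory; beta is an assumed weighting on the regularizer.
    """
    dense = dense_regularizer(cond_logps, prior_logps)
    if not dense:
        raise ValueError("trajectory must contain at least one token")
    rewards = [beta * d for d in dense]
    rewards[-1] += terminal_reward  # the only non-regularizer information
    return rewards
```

Every position gets a regularization value, but only the last position carries information about y’, which is the usual sparse-reward credit assignment problem.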
About the ontology mismatch stuff, I think the core idea is not logP(x’|x) - logP(x’), but how you use sample-level exploration to spot contradictions and then use those contradictions to update the predictor of y. I think this is a potentially cool and under-explored idea (though I’m not sure it is applicable beyond the kind of “near-miss false negative results” from your code example). A follow-up post I would find very interesting would flesh out exactly how this training works, detail a broader range of situations where it is useful, and demonstrate that it works in a toy setting.
- Jacob Pfau 22 Feb 2024 19:48 UTC
  1 point
  Ah I see, you’re right there.
  
  Agreed that a lot will hinge on the training details working out. I plan to look into this.