I enjoyed this post, especially because it pointed out argumentation games to me (even if I put no particular stock in debate as a good alignment mechanism, I wasn’t aware of this prior literature).
However, sometimes this comes off like asking the authors of a paper on human psychology why they didn’t cite a paper on mouse psychology. Certainly psychologists studying humans should at least be aware of the existence of research on mice, and maybe some of its highlights, but not every psychology paper needs a paragraph about mice.
Similarly, not talking about interpretability of decision trees or formal languages is no particular failing for a paper on interpretability of deep neural nets. (Though if we do start mapping circuits to formal languages, of course I hope to see useful citations.)
The case of preferences, and citing early work on preferences, seems more nuanced. To a large extent it’s more like engineering than science (you’re trying to solve a problem), so which previous attempts at a solution should you compare against to help readers? I haven’t read that Working With Preferences book you link (is it any good, or did you just google it up?), but I imagine it’s pretty rare that I’ll want a value learning paper to compare to prospect theory or nonmonotonic logics (though the fraction that would benefit from talking about nonmonotonic logic sounds pretty interesting).
Thanks, appreciated.
My expectation is not that one should cite all generally relevant research. Rather, one is expected to quickly browse the existing literature, at least by similar keywords, for seemingly closely related works. If anything comes up, there are a few options:
1. Just mention the work. This shows the reader that you’ve at least browsed around, and it also gives somebody else the opportunity to explore that relevant angle. Arguably, interpretability researchers who aim at actually figuring out model internals must be aware of formal verification of DNNs and should position their work accordingly.
2. Mention the work and argue that it is not applicable. This additionally justifies one’s motivation. I’ve seen this done with respect to mech interp in one of the main papers, where the typical heuristic explainability research was discarded as inapplicable, and appropriately so. A similar argument could be made regarding formal verification.
3. Consider using the existing approach, or show what concrete problem it fails to address. This may even further incentivise any proponents of the existing approach to improve that work.
In short, it’s not so much about citing others’ work (that’s only the metrics game) as about giving others the opportunity to relate your work to theirs (that’s science).
Re preferences: the book is fine (yes, I’ve read it), but reading it in full is for somebody working deeply on preference-based reasoning. The point is to skim it, or at least pick up the conceptual claims about what the research covers. A relatively quick browse through such materials should make it clear that value learning and value-based reasoning have been extensively studied. Again, using one of the three options above, there may be opportunities both to identify new problems and to point one’s readers towards (or away from) other lines of inquiry.