I am happy to take a “non-worst-case” empirical perspective in studying this problem. In particular, I suspect it will be very helpful – and possibly necessary – to use incidental empirical properties of deep learning systems, which often have a surprising amount of useful emergent structure (as I will discuss more under “Intuitions”).
One reason I feel sad about depending on incidental properties is that it likely implies the solution isn’t robust enough to optimize against. This is a key desideratum for an ELK solution. I imagine this optimization would typically come from two sources:
Directly trying to train against the ELK outputs (IMO quite important)
The AI gaming the ELK solution by manipulating its thoughts (IMO not very important, but it depends on exactly how non-robust the solution is)
That’s not to say non-robust ELK approaches aren’t useful: as long as you never apply too many bits of optimization against the approach, it should remain a useful test for other techniques.
It also seems plausible that work like this could eventually lead to a robust solution.
I agree this proposal wouldn’t be robust enough to optimize against as stated, but this doesn’t bother me much for a couple of reasons:
This seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing—at a minimum, I think solving this would teach us a lot.
It seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent significant progress and would be a very useful tool for more robust solutions.
I have some more detailed thoughts about how we could extend this to a full/robust solution (though I’ve also deliberately thought much less about that than how to solve this sub-problem), but I don’t think that’s really the point—this already seems like a pretty robustly good problem to work on to me.
(But I do think this is an important point that I forgot to mention, so thanks for bringing it up!)