Collin comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin 20 Dec 2022 1:35 UTC
LW: 20 AF: 10
There were a number of iterations with major tweaks. It went something like:
- I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible.
- I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal—a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, which led to what we refer to in the paper as CRC-TPC.
- That method had some issues, but we also found low-hanging fruit, in the sense that a good direction often appeared in one of the top 2 principal components (instead of just the top one). It also seemed kind of weird to really care about high-variance directions when variance isn’t necessarily functionally meaningful (since you can rescale subsequent layers).
- This led to CRC-BSS, which is scale-invariant. This worked better (a bit more reliable, seemed to work well in cases where the good direction was in the top 2 principal components, etc.). But it was still based on the original intuition of clustering.
- I started developing the intuition that “old school” or “geometric” unsupervised methods (like clustering) can be decent but that they’re not really the right way to think about things relative to a more “functional” deep learning perspective. I also thought we should be able to do something similar without explicitly relying on linear structure in the representations, and eventually came to interpret what CRC is doing as finding a direction that satisfies consistency properties. After another round of experimentation with the method, this finally led to CCS.
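For readers who want something concrete, here is a rough sketch of the two ends of that progression as I understand them from the paper: a PCA-style direction on contrast-pair differences (roughly the CRC-TPC idea) and a consistency-plus-confidence loss on probabilities (roughly the CCS idea). The function names, the per-feature normalization, and the small epsilon are assumptions made for illustration, not the paper's actual code.

```python
import numpy as np

def normalize(h):
    # Assumed preprocessing: center and scale each feature across examples.
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def top_pc_direction(h_pos, h_neg):
    """Top principal component of the contrast-pair differences.

    h_pos, h_neg: arrays of shape (n_examples, hidden_dim) holding hidden
    states for the "yes" and "no" versions of each statement (hypothetical
    inputs for this sketch).
    """
    diffs = normalize(h_pos) - normalize(h_neg)
    diffs = diffs - diffs.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

def ccs_loss(p_pos, p_neg):
    """Consistency + confidence loss on probe outputs in [0, 1]."""
    # Consistency: p(x+) and p(x-) should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate p(x+) = p(x-) = 0.5 solution.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

The PCA version only looks at the geometry of the differences, while the loss only asks that the probe's outputs behave like probabilities of a statement and its negation, which is the "functional" framing mentioned above.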
Each stage required a number of iterations to get various details right (and even then, I’m pretty sure I could continue to improve things with more iterations like that, but decided that’s not really the point of the paper or my comparative advantage).
In general I do a lot of back and forth between thinking conceptually about the problem for long periods of time to develop intuitions (I’m extremely intuition-driven) and periods where I focus on experiments inspired by those intuitions.
I feel like I have more to say on this topic, so maybe I’ll write a future post about it with more details, but I’ll leave it at that for now. Hopefully this is helpful.
- Charlie Steiner 20 Dec 2022 1:50 UTC
  LW: 3 AF: 1
  Just what I wanted :D