Joseph Bloom comments on Don’t Dismiss Simple Alignment Approaches

Joseph Bloom 7 Oct 2023 7:59 UTC
LW: 32 AF: 9
18
AF
My vibe from this post is something like “we’re making progress on stuff that could be helpful so there’s stuff to work on!” and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you’re likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?)

I also think that if the goal is to actually make progress, and not to maximize the number of people making progress or who feel like they’re making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However, as a directional update for people who are otherwise pretty cynical, this seems like a good update.
- Chris_Leong 20 Dec 2023 9:09 UTC
  LW: 2 AF: 1
  0
  AF Parent
  Hmm… I suppose this is pretty good evidence that CCS may not be as promising as it first appeared, esp. the banana/shed results.
  
  https://www.lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
  
  Update: Seems like the banana results are being challenged.
- Chris_Leong 7 Oct 2023 9:01 UTC
  2 points
  −1
  Parent
  I wrote in one of my footnotes:
  You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well
  ie. it belongs in the reference class when figuring out the difficulty of making progress on alignment.
  Regarding scalability, I wrote:
  
  I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and also makes it easier to empirically discover the key issues in solving a problem more fully.
  Maybe that addresses the point about not scaling to some extent, although definitely not completely. It’s also possible that some of these techniques are effectively a trap in that they’d appear to work up to a certain level, so we’d start to rely on them and then we’d end up getting screwed.
  - Nate Showell 8 Oct 2023 17:13 UTC
    3 points
    0
    Parent
    Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model’s representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
    - Chris_Leong 9 Oct 2023 1:22 UTC
      3 points
      0
      Parent
      I don’t know, but would love to find out.
      - Nate Showell 9 Nov 2023 2:51 UTC
        1 point
        0
        Parent
        I asked on Discord and someone told me this:
        A simple way to quantify this: first define a “feature” as some decision boundary over the data domain, then train a linear classifier to predict that decision boundary from the network’s activations on that data. Quantify the “linearity” of the feature in the network as the accuracy that the linear classifier achieves.
        For example, train a classifier to detect when some text has positive or negative sentiment, then pass the same text through some pretrained LLM (e.g. BERT) whose “feature-linearity” you’re trying to measure, and try to predict the sentiment from BERT’s activation vectors using linear regression. The accuracy of this linear model tells you how linear the “sentiment” feature is in your LLM.
    - Thomas Kwa 5 Nov 2023 22:06 UTC
      2 points
      0
      Parent
      IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing if they can inhibit the model’s use of those concepts including during further training. It seems hard to ensure otherwise that there is not some gap between your definitions and reality.
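
The linear-probe metric described in Nate Showell’s reply above can be sketched roughly as follows. This is a minimal illustration, not the method anyone in the thread actually ran: the activations are synthetic stand-ins for real model activations (no BERT is loaded), the probe is a logistic-regression classifier rather than the linear regression mentioned in the quote, and all function names (`make_synthetic_activations`, `fit_linear_probe`, `feature_linearity`) are hypothetical. Accuracy is measured on a held-out split, which is also an assumption added here.

```python
import math
import random


def make_synthetic_activations(n=600, dim=8, linear=True, seed=0):
    """Hypothetical stand-in for real activations (e.g. BERT hidden states)."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    acts, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        if linear:
            # The "feature" is a half-space, so a linear probe can read it.
            y = int(sum(d * v for d, v in zip(direction, x)) > 0)
        else:
            # The "feature" is an XOR of two coordinate signs,
            # which no single linear boundary can capture.
            y = int(x[0] * x[1] > 0)
        acts.append(x)
        labels.append(y)
    return acts, labels


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_probe(acts, labels, lr=0.5, steps=200):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def feature_linearity(acts, labels, train_frac=0.8):
    """Held-out accuracy of the linear probe, used as the linearity score."""
    split = int(len(acts) * train_frac)
    w, b = fit_linear_probe(acts[:split], labels[:split])
    correct = 0
    for x, y in zip(acts[split:], labels[split:]):
        pred = int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
        correct += int(pred == y)
    return correct / (len(acts) - split)


if __name__ == "__main__":
    for name, is_linear in [("half-space feature", True), ("XOR feature", False)]:
        acts, labels = make_synthetic_activations(linear=is_linear)
        print(name, feature_linearity(acts, labels))
```

To apply this to a real model, one would replace `make_synthetic_activations` with activations extracted from the model on labelled text. Comparing the score across model sizes, as the original question suggests, would also require holding the dataset and the layer-selection rule fixed.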