The basic idea is not new to me. I can’t recall where, but I think I’ve seen a talk observing that linear combinations of neurons, rather than individual neurons, are what you’d expect to be meaningful (under some assumptions), because linear combinations are how the next layer of neurons looks at a layer. Since linear combinations are what matter to the network, it would be weird if individual neurons turned out to be particularly meaningful (a quick sketch of this point is below). This wasn’t even surprising to me when I first learned about it.
But it’s great to see it illustrated so well!
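To spell out the “how the next layer looks at a layer” point: each next-layer pre-activation is just a dot product between a weight row and the previous layer’s activation vector, so the downstream computation only ever sees directions (linear combinations), never a privileged per-neuron view. A minimal numpy sketch with made-up toy dimensions (nothing here is from the paper):

```python
import numpy as np

# Made-up toy dimensions, purely illustrative.
d_hidden, d_next = 8, 4
rng = np.random.default_rng(0)

h = rng.normal(size=d_hidden)            # activations of one layer
W = rng.normal(size=(d_next, d_hidden))  # weights into the next layer
b = rng.normal(size=d_next)

# Each next-layer pre-activation is a dot product of h with a row of W:
# the next layer only ever "sees" linear combinations (directions) of h,
# not individual neurons as such.
pre_act = W @ h + b
for i in range(d_next):
    assert np.isclose(pre_act[i], W[i] @ h + b[i])
```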
In my view, this provides relatively little insight into the hard questions of what it even means to understand what is going on inside a network (so, for example, it doesn’t provide any obvious progress on the hard version of ELK). So how useful this ultimately turns out to be for aligning superintelligence depends on how useful “weak methods” in general are (i.e., methods with empirical validation but which don’t come with strong theoretical arguments that they will work in general).
That being said, I am quite glad that such good progress is being made, even if it comes from what I would classify as “weak methods”.
How would you distinguish between weak and strong methods?
“Weak methods” means confidence is achieved more empirically, so there’s always a question of how well the results will generalize to some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc.). “Strong methods” means there’s a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what’s happening, so there is much less doubt about which systems the method will successfully apply to.
as we scale existing technology up or change details of NN architectures, gradient methods, etc
I think most practical alignment techniques have scaled quite nicely, with CCS maybe being an exception, and we don’t currently know how to scale the interp advances in OP’s paper.
Blessings of scale (IIRC): RLHF, constitutional AI / AI-driven dataset inclusion decisions / meta-ethics, activation steering / activation addition (LLAMA2-chat results forthcoming), adversarial training / redteaming, prompt engineering (though RLHF can interfere with responsiveness),…
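On the activation steering / activation addition item: the general recipe is simple enough to sketch. This is just an illustrative numpy version with made-up shapes and coefficient, not the actual implementation behind the forthcoming LLAMA2-chat results:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512  # assumed hidden size, purely illustrative

# Stand-ins for the layer-L activations of a contrast pair of prompts;
# in the real technique these come from actual forward passes.
h_positive = rng.normal(size=d_model)
h_negative = rng.normal(size=d_model)

# Steering vector: the difference between the two activation vectors.
steering_vector = h_positive - h_negative

def steer(h, coeff=4.0):
    """Activation addition: shift this layer's activations along the
    steering direction; downstream layers see the shifted representation."""
    return h + coeff * steering_vector

# In practice this is applied at a chosen layer during generation
# (e.g. via a forward hook), with the layer and coeff tuned empirically.
```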
I think the prior strongly favors “scaling boosts alignability” (at least in “pre-deceptive” regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).
“Weak methods” means confidence is achieved more empirically
I’d personally say “empirically promising methods” instead of “weak methods.”