> How would you distinguish between weak and strong methods?

“Weak methods” means confidence is achieved more empirically, so there’s always a question of how well the results will generalize to some new AI system (as we scale existing technology up or change details of NN architectures, gradient methods, etc.). “Strong methods” means there’s a strong argument (most centrally, a proof) based on a detailed gears-level understanding of what’s happening, so there is much less doubt about which systems the method will successfully apply to.
> as we scale existing technology up or change details of NN architectures, gradient methods, etc.
I think most practical alignment techniques have scaled quite nicely, with CCS (Contrast-Consistent Search) maybe being an exception, and we don’t currently know how to scale the interpretability advances in the OP’s paper.
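For concreteness, here is a minimal sketch of what the CCS objective looks like: a small probe is trained so that its probabilities on the “true”/“false” versions of each contrast pair are consistent (sum to roughly 1) without collapsing to the uninformative 0.5 answer. The random tensors below are stand-ins for real model activations, and the dimensions, learning rate, and step count are arbitrary assumptions, not the settings from the original paper.

```python
# Sketch of the Contrast-Consistent Search (CCS) training objective.
# Placeholder data only: in the real method these would be hidden
# activations of a language model on contrast pairs, and the activations
# for each side of the pair would also be normalized first (omitted here).
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 64   # assumed activation dimension
n_pairs = 256  # assumed number of contrast pairs

acts_pos = torch.randn(n_pairs, d_model)  # "statement X is true" activations
acts_neg = torch.randn(n_pairs, d_model)  # "statement X is false" activations

probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    # Consistency: the two probabilities should sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate p_pos = p_neg = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```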
Blessings of scale (IIRC): RLHF, constitutional AI / AI-driven dataset inclusion decisions / meta-ethics, activation steering / activation addition (Llama-2-chat results forthcoming), adversarial training / red-teaming, prompt engineering (though RLHF can interfere with responsiveness),…
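As a rough illustration of why activation steering is cheap to try as models grow, here is a toy sketch of activation addition: a fixed vector is added to one layer’s output through a forward hook, with no gradient updates at all. The tiny MLP, the random steering vector, and the coefficient are illustrative assumptions, not the forthcoming Llama-2-chat setup; in ActAdd-style steering the vector would come from the difference in activations between a “positive” and a “negative” prompt.

```python
# Toy sketch of activation steering / activation addition via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32),  # layer whose output we will steer (index 2)
    nn.ReLU(),
    nn.Linear(32, 4),
)

# Stand-in steering vector; in practice: activations("positive prompt")
# minus activations("negative prompt"), times a chosen coefficient.
steering_vector = torch.randn(32)
coeff = 4.0

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + coeff * steering_vector

x = torch.randn(1, 16)
baseline = model(x)

handle = model[2].register_forward_hook(add_steering)
steered = model(x)
handle.remove()

print(baseline)
print(steered)
```

That this needs only a forward hook and no retraining is part of why steering-style interventions have been cheap to test as the underlying models scale.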
I think the prior strongly favors “scaling boosts alignability” (at least in “pre-deceptive” regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).
> “Weak methods” means confidence is achieved more empirically
I’d personally say “empirically promising methods” instead of “weak methods.”