starship006 comments on Shallow review of live agendas in alignment & safety

starship006 28 Nov 2023 4:03 UTC
3 points
0
Reverse engineering. Unclear if this is being pushed much anymore. 2022: Anthropic circuits, Interpretability In The Wild, Grokking mod arithmetic
FWIW, I was one of Neel’s MATS 4.1 scholars, and I would classify ³⁄₄ of Neel’s scholars’ outputs as reverse engineering some component of LLMs (for completeness, this is the other one, which doesn’t fit neatly as ‘reverse engineering’ imo). I would also say that this is still an active direction of research (lots of ground to cover with MLP neurons, polysemantic heads, and more).
- technicalities 28 Nov 2023 10:03 UTC
  2 points
  0
  You’re clearly right, thanks