The field of complex systems seems like a rich source of ideas for interpretability and alignment. In lieu of a longer comment, I'll just leave this excellent review by Teehan et al. on emergent structures in LLMs. Section 3 in particular is well worth reading.