Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.
If you want to build on Li et al. (the Othello paper), my follow-up work is likely to be a useful starting point, followed by the post I wrote about the future directions I’m particularly excited about.