Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability and more of a focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be:
- Using deep learning to automate deep learning interpretability: for example, training a language model to give semantic labels to neurons or other internal circuits (see the sketch after this list).
- Studying the high-level algorithms that models use to perform, e.g., in-context learning or prompt programming.
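As a rough illustration of the first approach, here is a minimal sketch of LLM-assisted neuron labeling: gather the dataset snippets that most strongly activate a given neuron, then ask a language model for a short semantic label. Every name here (NeuronRecord, build_labeling_prompt, query_llm) is a hypothetical placeholder, not anything from the announcement post; a real pipeline would supply actual activation data and a real model-querying function.

```python
# Minimal sketch of LLM-assisted neuron labeling (all helper names are hypothetical).
# Idea: show a language model the text snippets that most strongly activate a neuron
# and ask it to propose a short semantic label for that neuron.

from dataclasses import dataclass


@dataclass
class NeuronRecord:
    layer: int
    index: int
    top_snippets: list[str]  # dataset excerpts with the highest activation for this neuron


def build_labeling_prompt(record: NeuronRecord) -> str:
    """Format the top-activating snippets into a prompt asking for a one-line label."""
    examples = "\n".join(f"- {s}" for s in record.top_snippets)
    return (
        f"The following text excerpts most strongly activate neuron "
        f"{record.index} in layer {record.layer} of a language model:\n"
        f"{examples}\n\n"
        "In a short phrase, what concept or pattern does this neuron appear to detect?"
    )


def label_neuron(record: NeuronRecord, query_llm) -> str:
    """query_llm is any callable that sends a prompt string to a language model
    and returns its text response (e.g. a thin wrapper around a chat API)."""
    return query_llm(build_labeling_prompt(record))


if __name__ == "__main__":
    record = NeuronRecord(
        layer=12,
        index=4321,
        top_snippets=[
            "The invoice is due on March 3rd, 2021.",
            "Her birthday falls on July 14th every year.",
            "The deadline was moved to October 1st.",
        ],
    )
    # Stub LLM so the sketch runs without an API key; in practice this would call a real model.
    print(label_neuron(record, query_llm=lambda prompt: "dates / calendar references"))
```

Keeping the model call behind a plain callable keeps the sketch runnable on its own and makes it easy to swap in whichever chat-completion interface is actually used.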
In their announcement post they mention:
Thanks!