This project explores the use of Sparse Autoencoders (SAEs) to track the development of features in large language models throughout their training. It investigates whether features can be reliably matched between different SAEs trained on various checkpoints of Pythia 70M and characterises the development of these features over the course of training. The findings show that features can successfully be matched between different SAEs. The results also support the distributional simplicity bias hypothesis, indicating that simpler features are learned early in training, with more complex features emerging later. While the focus was on a relatively small model, the results lay the groundwork for future research into larger models and the identification of potentially deceptive capabilities. This work aims to enhance the interpretability and safety of AI systems by providing a deeper understanding of feature development and the dynamics of neural network training, with the ultimate goal of making a general statement about the overall likelihood of dangerous deceptive alignment arising in practice.
I think it would also be fascinating to compare models before, during, and after RLHF.