Very interesting work! This is only a half-formed thought, but the diagrams you’ve created very much remind me of similar diagrams used to display learned “topics” in classic topic models like Latent Dirichlet Allocation (Figure 8 from the LDA paper is below):
I think there’s possibly something to be gained by viewing what the MLPs and attention heads are learning as something like “topic models”—and it may be the case that some of the methods developed for evaluating topic interpretability and coherence will be valuable here. A couple of references:
Reading Tea Leaves: How Humans Interpret Topic Models (Chang et al., 2009)
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality (Lau, Newman & Baldwin, 2014)
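To make that suggestion slightly more concrete, here is a minimal Python sketch of the kind of NPMI-based coherence score discussed in the Lau, Newman & Baldwin paper, applied to the top-activating tokens of a neuron or attention head treated as a “topic”. The function name, the toy reference corpus, and the choice of one co-occurrence window per document are my own illustrative assumptions, not anything taken from either paper or from your work.

```python
# Sketch: NPMI coherence of a "topic" given by a neuron's top-activating tokens,
# estimated from document-level co-occurrence counts in a reference corpus.
# All names and data here are hypothetical, for illustration only.

import math
from collections import Counter
from itertools import combinations


def npmi_coherence(top_tokens, documents, eps=1e-12):
    """Average pairwise NPMI of `top_tokens`, estimated from `documents`.

    `documents` is a list of token lists serving as the reference corpus;
    each document is treated as a single co-occurrence window.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    # Document frequencies for single tokens and for token pairs.
    single = Counter()
    pair = Counter()
    for doc in doc_sets:
        present = [t for t in top_tokens if t in doc]
        single.update(present)
        pair.update(combinations(sorted(present), 2))

    scores = []
    for w1, w2 in combinations(sorted(set(top_tokens)), 2):
        p1 = single[w1] / n_docs
        p2 = single[w2] / n_docs
        p12 = pair[(w1, w2)] / n_docs
        if p1 == 0 or p2 == 0 or p12 == 0:
            # Conventional floor when a pair never co-occurs in the corpus.
            scores.append(-1.0)
            continue
        pmi = math.log(p12 / (p1 * p2))
        denom = -math.log(p12) if p12 < 1.0 else eps
        scores.append(pmi / denom)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical example: tokens a neuron fires strongly on, scored
    # against a toy reference corpus.
    neuron_top_tokens = ["piano", "violin", "orchestra", "melody"]
    corpus = [
        ["the", "orchestra", "played", "a", "melody", "on", "piano"],
        ["violin", "and", "piano", "duet"],
        ["stock", "prices", "fell", "sharply"],
    ]
    print(npmi_coherence(neuron_top_tokens, corpus))
```

The appeal of this family of metrics is that Lau et al. found they correlate reasonably well with human judgments of topic interpretability, so something similar might offer a cheap automatic proxy for how “interpretable” a given neuron or head looks before doing human evaluation.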