By “crisp internal representations” I mean that:

1. The latent variables the model uses to perform tasks are latent variables that humans use. This is contra the idea that language models reason in completely alien ways. I agree that the two cited works are not particularly strong evidence :( (One way this claim gets operationalised is sketched just below this list.)
2. The model uses these latent variables with a bias towards shorter algorithms (e.g. shallower paths in transformers). This is important because it’s possible that, even when performing really narrow tasks, models could use a very large number of (individually understandable) latent variables in algorithms so long that no human could feasibly understand what’s going on. (The second sketch below makes the “paths” framing concrete.)
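To make the first point a bit more concrete, here is a minimal sketch of the kind of linear-probing check interpretability work often uses to ask whether a human concept is represented in a model’s activations. Everything in it is a stand-in: the “activations” are synthetic (a planted concept direction plus noise) rather than real transformer activations, so it illustrates the methodology, not a result.

```python
# Sketch of a linear probe: if a human-understandable binary concept is
# linearly decodable from a model's activations, that is (weak) evidence the
# model represents a latent variable humans also use. The "activations" below
# are synthetic stand-ins: random vectors plus a planted concept direction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_examples = 512, 4000

# Hypothetical human concept, e.g. "this token is inside a quotation".
labels = rng.integers(0, 2, size=n_examples)

# Plant the concept along a single direction in activation space, plus noise.
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
activations = rng.normal(size=(n_examples, d_model))
activations += 2.0 * labels[:, None] * concept_direction

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.25, random_state=0
)

# High held-out probe accuracy is one operationalisation of "the model uses a
# latent variable humans use" (here it holds by construction).
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```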
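And to make the “shallower paths” framing in the second point concrete: in a purely linear residual stack, the composition (I + W_L)···(I + W_1) expands into one term per subset of layers, i.e. one “path” per composition depth, loosely following the path-expansion view from the transformer-circuits line of work. The toy below uses small random weight matrices, purely as stand-ins for a trained model, to show how one can ask how much of the computation is carried by paths of each depth.

```python
# Toy path expansion: a linear residual stack (I + W_L)...(I + W_1) is a sum
# of 2^L path terms, one per subset of layers used. Grouping the terms by how
# many layers they pass through gives a crude notion of "algorithm depth".
# Weights are small random matrices, not taken from any trained model.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 32, 4
layers = [0.3 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]

for depth in range(n_layers + 1):
    total_norm = 0.0
    for subset in itertools.combinations(range(n_layers), depth):
        path = np.eye(d)
        for i in subset:           # apply the chosen layers in network order
            path = layers[i] @ path
        total_norm += np.linalg.norm(path)   # Frobenius norm of this path's map
    print(f"paths through {depth} layers: total norm {total_norm:.4f}")
```

With small weights the shallow paths carry most of the norm; the worry in the second point is about models whose useful computation instead lives in the deep terms, where even individually understandable variables compose into algorithms too long to follow.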
I’m not sure what the end product of automated interpretability is; I think it would be pure speculation to make claims here.