(Quality: Low, only read when you have nothing better to do; also not much citing of prior work)
30-minute high-LLM-temp stream-of-consciousness on “How do we make mechanistic interpretability work for non-transformers, or for arbitrary architectures in general?”
We want a general way to reverse engineer circuits
e.g., we should be able to rediscover the properties we have already discovered in transformers
Concrete Example: we spend a bunch of effort reverse-engineering transformer-type architectures, and then, boom, some parallel-GPU-friendly LSTM architecture suddenly turns out to have better scaling properties, and everyone starts using it. LSTMs have different inductive biases, e.g., components in the same layer can communicate with each other multiple times across timesteps (unlike in a transformer), which incentivizes things like reusing components (more search-y?).
Formalize:
You have task X. You train a model A with inductive bias I_A. You also train a model B with inductive bias I_B. Your mechanistic interpretability techniques work well on deciphering A, but not B. You want your mechanistic interpretability techniques to work well for B, too.
Proposal: Communication channel
Train a Transformer on task X
Existing mechanistic interpretability work does well at interpreting this architecture
Somehow stitch the LSTM to the transformer (?)
I’m trying to get at the idea of “interface conversion”: by virtue of SGD being greedy, the transformer will try to convert the LSTM’s outputs into transformer-friendly types
Now you can better understand the intermediate outputs of the LSTM by just running mechanistic interpretability on the transformer layers whose inputs come from the LSTM
(I don’t know if I’m making any sense here, my LLM temp is > 1)
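A minimal PyTorch sketch of what the stitching could look like, assuming we freeze the LSTM and let a learned linear adapter do the “interface conversion” into a transformer trained on the same task; all module choices, dimensions, and names here are hypothetical:

```python
import torch
import torch.nn as nn

class StitchedModel(nn.Module):
    """Frozen LSTM -> learned adapter -> transformer trained on task X."""
    def __init__(self, lstm: nn.Module, transformer: nn.Module,
                 lstm_dim: int = 512, tf_dim: int = 768):
        super().__init__()
        self.lstm = lstm                        # the model we actually want to interpret
        for p in self.lstm.parameters():        # freeze it so its computation is untouched
            p.requires_grad = False
        self.adapter = nn.Linear(lstm_dim, tf_dim)  # learned "interface conversion"
        self.transformer = transformer          # architecture our interp tools handle

    def forward(self, x_emb: torch.Tensor) -> torch.Tensor:
        # x_emb: (batch, seq, lstm_dim) token embeddings
        lstm_states, _ = self.lstm(x_emb)       # intermediate outputs of the LSTM
        stitched = self.adapter(lstm_states)    # project into transformer-friendly space
        return self.transformer(stitched)       # interpret these downstream layers

# Hypothetical instantiation (dims and layer counts made up):
lstm = nn.LSTM(input_size=512, hidden_size=512, num_layers=2, batch_first=True)
tf_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = StitchedModel(lstm, nn.TransformerEncoder(tf_layer, num_layers=4))
```

Freezing the LSTM is a deliberate choice: its computation stays exactly the thing we want to understand, and only the adapter plus the transformer get to adapt to it.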
Proposal: approximation via large models?
Train a larger transformer to approximate the smaller LSTM model (either on just input-output pairs, or on intermediate features, or on intermediate features across multiple time-steps, etc.):
the basic idea is that a smaller model would be more subject to following its natural gradient shaped by its inductive bias, while a larger model (with direct access to the intermediate outputs of the smaller model) would be able to approximate it despite not having as much of an inductive-bias incentive towards it.
probably false but illustrative example: Train a small LSTM on chess. By virtue of being able to run serial computation within the same layer, it focuses on algorithms that have repeating modular parts. In contrast, a small transformer would learn algorithms that don’t have such repeating modular parts. Instead, train a large transformer to “approximate” the small LSTM; it should be able to do so by, e.g., inefficiently implementing identical modules across multiple layers. Now use mechanistic interpretability on that.
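A hedged sketch of what the corresponding training objective might look like, assuming we distill both the LSTM’s output logits and a (projected) intermediate state into the larger transformer; the loss weighting and the choice of which states to match are guesses:

```python
import torch.nn as nn
import torch.nn.functional as F

def approximation_loss(student_logits, student_hidden, teacher_logits, teacher_hidden,
                       proj: nn.Linear, alpha: float = 0.5):
    """student_* come from the large transformer, teacher_* from the small LSTM (detached)."""
    # Match input-output behaviour: soft distillation on the logits.
    behaviour = F.kl_div(F.log_softmax(student_logits, dim=-1),
                         F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    # Match intermediate features (through a learned projection), so the transformer
    # has to internalize something about the LSTM's algorithm, not just its outputs.
    features = F.mse_loss(proj(student_hidden), teacher_hidden)
    return alpha * behaviour + (1.0 - alpha) * features
```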
Proposal: redirect GPS?
Thane’s value formation picture says the GPS (general-purpose search) should be incentivized to reverse-engineer the heuristics because it has access to the inter-heuristic communication channel. Maybe, in the middle of training, gradually swap different parts of the model for parts with different inductive biases, watch the GPS gradually learn to reverse-engineer those, mechanistically interpret how exactly the GPS does that, and reimplement it in human code?
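A toy sketch of the “gradually swap parts mid-training” step, assuming the original block and the replacement block share a tensor-in/tensor-out interface and we anneal a mixing coefficient over training; the schedule and block granularity are arbitrary:

```python
import torch.nn as nn

class GradualSwap(nn.Module):
    """Blend an original block with a replacement block that has different inductive biases."""
    def __init__(self, old_block: nn.Module, new_block: nn.Module):
        super().__init__()
        # Both blocks are assumed to share a tensor-in/tensor-out interface
        # (e.g., an LSTM wrapped so it returns only its output sequence).
        self.old_block, self.new_block = old_block, new_block
        self.mix = 0.0   # 0.0 = all old block, 1.0 = all new block

    def update_schedule(self, frac_of_training: float):
        # Arbitrary schedule: start swapping halfway through training, finish at the end.
        self.mix = min(1.0, max(0.0, 2.0 * (frac_of_training - 0.5)))

    def forward(self, x):
        return (1.0 - self.mix) * self.old_block(x) + self.mix * self.new_block(x)
```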
Proposal: Interpretability techniques based on behavioral constraints
e.g., Discovering Latent Knowledge in Language Models Without Supervision, which finds features by imposing consistency constraints on the model’s behavior rather than by relying on architecture-specific circuits?
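A sketch of the CCS-style loss from that paper, as one example of an interpretability technique defined by behavioral constraints rather than by architecture-specific circuits; the probe architecture and how hidden states are extracted are my assumptions, while the two loss terms follow the paper:

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    """Maps a hidden state to a probability that the associated statement is true."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)

def ccs_loss(probe: Probe, h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """h_pos / h_neg: hidden states for a statement and for its negation."""
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    # Consistency: P(statement) should roughly equal 1 - P(negation).
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: push probabilities away from the degenerate 0.5-everywhere solution.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    return consistency + confidence
```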
How do we “back out” inductive biases, given just, e.g., the architecture and training setup? What is the type signature?
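One hypothetical answer to the type-signature question, written as Python type stubs; every type here is a placeholder for an object we don’t actually know how to represent:

```python
from typing import Callable

class Architecture: ...     # e.g., transformer vs. LSTM, depth, width, attention pattern
class TrainingSetup: ...    # optimizer, data distribution, regularization, curriculum
class InductiveBias: ...    # unclear what this even is: a prior over circuits? a simplicity measure?

# One guess at the type signature of "backing out" inductive biases:
InferInductiveBias = Callable[[Architecture, TrainingSetup], InductiveBias]
```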
(I need to read more literature)