Many people in interpretability currently seem interested in ideas like enumerative safety, where you describe every part of a neural network to ensure all the parts look safe. Those people often also talk about a fundamental trade-off in interpretability between the completeness and precision of an explanation for a neural network’s behavior and its description length.
I feel like, at the moment, these sorts of considerations are all premature and beside the point.
I don’t understand how GPT-4 can talk. Not in the sense that I don’t have an accurate, human-intuitive description of every part of GPT-4 that contributes to it talking well. My confusion is more fundamental than that. I don’t understand how GPT-4 can talk the way a 17th-century scholar wouldn’t understand how a Toyota Corolla can move. I have no gears-level model for how anything like this could be done at all. I don’t want a description of every single plate and cable in a Toyota Corolla, and I’m not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.
When doing bottom-up interpretability, it’s pretty unclear if you can answer questions like “how does GPT-4 talk?” without being able to explain arbitrary parts to a high degree of accuracy.
I agree that top-down interpretability trying to answer more basic questions seems good. (And generally I think top-down interpretability looks more promising than bottom-up interpretability at current margins.)
(By interpretability, I mean work aimed at having humans understand the algorithm/approach the model uses to solve tasks. I don’t mean literally any work which involves using the internals of the model in some non-basic way.)
I have no gears-level model for how anything like this could be done at all. [...] What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself.
It’s not obvious to me that what you seem to want exists. I think the way LLMs work might not be well described as having key internal gears, or as having an at-all-illuminating Python code sketch.
(I’d guess something sorta close to what you seem to be describing, but ultimately disappointing and mostly unilluminating, exists. And something tremendously complex, but ultimately pretty illuminating if you fully understood it, might exist.)
What motivates your believing that?
I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular, I can imagine that my understanding of how GPT-4 can talk might be satisfied by understanding the principles by which it talks, without necessarily being able to write a talking machine from scratch. Maybe what I’d be after, in terms of what I can build, is a talking machine of a certain toyish flavor: a machine that can talk in a synthetic/toy language. The full complexity of its current ability seems to have too much structure to be constructed from first principles. Though of course one doesn’t know until our understanding is more complete.
Interesting question. I’d suggest starting by doing interpretability on some of the TinyStories models and corpus: they have models with as few as 1–2 layers, 64-or-more-dimensional embeddings, and only millions of parameters that can talk (childish) English. That sounds like the sort of thing that might actually be enumerable, with enough work. I think that might be a great testing ground for current ideas in interpretability: large enough to not be a toy model, but small enough to hopefully be tractable.
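(For concreteness, here is a minimal sketch of how one might load and sample from one of these models with the HuggingFace transformers library; the specific checkpoint name and generation settings are assumptions for illustration, not something fixed above.)

```python
# Minimal sketch: load a small TinyStories model and sample from it.
# The repo id below is an assumption for illustration; substitute whichever
# TinyStories checkpoint you actually want to study.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "roneneldan/TinyStories-1Layer-21M"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Once upon a time, there was a"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```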
The TinyStories task seems quite simple, in the sense that I can see how you could reach TinyStories levels of loss by following simple rules plus a bunch of memorization.
Empirically, one of the best models in the TinyStories paper is a super-wide 1L transformer, which is basically bigrams, trigrams, and slightly more complicated variants [see Buck’s post], but nothing that requires a step of reasoning.
I am actually quite uncertain where the significant gap between TinyStories, GPT-2, and GPT-4 lies. Maybe I could fully understand TinyStories-1L if I tried, but would this tell us anything about GPT-4? I feel like the result for TinyStories will be a bunch of heuristics.
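(To illustrate what “bigrams, trigrams, and slightly more complicated variants” cashes out to as declarative code, here is a toy back-off n-gram sampler; the tables are tiny hypothetical stand-ins for counts you would actually extract from the TinyStories corpus.)

```python
import random

# Toy sketch of an n-gram "talking program": back off from a trigram table to a
# bigram table. The tables here are tiny and hypothetical; a real one would be
# built by counting over the TinyStories corpus.
trigram = {
    ("once", "upon"): {"a": 1.0},
    ("upon", "a"): {"time": 0.9, "hill": 0.1},
}
bigram = {
    "time": {",": 0.8, ".": 0.2},
    ",": {"there": 1.0},
    "there": {"was": 1.0},
    "was": {"a": 1.0},
    "a": {"girl": 0.5, "dog": 0.5},
    "girl": {"who": 1.0},
    "who": {"loved": 1.0},
    "loved": {"to": 1.0},
    "to": {"play": 1.0},
}

def next_token(context):
    # Prefer the trigram table; fall back to bigrams, then a dummy default.
    dist = trigram.get(tuple(context[-2:])) or bigram.get(context[-1], {"the": 1.0})
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights)[0]

tokens = ["once", "upon"]
for _ in range(12):
    tokens.append(next_token(tokens))
print(" ".join(tokens))
```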
Is that TinyStories model a super-wide attention-only transformer (the topic of the mechanistic interp work and Buck’s post you cite)? I tried to figure it out briefly and couldn’t tell, but I bet it isn’t, and instead has extra stuff like an MLP block.
Regardless, in my view it would be a big advance to really understand how the TinyStories models work. Maybe they are “a bunch of heuristics” but maybe that’s all GPT-4, and our own minds, are as well…
That model has an attention block and an MLP block (a GPT-2-style model with 1 layer but a bit wider, 21M params).
I changed my mind over the course of this morning. The TinyStories models’ language isn’t that bad, and I think it’d be a decent research project to try to fully understand one of these.
I’ve been playing around with the models this morning, quotes from the 1-layer model:
Once upon a time, there was a lovely girl called Chloe. She loved to go for a walk every morning and one day she came across a road.
One day, she decided she wanted to go for a ride. She jumped up and down, and as she jumped into the horn, shouting whatever makes you feel like.
When Chloe was flying in the sky, she saw some big white smoke above her. She was so amazed and decided to fly down and take a closer look. When Chloe got to the edge of a park, there was a firework show. The girl smiled and said “Oh, hello, there. I will make sure to finish in my flying body before it gets too cold,” it said.
So Chloe flew to the park again, with a very persistent look at the white horn. She was very proud of her creation and was thankful for being so brave. Summary: Chloe, a persistent girl, explores the park with the help of a firework sparkle and is shown how brave the firework can be persistent.
and
Once upon a time, there lived a young boy. His name was Caleb. He loved to learn new things and gain healthy by playing outside.
One day, Caleb was in the garden and he started eating an onion. He was struggling to find enough food to eat, but he couldn’t find anything.
Just then, Caleb appeared with a magical lake. The young boy told Caleb he could help him find his way home if he ate the onion. Caleb was so excited to find the garden had become narrow enough for Caleb to get his wish.
Caleb thought about what the pepper was thinking. He then decided to try and find a safer way to play with them next time. From then on, Caleb became healthier and could eat sweets and sweets in the house.
With the peppers, Caleb ate delicious pepper and could be heard by again. He was really proud of himself and soon enough he was playing in the garden again.
This feels like the kind of inconsistency I expect from a model that has only one layer. It can recall that the story was about flying and stuff, and the names, but it feels a bit like the model doesn’t remember what it said a paragraph before.
2-layer model:
Once upon a time, there was a lazy bear. He lived in a tall village surrounded by thick trees and lonely rivers.
The bear wanted to explore the far side of the mountain, so he asked a kind bird if he wanted to come. The bird said, “Yes, but first let me seat in my big tree. Follow me!”
The bear was excited and followed the bird. They soon arrived at a beautiful mountain. The mountain was rich with juicy, delicious fruit. The bear was so happy and thanked the bird for his help. They both shared the fruit and had a great time.
The bear said goodbye to the bird and returned to his big tree, feeling very happy and content. From then on, the bear went for food every day and could often seat in his tall tree by the river. Summary: A lazy bear ventures on a mountain and finds a kind bird who helps him find food on his travels. The bear is happy and content with the food and a delicious dessert.
and
Once upon a time, there were two best friends, a gingerbread fox and a gingerbread wolf. Everyone loved the treats and had a great time together, playing games and eating the treats.
The gingerbread fox spoke up and said: “Let’s be like buying a house for something else!” But the ginger suggested that they go to the market instead. The friends agreed and they both went to the market.
Back home, the gingerbread fox was happy to have shared the treats with the friends. They all ate the treats with the chocolates, ran around and giggled together. The gingerbread fox thought this was the perfect idea, and every day the friends ate their treats and laughed together.
The friends were very happy and enjoyed every single morsel of it. No one else was enjoying the fun and laughter that followed. And every day, the friends continued to discuss different things and discover new new things to imagine. Summary: Two best friends, gingerbread and chocolate, go to the market to buy treats but end up only buying a small house for a treat each, which they enjoy doing together.
I think if we can fully understand (in the Python code sense, probably with a bunch of lookup tables) how these models work, this will give us some insight into where we’re at with interpretability. Do the explanations feel sufficiently compressed? Does it feel like there’s a simpler explanation than the code & tables we’ve written?
Edit: Specifically I’m thinking of (with a rough sketch of the SAE step below this list):
Train SAEs on all layers
Use this for Attention QK circuits (and transform the OV circuit into the SAE basis, or Transcoder basis)
Use Transcoders for MLPs
(Transcoders vs SAEs are somewhat redundant / different approaches; figure out how to connect everything together)
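(A rough sketch of what the “train SAEs on all layers” step could look like, assuming you have cached residual-stream activations from one layer; the architecture and hyperparameters here are illustrative assumptions, not a recommendation from the thread.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over residual-stream activations (illustrative)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

def train_step(sae, acts, optimizer, l1_coeff=1e-3):
    # Reconstruction loss plus an L1 penalty to encourage sparse features.
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: `acts` would be a batch of cached activations from one layer;
# the random tensor below is just a placeholder.
sae = SparseAutoencoder(d_model=1024, d_dict=8 * 1024)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(256, 1024)
print(train_step(sae, acts, opt))
```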
Yup: the 1L model samples are full of non-sequiturs, to the level that I can’t imagine a human child telling a story that badly; whereas the first 2L model example has maybe one non-sequitur/plot jump (the way the story ignores the content of the bird’s first line of dialog), which the rest of the story then works in, so it ends up almost making sense in retrospect (except it would have made better sense if the bear had said that line). The second example has a few non-sequiturs, but they’re again not glaring and continuous the way the 1L output is. (As a parent) I can imagine a rather small human child telling a story with about the 2L level of plot inconsistencies.
From rereading the TinyStories paper, the 1L model did a really bad job of maintaining the internal consistency of the story and figuring out and allowing for the logical consequences of events, but otherwise did a passably good job of speaking coherent childish English. So the choice of transformer block count would depend on how interested you are in learning how to speak English that is coherent as well as grammatical. Personally I’d probably want to look at something in the 3–4-layer range, so it has an input layer, an output layer, and at least one middle layer, and might actually contain some small circuits.
I would LOVE to have an automated way of converting a TinyStories-size transformer to some form of declarative-language spaghetti code. It would probably help to start with a heavily quantized version. For example, a model trained using the techniques of the recent paper on building AI using trinary logic (so roughly a 1.6-bit quantization, eliminating matrix multiplication entirely) might be a good place to start, combined with the sort of techniques the model-pruning folks have been working on for figuring out which model-internal interactions are important on the training set and which are just noise and can be discarded.
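(One plausible reading of the trinary-quantization idea, as a sketch: snap each weight to {-1, 0, +1} with a per-tensor scale, roughly in the spirit of the ~1.6-bit work mentioned above; the exact rounding rule below is a simplifying assumption.)

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} times a per-tensor scale.

    Simplified sketch: scale by the mean absolute weight, then round and clip.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

# A ternary linear layer then needs no real multiplications: for each output,
# it just adds the inputs where w_q == +1 and subtracts them where w_q == -1.
w = torch.randn(4, 8)
w_q, scale = ternary_quantize(w)
x = torch.randn(8)
y = scale * (x @ w_q.t())  # equivalent to additions/subtractions of x entries
print(w_q)
print(y)
```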
I strongly suspect that every transformer model is just a vast pile of heuristics. In certain cases, if it’s trained on a situation that genuinely is simple and has a specific algorithm that can solve it within a model forward pass (like modular arithmetic, for example), and with enough data to grok it, then the resulting heuristic may actually be an elegant True Name algorithm for the problem. Otherwise, it’s just going to be a pile of heuristics that SGD found and tuned. Fortunately SGD (for reasons that singular learning theory illuminates) has a simplicity bias that gives a prior which acts like Occam’s Razor or a Kolmogorov complexity prior, so it tends to prefer algorithms that generalize well (especially as the amount of data tends to infinity, hence grokking), but obviously finding True Names isn’t guaranteed.
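(As a concrete instance of the “elegant True Name algorithm” case: networks that grok modular addition have been reported to implement something like the following trig-identity construction. The modulus and frequencies below are illustrative choices, not a claim about any particular trained model.)

```python
import math

P = 113                      # modulus (illustrative)
FREQS = [1, 2, 5, 7, 11]     # a handful of frequencies (illustrative)

def mod_add_via_trig(a: int, b: int) -> int:
    """Compute (a + b) % P roughly the way a grokked network has been found to.

    For each frequency k, cos/sin of a and b are combined via the angle-addition
    identity into cos/sin of (a + b); the logit for candidate c is the sum over
    frequencies of cos(2*pi*k*(a+b-c)/P), which is maximized at c = (a + b) % P.
    """
    logits = []
    for c in range(P):
        score = 0.0
        for k in FREQS:
            wa, wb, wc = (2 * math.pi * k * t / P for t in (a, b, c))
            # Angle-addition identities applied to the inputs:
            cos_ab = math.cos(wa) * math.cos(wb) - math.sin(wa) * math.sin(wb)
            sin_ab = math.sin(wa) * math.cos(wb) + math.cos(wa) * math.sin(wb)
            # cos((wa + wb) - wc):
            score += cos_ab * math.cos(wc) + sin_ab * math.sin(wc)
        logits.append(score)
    return max(range(P), key=lambda c: logits[c])

assert mod_add_via_trig(47, 90) == (47 + 90) % P
```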
What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.
How would you operationalize this in ML terms? E.g. how much loss in performance would you consider acceptable, on how wide a distribution of e.g. GPT-4’s capabilities, how many lines of Python code, etc.? Would you consider existing rough theoretical explanations acceptable, e.g. An Information-Theoretic Analysis of In-Context Learning? (I suspect not, since they don’t come with a feasible ‘sketch of Python code’.)
(I’ll note that by default I’m highly skeptical of any current-day-human producing anything like a comprehensible, not-extremely-long ‘sketch of Python code’ of GPT-4 in a reasonable amount of time. For comparison, how hopeful would you be of producing the same for a smart human’s brain? And on some dimensions—e.g. knowledge—GPT-4 is vastly superhuman.)
I think OP just wanted some declarative code (I don’t think Python is the ideal choice of language, but basically anything that’s not a Turing tarpit is fine) that could speak fairly coherent English. I suspect that if you had a functional transformer decompiler, the result of applying it to a TinyStories-size model would be tens to hundreds of megabytes of spaghetti, so understanding that in detail is going to be a huge slog, but on the other hand, this is an actual operationalization of the Chinese Room argument (or in this case, English Room)! I agree it would be fascinating, if we can recover a significant fraction of the model’s perplexity score. If it is, as people seem to suspect, mostly or entirely a pile of spaghetti, understanding even a representative (frequency-of-importance biased) statistical sample of it (say, enough for generating a few specific sentences) would still be fascinating.
I don’t want a description of every single plate and cable in a Toyota Corolla, I’m not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engines.
This is the wrong ‘length’. The right version of brute-force length is not “every weight and bias in the network” but “the program trace of running the network on every datapoint in pretraining”. Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.
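(A back-of-the-envelope illustration of the gap between the two notions of length, using purely hypothetical numbers:)

```python
# Hypothetical scale only: a 1e9-parameter model pretrained on 1e11 tokens.
n_params = 1e9
n_tokens = 1e11

weights_length = n_params               # "every weight and bias": ~1e9 numbers
trace_length = 2 * n_params * n_tokens  # rough forward-pass multiply-adds over all of pretraining

print(f"weights: ~{weights_length:.0e} numbers, trace: ~{trace_length:.0e} operations")
# Understanding corresponds to compressing the *trace*, not just shipping the source code.
```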
Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.