Interpretability on an LLM might, for example, tell me a great deal about the statistical relationships between the word “human” and the word “dog” in various contexts. And the “trivial” sense in which this tells me about physical stuff is that the texts in question are embedded in the world—they’re characters on my screen, for instance.
The problem is that I don’t care that much about the characters on my screen in and of themselves. I mostly care about the characters on my screen insofar as they tell me about other things, like e.g. physical humans and dogs.
So, say I’m doing interpretability work on an LLM, and I find some statistical pattern between the word “human” and the word “dog”. (Flag: this is oversimplified compared to actual interp.) What does that pattern tell me about physical humans and physical dogs, the things I actually care about? How does that pattern even relate to physical humans and physical dogs? Well shit, that’s a whole very complicated question in its own right.
On the other hand, if I’m doing interp work on an image generator… I’m forced to start lower-level, so by the time I’m working with things like humans and dogs I’ve already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs (which is itself probably useful, that’s exactly the sort of thing I want to learn about). Then I find some relationship between the human parts of the image and the dog parts of the image, and insofar as the generator was trained on real-world images, that much more directly tells me about physical humans and dogs and how they relate statistically (like e.g. where they’re likely to be located relative to each other).
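(To make the contrast concrete, here’s a toy sketch — purely illustrative, not real interp tooling, and with made-up inline data — of the two kinds of statistics I’m gesturing at: co-occurrence of the words “human” and “dog” in text, versus where dog-regions sit relative to human-regions in labeled images.)

```python
# Toy illustration only: not real interp tooling, and all data below is made up
# inline so the sketch is self-contained.

from collections import Counter

# (a) Text-side statistic: how often the *words* "human" and "dog" co-occur
# within a small window. This is a fact about strings.
corpus = [
    "the human walked the dog to the park",
    "a dog barked at the human",
    "the human read a book about physics",
]
window = 4
pair_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            pair_counts[frozenset((tok, other))] += 1
print("co-occurrences of 'human' and 'dog':", pair_counts[frozenset(("human", "dog"))])

# (b) Image-side statistic: where dog-regions sit relative to human-regions in
# (hypothetical) labeled images. Each "image" here is just a dict mapping an
# object label to its centroid in pixels. This is a fact about physical scenes.
images = [
    {"human": (120, 200), "dog": (140, 230)},
    {"human": (300, 180), "dog": (310, 215)},
    {"human": (80, 150)},  # no dog in this one
]
offsets = [
    (img["dog"][0] - img["human"][0], img["dog"][1] - img["human"][1])
    for img in images
    if "human" in img and "dog" in img
]
mean_dx = sum(dx for dx, _ in offsets) / len(offsets)
mean_dy = sum(dy for _, dy in offsets) / len(offsets)
print("mean dog offset from human centroid (px):", (mean_dx, mean_dy))
```

The second number is directly a claim about physical scenes (where dogs tend to be relative to humans); the first is a claim about strings, and the map from string statistics to facts about physical humans and dogs is exactly the complicated part.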
But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word “human” and the word “dog” as characters on your screen? For most of what interp is currently trying to do, it seems to me that the underlying domain the network learns to model doesn’t matter that much. I wouldn’t want to make up some completely fake domain from scratch, because the data distribution of my fake domain might qualitatively differ from the sorts of domains a serious AI would need to model. And then maybe the network we get works genuinely differently than networks that model real domains, so our research results don’t transfer. But image generation and internet token prediction both seem very entangled with the structure of the universe, so I’d expect them both to have the right sort of high level structure and yield results that mostly transfer.
On the other hand, if I’m doing interp work on an image generator… I’m forced to start lower-level, so by the time I’m working with things like humans and dogs I’ve already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs
For this specific purpose, I agree with you that language models seem less suitable. And if this is what you’re trying to tackle directly at the moment, I can see why you would want to use domains like image generation and fluid simulation for that, rather than internet text prediction. But I think there are good attack angles on important problems in interp that don’t route through investigating this sort of question as one of the first steps.
But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word “human” and the word “dog” as characters on your screen?
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I think the point you’re trying to make is roughly “well, it’s all pretty entangled with the physical stuff anyway, so why favor one medium or another? Instrumentally, either suffices.” And the point I’m trying to make in response is “it matters a lot how complicated the relationship is between the medium and the physical stuff, because terminally it’s the physical stuff we care about, so instrumentally, stuff that’s more simply related to the physical stuff is a lot more useful to understand.”
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I don’t understand this argument. Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at. We’re more at the stage of asking questions like “Is the learning coefficient of a network composed of n independent superposed circuits equal to the learning coefficients of the individual n circuits summed, or greater?”
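(Just to write that down a bit more precisely, in my own rough notation: let $\lambda(\cdot)$ denote the learning coefficient, let $f_{\text{superposed}}$ be the network implementing all $n$ circuits in superposition, and let $\lambda_i$ be the learning coefficient the $i$-th circuit would have as a standalone network. The question is whether

$$\lambda(f_{\text{superposed}}) \;=\; \sum_{i=1}^{n} \lambda_i \qquad \text{or} \qquad \lambda(f_{\text{superposed}}) \;>\; \sum_{i=1}^{n} \lambda_i.)$$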
The laws of why and when neural networks learn to be modular, why and when they learn to do activation error-correction, what the locality of updating algorithms prevents them from learning, how they do inductive inference in-context, how their low-level algorithms correspond to something we would recognise as cognition, and so on, are presumably fairly general, and look more or less the same whether or not the network is trained on a domain that relates very directly to physical stuff.
Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long ways to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)