The thing I ultimately care about is patterns in our physical world, like trees or humans or painted rocks. I am interested in patterns in speech/text (like e.g. bigram distributions) basically only insofar as they tell me something useful about patterns in the physical world. I am also interested in patterns in pixels only insofar as they tell me something useful about patterns in the physical world. But it’s a lot easier to go from “pattern in pixels” to “pattern in physical world” than from “pattern in tokens” to “pattern in physical world”. (Excluding the trivial sense in which tokens are embedded in the physical world and therefore any pattern in tokens is a pattern in the physical world; that’s not what we’re talking about here.)
That’s the sense in which pixels are “closer to the metal”, and why I care about that property.
Does that make sense?
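(To pin down what a “pattern in tokens” looks like at its simplest, here’s a minimal sketch of a bigram count over a toy corpus; the corpus, and therefore the numbers, are invented purely for illustration.)

```python
from collections import Counter

# Toy corpus, invented for illustration.
text = "the dog chased the cat and the dog barked"
tokens = text.split()

# A bigram distribution: relative frequencies of adjacent token pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(bigrams.values())
for (a, b), count in bigrams.most_common(3):
    print(f"bigram ({a!r}, {b!r}): {count}/{total} = {count / total:.2f}")

# This is purely a pattern *in the tokens*; whether it says anything useful
# about physical dogs and cats is a separate question.
```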
Excluding the trivial sense in which tokens are embedded in the physical world and therefore any pattern in tokens is a pattern in the physical world; that’s not what we’re talking about here.
I suppose my confusion might be related to this part. Why are tokens embedded in the physical world only in a “trivial” sense? I don’t really see how the laws and heuristics of predicting internet text are in a different category from the laws of creating images of cars for the purposes we care about when doing interpretability.
I guess you could tell a story where looking into a network that does internet next-token prediction shows you two kinds of thing at once: the network’s own high-level concepts, built on top of high-level statistical patterns of tokens, and human high-level concepts showing up as the low-level tokens and words themselves. An interpretability researcher who is not thinking carefully might then risk confusing themselves by mixing the two up. But while that story may sound sort of plausible when described in the abstract like that, it doesn’t actually ring true to me. For the kind of work most people are doing in interpretability right now, I can’t come up with a concrete instantiation of this abstract failure-mode class that I’d actually be concerned about. So, at the moment, I’m not paying this much mind compared to other constraints when picking which models to look at.
Does the above sound like I’m at least arguing with your thesis, or am I guessing wrong on what class of failure modes you are even worried about?
Interpretability on an LLM might, for example, tell me a great deal about the statistical relationships between the word “human” and the word “dog” in various contexts. And the “trivial” sense in which this tells me about physical stuff is that the texts in question are embedded in the world—they’re characters on my screen, for instance.
The problem is that I don’t care that much about the characters on my screen in-and-of themselves. I mostly care about the characters on my screen insofar as they tell me about other things, like e.g. physical humans and dogs.
So, say I’m doing interpretability work on an LLM, and I find some statistical pattern between the word “human” and the word “dog”. (Flag: this is oversimplified compared to actual interp.) What does that pattern tell me about physical humans and physical dogs, the things I actually care about? How does that pattern even relate to physical humans and physical dogs? Well shit, that’s a whole very complicated question in its own right.
On the other hand, if I’m doing interp work on an image generator… I’m forced to start lower-level, so by the time I’m working with things like humans and dogs I’ve already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs (which is itself probably useful, that’s exactly the sort of thing I want to learn about). Then I find some relationship between the human parts of the image and the dog parts of the image, and insofar as the generator was trained on real-world images, that much more directly tells me about physical humans and dogs and how they relate statistically (like e.g. where they’re likely to be located relative to each other).
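(To make the contrast concrete, here’s a minimal toy sketch; the sentences, scenes, and coordinates below are invented for illustration and are not real interp output. The token-level statistic is a fact about words that still needs a further story connecting it to the world, while the image-level statistic already reads as a claim about where physical dogs tend to sit relative to physical humans.)

```python
# --- Token-level pattern: co-occurrence of the *words* "human" and "dog" ---
# (toy corpus, invented for illustration)
sentences = [
    "the human walked the dog in the park",
    "a dog greeted the human at the door",
    "the human read a book",
    "the dog chased a ball",
]
co_occurs = sum(1 for s in sentences if {"human", "dog"} <= set(s.split()))
print(f"'human' and 'dog' co-occur in {co_occurs}/{len(sentences)} sentences")
# Relating this to *physical* humans and dogs requires a further story about
# how people happen to write about the world.

# --- Image-level pattern: where the dog sits relative to the human in a scene ---
# (toy bounding-box centers in image coordinates, invented for illustration)
scenes = [
    {"human": (120, 80), "dog": (150, 160)},
    {"human": (200, 90), "dog": (230, 170)},
    {"human": (60, 100), "dog": (95, 175)},
]
dx = sum(s["dog"][0] - s["human"][0] for s in scenes) / len(scenes)
dy = sum(s["dog"][1] - s["human"][1] for s in scenes) / len(scenes)
print(f"on average the dog appears ({dx:+.0f}, {dy:+.0f}) pixels from the human")
# Insofar as the images come from real scenes, this statistic is already
# (roughly, up to camera geometry) a claim about physical dogs and humans.
```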
But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word “human” and the word “dog” as characters on your screen? For most of what interp is currently trying to do, it seems to me that the underlying domain the network learns to model doesn’t matter that much. I wouldn’t want to make up some completely fake domain from scratch, because the data distribution of my fake domain might qualitatively differ from the sorts of domains a serious AI would need to model. And then maybe the network we get works genuinely differently from networks that model real domains, so our research results don’t transfer. But image generation and internet token prediction both seem very entangled with the structure of the universe, so I’d expect them both to have the right sort of high-level structure and yield results that mostly transfer.
On the other hand, if I’m doing interp work on an image generator… I’m forced to start lower-level, so by the time I’m working with things like humans and dogs I’ve already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs
For this specific purpose, I agree with you that language models seem less suitable. And if this is what you’re trying to tackle directly at the moment, I can see why you would want to use domains like image generation and fluid simulation for that, rather than internet text prediction. But I think there are good attack angles on important problems in interp that don’t route through investigating this sort of question as one of the first steps.
But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word “human” and the word “dog” as characters on your screen?
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I think the point you’re trying to make is roughly: “well, it’s all pretty entangled with the physical stuff anyway, so why favor one medium or another? Instrumentally, either suffices.” And the point I’m trying to make in response is: “it matters a lot how complicated the relationship is between the medium and the physical stuff, because terminally it’s the physical stuff we care about, so instrumentally, stuff that’s more simply related to the physical stuff is a lot more useful to understand.”
An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string “kill all the humans”. My terminal-ish values are mostly over the physical stuff.
I don’t understand this argument. Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at. We’re more at the stage of asking questions like “Is the learning coefficient of a network composed of n independent superposed circuits equal to the sum of the learning coefficients of the n individual circuits, or greater?”
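(For concreteness, one way to write that question down, in notation of my own choosing rather than anything from this exchange: let f_1, …, f_n be the independent circuits, let f be a single network computing all of them in superposition, and write λ(·) for the learning coefficient, i.e. the coefficient of the log m term in the asymptotic expected Bayes free energy over m samples, E[F_m] = m L(w*) + λ log m + o(log m). The question is then whether)

$$\lambda(f) \;=\; \sum_{i=1}^{n} \lambda(f_i) \qquad \text{or} \qquad \lambda(f) \;>\; \sum_{i=1}^{n} \lambda(f_i).$$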
The laws of why and when neural networks learn to be modular, why and when they learn to do activation error-correction, what the locality of updating algorithms prevents them from learning, how they do inductive inference in-context, how their low-level algorithms correspond to something we would recognise as cognition, etc. are presumably fairly general and look more or less the same whether the network is trained on a domain that very directly relates to physical stuff or not.
Interpretability is not currently trying to look at AIs to determine whether they will kill us. That’s way too advanced for where we’re at.
Right, and that’s a problem. There’s this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It’s the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).
And I think the focus on LLMs is largely to blame for that gap seeming “way too advanced for where we’re at”. I expect it’s much easier to cross if we focus on image models instead.
(And to be clear, even after crossing the internals/environment gap, there will still be a long ways to go before we’re ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)