for almost all purposes “prompt X ⇒ outcome Y” is all the interpretation we can get.
I’m very confused. Can we not do LLM interpretability to try to figure out whether or where superposition holds? Is it not useful to see how sparse autoencoders (SAEs) help us identify and intervene on specific internal representations that LLMs generate for real-world concepts?
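To be concrete about the kind of work I have in mind, here is a minimal toy sketch of the SAE recipe as I understand it: train a sparse autoencoder on a model’s activations, then “identify” by reading off which features fire and “intervene” by adding a feature’s decoder direction back into an activation. Everything here (the data, dimensions, feature index, and steering scale) is a synthetic placeholder, not anything from a real model:

```python
import torch
import torch.nn as nn

# Toy sparse autoencoder (SAE): decompose "activations" into sparse features,
# then intervene by pushing one feature's direction back into an activation.
# The activations below are random stand-ins for real residual-stream data.

torch.manual_seed(0)
d_model, n_features = 64, 256

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse, nonnegative feature activations
        return self.dec(feats), feats

sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, d_model)         # placeholder "activations"

for _ in range(200):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Identify": see which features fire on a given activation.
_, feats = sae(acts[:1])
top = feats[0].topk(5).indices.tolist()
print("most active features on this input:", top)

# "Intervene": add a chosen feature's decoder direction to the activation
# (activation steering). The feature index and scale are arbitrary here.
feature_idx, scale = top[0], 5.0
direction = sae.dec.weight[:, feature_idx]
steered = acts[0] + scale * direction
print("norm of the edit:", (steered - acts[0]).norm().item())
```

In a real setting the activations would come from a chosen layer of an actual LLM and the feature would be picked by inspecting which inputs activate it, but the shape of the workflow is the same.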
As an outsider to interpretability, I have long had the (rough) understanding that most of the useful work in the field deals precisely with figuring out what is going on inside the model, rather than with how it responds to outside prompts. So I don’t know what the thesis statement refers to...
I guess to clarify:
Everything has an insanely large amount of information. To interpret something, we need to be able to see what “energy” (definitely literal energy, but likely also metaphorical energy) that information relates to, as the energy is more bounded and unified than the information.
But that’s (the thesis goes) hard for LLMs.
On figuring out whether or where superposition holds: not really, because that requires some notion of same vs. distinct features, which is not so interesting when the use of LLMs is so brief.
And on identifying and intervening on specific internal representations: I don’t think that’s so useful, since you’ve often got more direct ways of intervening (e.g. applying gradient updates).
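To spell out what I mean by “more direct”: if you already know which behavior you want to change, you can just take gradient steps against it, with no need to first locate an internal feature. A toy sketch, with a stand-in network and a made-up unwanted output in place of a real LLM and a real behavior:

```python
import torch
import torch.nn as nn

# Direct intervention by gradient update: suppress an unwanted output on a
# given input by minimizing its log-probability. Model, input, and target
# are toy placeholders for an actual LLM, prompt, and behavior.

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(1, 16)   # stand-in for a prompt's representation
bad_output = 3           # stand-in for the behavior we want to suppress

for _ in range(20):
    loss = torch.log_softmax(model(x), dim=-1)[0, bad_output]  # log p(bad_output)
    opt.zero_grad()
    loss.backward()
    opt.step()   # minimizing log p(bad_output) pushes its probability down

print("p(bad_output) after updates:",
      torch.softmax(model(x), dim=-1)[0, bad_output].item())
```

The point is just that the edit is specified directly in terms of input/output behavior, without going through the model’s internals at all.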
I’m sorry, but I still don’t really understand what you mean here. The phrase “the use of LLMs is so brief” is ambiguous to me. Do you mean to say:
1. a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
2. LLM context windows are really small, so you “use” them only for a brief time?
3. the entire LLM paradigm will be replaced by something else soon?
4. something totally different from all of the above?
But isn’t this rather… prosaic and “mundane”?
I thought the idea behind the methods I linked was for them to serve as building blocks for future work on ontology identification and, ultimately, for getting a clearer picture of what is going on internally, which is a crucial part of proposals like Wentworth’s “Retarget the Search” and other research directions like it.
So the fact that SAE-based updates to the model do not currently produce more impressive results than basic fine-tuning matters less than the fact that they work at all, which gives us reason to believe that we might be able to scale them up to useful, strong-interpretability levels. Or, at the very least, that the insights we get from them could help future efforts to get there.
Kind of like how you can teach a dog to sit pretty well just by basic reinforcement, but if you actually had a gears-level understanding of how its brain worked, down to the minute details, and the ability to directly modify the circuits in its mind that represented the concept of “sitting”, then you would be able to do this much more quickly, efficiently, and robustly.
Am I totally off-base here?
Maybe it helps if I start by giving some different applications one might want to use artificial agency for:
As a map: We might want to use the LLM as a map of the world, for instance by prompting it with data from the world and having it assist us in navigating that data. Now, the purpose of a map is to reflect as little information as possible about the world while still providing the bare minimum backbone needed to navigate it.
This doesn’t work well with LLMs because they are instead trained to model information, so they will carry as much information as possible, and any map-making they do will be an accident driven by mimicking the information they have seen from mapmakers, rather than primarily an attempt to eliminate information about the world.
As a controller: We might want to use the LLM to apply small pushes to a chaotic system at times when the system reaches bifurcations where its state is extremely sensitive, so that the system moves in a desirable direction. But again I think LLMs are so busy copying information around that they don’t notice such sensitivities except by accident. (There’s a toy numerical sketch of this kind of sensitivity below, after the coder example.)
As a coder: Since LLMs are so busy outputting information instead of manipulating “energy”, maybe we could hope that they could assemble a big pile of information that we could “energize” in a relevant way, e.g. if they could write a large codebase and we could then execute it on a CPU and have a program that does something interesting in the world. But in order for this to work, the program shouldn’t have obstacles that stop the “energy” dead in its tracks (e.g. bugs that cause it to crash). And again, the LLM isn’t optimizing for that; it’s just trying to copy information around that looks like software, and it only makes space for the energy of the CPU and the program functionality as a side-effect of that. (Or as the old saying goes, it’s maximizing lines of code written, not minimizing lines of code used.)
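Here is the toy numerical sketch of the controller point. The logistic map is just a stand-in for “some chaotic system”; the parameter, initial state, and size of the push are arbitrary choices for illustration:

```python
# Sensitivity of a chaotic system to a tiny, well-timed push.
# The logistic map stands in for an arbitrary chaotic system.

def logistic_trajectory(x0, r=3.9, steps=60):
    """Iterate x -> r * x * (1 - x), returning the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

free = logistic_trajectory(0.2)            # system left alone
pushed = logistic_trajectory(0.2 + 1e-8)   # same system after a tiny push

for t in (10, 30, 50):
    gap = abs(free[t] - pushed[t])
    print(f"t={t:2d}  free={free[t]:.4f}  pushed={pushed[t]:.4f}  gap={gap:.6f}")

# The 1e-8 push gets amplified exponentially; by t around 50 the two
# trajectories are unrelated. That amplification is the leverage a
# controller would exploit, and noticing where it exists is the part
# I claim LLMs only get right by accident.
```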
So, that gives us the thesis: To interpret the LLMs, we’d want to build a map of how they connect to the energy in the world, but they really don’t connect very well, so there’s not much to build a map of. The only thing you could really point out is the (input, output) relationships, but once you’ve characterized concrete (input, output) pairs, there’s not really much more of interest to say.
On what I meant by “the use of LLMs is so brief”: perhaps both the first and the second, but especially the second. As described above, we might hope to use LLMs extensively and recursively to build up a big thing, because then, for interpretability, you could study how to manipulate the contours of that big thing. But that doesn’t really work, so people only use them briefly rather than extensively.
And on “Retarget the Search”: retargeting the search is only interesting if the search is able to do big things in the world, which, according to the thesis, LLMs are not able to do.