Thanks for the post! I agree ~entirely with the descriptive parts (and slightly less with the prescriptive parts). I wonder: what do you think of interpretability work that does try to answer top-down questions with top-down methods (which I call high-level interpretability)? For example, unsupervised ways to interface with goals / search in systems (1, 2, 3), or unsupervised honesty (ELK, CCS, this proposal), etc.
I’ve not come across 1 before, and I don’t think I really understood it from skimming, so I’ll reserve judgment.
Re (2, 3): I think this kind of work is quite closely aligned with the conceptual questions I’m interested in, and I like it! For that reason I actually strongly considered both Adria’s stream and Erik’s stream for MATS 7.0 as well. Ultimately, I chose Owain’s stream for the other reasons described above (personal fit, wanting to try something new, worries about AI systems, etc.), but I would have been pretty happy working on either of these things instead (unfortunately it seems like Erik Jenner does not plan to continue working on the chess interpretability stuff).
I do think it’s a good idea to try probing for as many things as we can think of. If it were possible to train probes on Claude I would want to see if probes can detect alignment faking or in-context scheming. I’m concerned that there are non-obvious disanalogies between the language setting and the board game setting which prevent the findings / approaches from generalising. The only way we’ll know is by constructing similar agentic scenarios and attempting to probe for language model goals.
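For concreteness, the kind of probe I have in mind is just a logistic-regression head trained on cached activations. A minimal sketch, where the layer choice, labels, and tensor names are all placeholders (it assumes you can cache activations for rollouts that do and don’t exhibit the behaviour):

```python
import torch
import torch.nn as nn

def train_linear_probe(acts: torch.Tensor, labels: torch.Tensor,
                       epochs: int = 200, lr: float = 1e-3) -> nn.Linear:
    """Fit a logistic-regression probe on cached activations.

    acts: (n_examples, d_model) residual-stream activations at some layer.
    labels: (n_examples,) 1 if the example came from a rollout exhibiting the
        behaviour of interest (e.g. alignment faking / in-context scheming), else 0.
    """
    probe = nn.Linear(acts.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# Usage, with tensors standing in for real cached activations:
# probe = train_linear_probe(train_acts, train_labels)
# preds = (probe(test_acts).squeeze(-1) > 0).long()
```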
At the same time, I worry there will not exist a clean representation of the goals; maybe such ‘goals’ are deeply embedded / entangled / distributed across model weights in a way that makes them hard to elucidate. Positive results on rather toy-ish setups do not update me much away from this concern.
Also, training probes for things is in itself an attempt to impose your own ontology onto the model, which I am wary of. Models’ ontologies can differ in surprising ways from human ones. Maybe there isn’t even any global structure to map onto.
Food for thought: it’s plausible that ‘goal states’ only exist implicitly as the argmax of V(x). Maybe you don’t need to know where you’re ultimately going, you just need to know (at every step) what the current best action to take is, w.r.t. some objective. Certainly nothing in AIXI precludes this.
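To spell out that last point with a standard RL-style equation (nothing here is specific to any of the setups linked above): an agent can act on

$$\pi(s) \;=\; \operatorname*{arg\,max}_{a}\ \Big[\, r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V(s')\big] \Big]$$

and look thoroughly goal-directed while never explicitly representing a terminal ‘goal state’ anywhere; the ‘goal’ lives implicitly in V and the argmax, so a probe hunting for an explicit goal representation could come up empty even when the behaviour is maximally goal-directed.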
unsupervised ways to interface with goals
I dunno about 1, but I wouldn’t call (2, 3) unsupervised. They require you to first observe the agent executing a specific goal-oriented behaviour (pacing in the case of Sokoban, or solving the puzzle in the case of Leela chess), and then probe its activations post-hoc. An ‘unsupervised’ method preferably wouldn’t need this level of handholding.
I haven’t followed ELK / CCS closely. The work seemed interesting when it was published a while back. I’m pretty curious as to why there have not been more follow-ups. That is some indication to me that this line of work is subtly flawed in some way, at least methodologically (if not fundamentally).
It does seem that language models maintain internal beliefs as to what information is true or false. Perhaps that’s an indication that this can work.
I dunno about 1, but I wouldn’t call (2, 3) unsupervised. They require you to first observe the agent executing a specific goal-oriented behaviour (pacing in the case of Sokoban, or solving the puzzle in the case of Leela chess), and then probe its activations post-hoc. An ‘unsupervised’ method preferably wouldn’t need this level of handholding.
I think the directions in (2, 3) aim at being unsupervised. In both cases, the central idea revolves around something like “there’s this thing we call search that seems dangerous and really important to understand, and we want to figure out how to interface with it”. In principle, this would involve concretizing our intuition for what mechanistically constitutes “search” and interacting with it without requiring behavioural analysis. In practice, we’re nowhere near that and we have to rely on observations to form better hypotheses at all.
At the same time, I worry there will not exist a clean representation of the goals; maybe such ‘goals’ are deeply embedded / entangled / distributed across model weights in a way that makes them hard to elucidate. Positive results on rather toy-ish setups do not update me much away from this concern.
I agree! That’s part of why I don’t favour the bottom-up approach as much. For any definition of “goal”, there could be a large number of ways it’s represented internally. But whatever it is that makes it a “goal” by our reckoning would allow it to be separated out from other parts of the system, if we understood those properties well enough. This is harder since you can’t rely solely on observational data (or you risk indexing too much on specific representations in specific models), but it does have the upside of generalizing if you do it right.
I’m pretty curious as to why there have not been more follow-ups. That is some indication to me that this line of work is subtly flawed in some way, at least methodologically (if not fundamentally).
I think it’s just that it’s really hard. Figuring out the shape of a high-level property like “truth” is really hard; methods that make small-but-legible progress, like CCS, often turn out in the end to not be that impressive at identifying the thing they were aiming for. That said, mechanistic anomaly detection spun out of ARC’s work on ELK, and there has been some progress there. There are also people working on it at Anthropic now.
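Roughly, for anyone who hasn’t looked at it recently: CCS fits a probe on contrast pairs using only a consistency term and a confidence term, with no truth labels anywhere. A minimal sketch of the objective as I understand it from the original paper (names are placeholders, and the activations are assumed to be pre-normalized contrast-pair features):

```python
import torch
import torch.nn as nn

def make_ccs_probe(d_model: int) -> nn.Module:
    # Linear probe with a sigmoid output, as in the original CCS setup.
    return nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

def ccs_loss(probe: nn.Module, x_pos: torch.Tensor, x_neg: torch.Tensor) -> torch.Tensor:
    # x_pos / x_neg: normalized activations for the "yes" / "no" sides of the same
    # contrast pair, each of shape (n_pairs, d_model). No truth labels are used.
    p_pos = probe(x_pos).squeeze(-1)
    p_neg = probe(x_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2   # the two answers should behave like negations
    confidence = torch.min(p_pos, p_neg) ** 2  # push away from the degenerate p = 0.5 solution
    return (consistency + confidence).mean()
```

My read of the later critiques is roughly that plenty of other contrastive features satisfy this objective about as well as truth does, which is one way the “small-but-legible” framing cashes out.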
It does seem that language models maintain internal beliefs as to what information is true or false. Perhaps that’s an indication that this can work.
I agree, though that result wasn’t much of an update for me: any competent model would have to have some notion of what’s true, even if it’s not represented in an easily visible way.
~agree with most takes!
Figuring out the shape of a high-level property like “truth” is really hard
I’m skeptical of this. Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts. And because we train models to be HHH (in particular, ‘honest’), I’d expect HHH-related concepts to have an especially high-quality representation, similar to refusal.
I’m more sympathetic to the claim that models concurrently simulate a bunch of different personas which have inconsistent definitions of subjective truth, and this ‘interference’ is what makes general ELK difficult.
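For concreteness, the kind of linear representation I have in mind is the difference-of-means recipe from the refusal-direction work; a rough sketch, with placeholder names and tensors standing in for real cached activations:

```python
import torch

def difference_of_means_direction(acts_with: torch.Tensor,
                                  acts_without: torch.Tensor) -> torch.Tensor:
    # acts_with / acts_without: residual-stream activations at one layer for prompts
    # that do / don't elicit the concept (harmful vs. harmless requests in the
    # refusal case), each of shape (n_prompts, d_model).
    direction = acts_with.mean(dim=0) - acts_without.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Project the (unit-norm) direction out of a batch of activations; the refusal
    # work applies this across layers/positions to test whether the direction is causal.
    return acts - (acts @ direction).unsqueeze(-1) * direction
```

If honesty is represented as cleanly as refusal, roughly this recipe (or a probe on the same activations) ought to find it.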
Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts.
I don’t think mech interp has been good at robustly identifying representations of high-level abstractions, which is probably a crux. The representation you find could be a combination of things that mostly look like the abstraction in a lot of contexts, or it could be a large-but-not-complete part of the model’s actual representation of that abstraction, and so on. If you did something like optimize using the extracted representation, I expect you’d see it break.
In general it seems like a very common failure mode of some alignment research: you can find ways to capture the first few bits of what you want and have it look really good at first glance, but then find it really hard to get many more bits, which is where a lot of the important parts are (e.g. an interp method that could explain 80% of a model’s performance would seem really promising and get some cool results at first, but most of the gains in current models happen in the ~99% range).
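To pin down what “explain 80% of a model’s performance” usually means operationally: a common choice is fraction of loss recovered, comparing the clean model, a version with the component ablated, and a version with the interpretation patched back in. A tiny sketch with made-up numbers, purely for illustration:

```python
def fraction_of_loss_recovered(clean_loss: float, ablated_loss: float,
                               patched_loss: float) -> float:
    # How much of the gap between "component destroyed" and "component intact"
    # the interpretation manages to recover.
    return (ablated_loss - patched_loss) / (ablated_loss - clean_loss)

# Made-up numbers (nats/token), purely for illustration:
clean, ablated, patched = 3.0, 5.0, 3.4
print(fraction_of_loss_recovered(clean, ablated, patched))  # ~0.8
# "80% recovered" still leaves a 0.4 nat/token gap to the real model;
# that remaining sliver is where the "important parts" above live.
```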