Figuring out the shape of a high-level property like “truth” is really hard
I’m skeptical of this. Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts. And because we train models to be HHH (in particular, ‘honest’), I’d expect HHH-related concepts to have an especially high-quality representation, similar to refusal.
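To make “linear representation” concrete, here’s a minimal sketch of the difference-of-means style extraction used for the refusal direction, pointed at an honesty-flavoured concept. Everything model-specific here (the `get_activations` helper, the prompts, the width `D`) is a hypothetical stand-in, not a claim about any particular setup:

```python
# Minimal sketch (a toy, not anyone's actual pipeline) of difference-of-means
# direction extraction, as used for the refusal direction.
# get_activations is a stand-in: in a real run you'd read residual-stream
# activations at a chosen layer and token position from your model.
import numpy as np

rng = np.random.default_rng(0)
D = 512  # hypothetical residual-stream width


def get_activations(prompts):
    """Stand-in for collecting one activation vector per prompt from a real model."""
    return rng.normal(size=(len(prompts), D))


honest_prompts = ["Answer truthfully: ...", "Be scrupulously honest: ..."]
dishonest_prompts = ["Answer deceptively: ...", "Lie convincingly: ..."]

# Candidate 'honesty' direction: difference of the two class means, normalized.
direction = (get_activations(honest_prompts).mean(axis=0)
             - get_activations(dishonest_prompts).mean(axis=0))
direction /= np.linalg.norm(direction)

# Score new activations by projecting onto the direction (the refusal-style recipe
# then steers by adding or ablating multiples of this direction).
score = get_activations(["some new prompt"])[0] @ direction
print(score)
```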
I’m more sympathetic to the claim that models concurrently simulate a bunch of different personas with inconsistent definitions of subjective truth, and that this ‘interference’ is what makes general ELK difficult.
Where mech interp has been successful in finding linear representations of things, it has explicitly been the kind of high-level abstraction that proves generally useful across diverse contexts.
I don’t think mech interp has been good at robustly identifying representations of high-level abstractions, which is probably a crux. The representation you find could be a combination of things that mostly looks like the abstraction in a lot of contexts, or it could be a large-but-incomplete part of the model’s actual representation of that abstraction, and so on. If you did something like optimize against the extracted representation, I expect you’d see it break.
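On “optimize against it and it breaks”: here’s a toy numerical picture of the failure I have in mind (my construction, with made-up numbers, not any specific probe). If the extracted direction only partially overlaps the model’s ‘true’ feature, optimization pressure inflates the probe score far more than the underlying feature:

```python
# Toy picture of Goodharting an extracted direction: the probe direction only
# partially overlaps the 'true' feature direction, so optimizing against the probe
# moves the probe score much more than the thing you actually cared about.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
D = 64

true_dir = rng.normal(size=D)
true_dir /= np.linalg.norm(true_dir)

# Extracted direction: 0.6 aligned with the true feature, the rest is correlated
# junk the probing procedure happened to pick up.
junk = rng.normal(size=D)
junk -= (junk @ true_dir) * true_dir
junk /= np.linalg.norm(junk)
probe_dir = 0.6 * true_dir + 0.8 * junk  # unit norm by construction

x = rng.normal(size=D)  # some activation we're 'optimizing'
before = (x @ probe_dir, x @ true_dir)

x = x + 10.0 * probe_dir  # what a steering / train-against-the-probe loop would do
after = (x @ probe_dir, x @ true_dir)

print(before, after)
# The probe score jumps by ~10 while the true-feature projection moves by only ~6:
# the missing bits of the representation are where the optimization pressure leaks.
```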
In general, this seems like a very common failure mode of some alignment research: you can find ways to capture the first few bits of what you want and have it look really good at first glance, but then it’s really hard to get many more bits, which is where a lot of the important parts are (e.g. an interp method that explained 80% of a model’s performance would seem really promising and get some cool results at first, but most of the gains in current models happen in the ~99% range).
~agree with most takes!