with the mech interp people where they think we can identify values or other high-level concepts like deception simply by looking at the model’s linear representations bottom-up, where I think that’ll be a highly non-trivial problem.
I’m not sure anyone I know in mech interp is claiming this is a trivial problem.
Yeah, sorry, I should have been more precise. I think it’s so non-trivial that it plausibly contains most of the difficulty in the overall problem, which is a statement I think many people working on mechanistic interpretability would disagree with.