I’m curious if you believe that, even if SAEs aren’t the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human understandable explanation that allow for recovering >75% of the training compute of model components?
There isn’t any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.
I think highly ambitious bottom up interpretability (which naturally pursues this sort of goal), seems like an decent bet overall, but seems unlikely to succeed. E.g. more like a 5% chance of full ambitious success prior to the research[1] being massively speed up by AI and maybe a 10% chance of full success prior to humans being obsoleted.
(And there is some chance of less ambitious contributions as a byproduct of this work.)
I just worried because the field is massive and many people seem to think that the field is much further along than it actually is in terms of empirical results. (It’s not clear to me that we disagree that much, especially about next steps. However, I worry that this post contributes to a generally over optimistic view of bottom-up interp that is relatively common.)
There isn’t any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.
I think highly ambitious bottom up interpretability (which naturally pursues this sort of goal), seems like an decent bet overall, but seems unlikely to succeed. E.g. more like a 5% chance of full ambitious success prior to the research[1] being massively speed up by AI and maybe a 10% chance of full success prior to humans being obsoleted.
(And there is some chance of less ambitious contributions as a byproduct of this work.)
I just worried because the field is massive and many people seem to think that the field is much further along than it actually is in terms of empirical results. (It’s not clear to me that we disagree that much, especially about next steps. However, I worry that this post contributes to a generally over optimistic view of bottom-up interp that is relatively common.)
The research labor, not the interpretability labor. I would count it as success if we know how to do all the interp labor once powerful AIs exist.