If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, shouldn’t we be seeing tools that are more helpful in the real world?
There are various reasons you might not see tools which are helpful right now. Here are some overly conjunctive examples:
1. There’s a clear and well-argued plan for the tools/research to build into tools/research which reduce X-risk, but this plan requires additional components which don’t exist yet, so these components are being worked on. Ideally, there would be some reason to think that acquiring these components is achievable.
2. There’s a rough idea that ‘this sort of thing has to be helpful’, and people are iterating on work which seems like it pushes toward making interpretability eventually useful (even if it isn’t making it more useful now and the path is unclear).
3. People are working in the space for reasons other than maximally reducing X-risk (e.g., interpretability is cool, or high status, or is what I do to get paid).
I wouldn’t be particularly worried if (1) were the dominant explanation: clear plans should make it obvious if the research isn’t on track to reduce X-risk, and should directly justify the lack of current tools.
I’m somewhat worried about (2) resulting in a very poor allocation of resources. Further, it’s hard to know whether things are going well (because all you had was a rough intuition). But it’s not really clear that this is a bad bet to make overall.
I think reasons like (2) are very common for people working on interp targeting X-risk.
My guess is that, for people saying they’re working on X-risk, we shouldn’t be super worried about the failure modes associated with (3) (or similar reasons).
I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?
I don’t know if they’d put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it’s a focus of the Anthropic interp team.
Based on my conversations with them, the Anthropic team does seem like a clear example of this, at least insofar as you think understanding circuits in real models with more than one MLP layer is important for interp: superposition almost entirely stops you from using the standard features-as-directions approach!
I would argue that ARC’s research is justified by (1) (roughly speaking). Sadly, I don’t think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.
I’m hopeful that Redwood (where I work) moves toward having a clear and well-argued plan or directly useful techniques (perhaps building up from more toy problems).