I agree with you that there’s a good chance that both forms of tractability you outline here fail in practice; it does seem plausible that you can’t get a mechanistic interpretation of a powerful LM that is both 1) faithful and complete and 2) human-understandable.* I also think the mechanistic interpretability community has not yet fully reverse engineered, from a large neural network, an algorithm that wouldn’t have been easier for humans to implement directly or to find via program induction, so we have no clean example to point to as a rebuttal of the “just make the interpretable AI” approach.
However, I think there are reasons why your analogy doesn’t apply to the case of AI:
It’s wrong to say we have a broken-down car that we just need to fix; we don’t know how to build a car that actually does many of the tasks GPT-3 can do, nor do we have any real idea of how to do so.
On the other hand, the elephant really seems to be working. This might be because a lot of the intelligent behavior current models exhibit is, in some sense, irreducibly complex. But it might also just be because it’s easier to search through the space of programs when you parameterize them using a large transformer. Under this view, mechanistic interp can work because the elephant is a clean solution to the problems we face, even though evolution is messy.
Relatedly, it does seem like a lot of the reason we can’t do the elephant approach in your analogy is that the elephant isn’t a very good solution to our problems!
IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks have turned out to be more interpretable than we might have naively expected, and there’s a lot of shovel-ready work in this area. I think if you had asked many people three years ago, they would’ve said that we’d never find a non-trivial circuit in GPT-2-small, a 124M-parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda’s modular addition work.
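For concreteness, here is a minimal sketch of the behavioral task the IOI work reverse engineers a circuit for (indirect object identification), using the off-the-shelf Hugging Face GPT-2-small checkpoint. The prompt and the logit-difference metric are my own illustration of the task’s general shape, not code from the paper itself:

```python
# IOI task sketch: given "... John gave a drink to", GPT-2-small should
# prefer the indirect object (" Mary") over the repeated subject (" John").
# The logit difference below is the kind of behavioral signal that the
# circuit analysis tries to explain mechanistically.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # GPT-2-small
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for next token

io_id = tokenizer.encode(" Mary")[0]  # indirect object (correct answer)
s_id = tokenizer.encode(" John")[0]   # subject (incorrect answer)
print("logit diff (IO - S):",
      (next_token_logits[io_id] - next_token_logits[s_id]).item())
```

The actual circuit-finding then goes further, e.g. patching the activations of individual attention heads to see which components drive this logit difference.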
* This is not a knockdown argument against current mechanistic interpretability efforts. I think the main reasons to work on mechanistic interp do not look like “we can literally understand all the cognition behind a powerful AI”, but instead “we can bound the behavior of the AI” or “we can help other, weaker AIs understand the powerful AI”. For example, we might find good heuristic arguments even if we can’t find fully complete and valid interpretations.
IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks have turned out to be more interpretable than we might have naively expected, and there’s a lot of shovel-ready work in this area. I think if you had asked many people three years ago, they would’ve said that we’d never find a non-trivial circuit in GPT-2-small, a 124M-parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda’s modular addition work.
I don’t think I’ve seen many people be surprised here, and indeed, at least in my model of the world, interpretability is progressing slower than I was hoping for/expecting. When I saw Chris Olah’s work 6 years ago, I had hope that we would make real progress understanding how these systems think, and that lots of people would end up being able to contribute productively to the field. But our understanding has IMO barely kept up with the field’s changing architectures and is extremely far from being able to say much of anything definite about how these models do any significant fraction of what they do, and very few people outside of Chris Olah’s team seem to have made useful progress.
I would be interested if you could dig up any predictions by people who predicted much slower progress on interpretability. I don’t currently believe that many people are surprised by the current tractability of the space (I do think there is a trend for people working on interpretability to feel excited by their early work, but the incentives here are too strong for me to straightforwardly take someone’s word for it, though it’s still evidence).
I have seen one person be surprised (I think twice in the same convo) by the progress that had been made.
ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.
I think the main reasons to work on mechanistic interp do not look like “we can literally understand all the cognition behind a powerful AI”, but instead “we can bound the behavior of the AI”
I assume “bound the behavior” means provide a worst-case guarantee. But if we don’t understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don’t understand wouldn’t ruin our guarantee?
we can help other, weaker AIs understand the powerful AI
My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a solution to amplification that’s still uninterpretable by humans.
My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a solution to amplification that’s ultimately still uninterpretable by humans.
This somewhat feels like semantics to me; it still seems like a win condition! I don’t personally care whether interpretability helps via humans directly understanding the systems themselves, or via us understanding them partially and using weaker AI systems to fully understand them, so long as the result is good enough to make aligned systems.
I also think that interpretability lies on a spectrum rather than being a binary.
yet Redwood has reverse engineered the IOI circuit in GPT-2-small

I’m very unconvinced by the results in the IOI paper.
I’d be interested to hear in more detail why you’re unconvinced.