When doing bottom-up interpretability, it’s pretty unclear whether you can answer questions like “how does GPT-4 talk” without being able to explain arbitrary parts to a high degree of accuracy.
I agree that top-down interpretability trying to answer more basic questions seems good. (And generally I think top-down interpretability looks more promising than bottom-up interpretability at current margins.)
(By interpretability, I mean work aimed at having humans understand the algorithm/approach the model uses to solve tasks. I don’t mean literally any work which involves using the internals of the model in some non-basic way.)
I have no gears-level model for how anything like this could be done at all. [...] What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself.
It’s not obvious to me that what you seem to want exists. I think the way LLMs work might not be well described as having key internal gears, or as having a Python code sketch that’s at all illuminating.
(I’d guess something sorta close to what you seem to be describing exists, but it’s ultimately disappointing and mostly unilluminating. And something tremendously complex, but ultimately pretty illuminating if you fully understood it, might also exist.)
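(One concrete candidate for the “sorta close but disappointing” version: the transformer forward pass itself. It’s a short, fully accurate Python sketch of what the model does at every step, but it only describes the architecture, not whatever algorithm the trained weights implement. A minimal toy sketch below, with assumed/hypothetical details: tiny dimensions, random stand-in weights, single-head attention, and no layer norm.)

```python
# A minimal runnable sketch (numpy) of a decoder-only transformer forward pass.
# Toy dimensions, random stand-in weights, single-head attention, no layer norm.
# It is an accurate description of the computation, yet says nothing about what
# algorithm the trained weights actually implement.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, LAYERS, CTX = 100, 64, 2, 16  # hypothetical toy sizes

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

embed = rng.normal(size=(VOCAB, D)) * 0.02
pos = rng.normal(size=(CTX, D)) * 0.02
layers = [{w: rng.normal(size=(D, D)) * 0.02
           for w in ("wq", "wk", "wv", "wo", "w1", "w2")}
          for _ in range(LAYERS)]
unembed = rng.normal(size=(D, VOCAB)) * 0.02

def forward(tokens):
    T = len(tokens)
    x = embed[tokens] + pos[:T]                        # token + position embeddings
    mask = np.triu(np.full((T, T), -1e9), k=1)         # causal mask
    for layer in layers:
        q, k, v = x @ layer["wq"], x @ layer["wk"], x @ layer["wv"]
        att = softmax(q @ k.T / np.sqrt(D) + mask)     # attend to earlier positions
        x = x + att @ v @ layer["wo"]                  # attention block (residual)
        x = x + np.maximum(0, x @ layer["w1"]) @ layer["w2"]  # MLP block (residual)
    return softmax(x @ unembed)                        # next-token distribution

print(forward([1, 5, 7]).shape)  # (3, 100): a distribution over the next token at each position
```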
What motivates your believing that?