I don’t want a description of every single plate and cable in a Toyota Corolla, I’m not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.
What I want right now is a basic understanding of combustion engines.
This is the wrong ‘length’. The right version of brute-force length is not “every weight and bias in the network” but “the program trace of running the network on every datapoint in pretrain”. Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.
Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.
This is the wrong ‘length’. The right version of brute-force length is not “every weight and bias in the network” but “the program trace of running the network on every datapoint in pretrain”. Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.
Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.