Nice work!
This section in Anthropic’s work on Induction heads seems highly relevant—I would be interested in seeing an extension of your analysis that looks at what induction heads do in these tasks.
If we believe the claims in that paper, then in-context learning of any kind seems to be driven by a fairly simple mechanism not unlike kNN: induction attention heads. Since it’s pretty tractable to locate induction heads in an automated way, we could potentially look at the actual mechanism used to implement these predictions and verify/falsify the hypotheses you make about how GPT makes them. (Although you’d probably have to switch to an open-source model.)
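To make "locate induction heads in an automated way" a bit more concrete, here is a minimal sketch of one common heuristic, not necessarily the exact procedure from the paper. It assumes you have already run an open-source model on a random block of tokens repeated twice and extracted each head's attention pattern as a nested list. The function names, the input layout, and the 0.5 threshold are all my own assumptions for illustration.

```python
def induction_score(attn, period):
    """Score one head on a sequence made of a random block repeated twice.

    attn[t][s] is the attention weight from query position t to key
    position s. An induction head at position t (in the second repeat)
    should attend to the token right after the previous occurrence of
    the current token, i.e. to position t - period + 1.
    """
    seq_len = len(attn)
    if seq_len != 2 * period:
        raise ValueError("expected a sequence of exactly two repeats")
    second_repeat = range(period, seq_len)
    return sum(attn[t][t - period + 1] for t in second_repeat) / period


def find_induction_heads(patterns, period, threshold=0.5):
    """Return (layer, head) keys whose score is at least `threshold`.

    `patterns` maps (layer, head) -> attention matrix. The threshold is
    an arbitrary illustrative cutoff, not a value taken from the paper.
    """
    scores = {key: induction_score(a, period) for key, a in patterns.items()}
    hits = [key for key, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda key: -scores[key])
```

Heads that pass this filter would be the natural candidates to inspect on the numerical tasks, for example by ablating them and checking whether the predictions degrade.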
That seems like a great idea, and induction heads do seem highly relevant!
What you describe is actually one of the key reasons why I’m so excited about this whole approach. I’ve seen many interesting metalearning tasks, and they mostly just either work or don’t work, or they fail sometimes, and you can try to study their failures to perhaps glean some insight into the underlying algorithm. But they just don’t have (m)any nontrivial “degrees of freedom” along which you can vary them. The class of numerical models, on the other hand, offers a substantial number of nontrivial ways in which you can vary your input, and what’s more, you can vary it not just discretely but also ~continuously.
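As a toy illustration of what varying the input ~continuously could look like, the sketch below builds few-shot prompts for a linear function and sweeps its slope in small steps. The linear task, the prompt format, and both function names are hypothetical choices of mine for illustration, not the setup used in the post.

```python
def linear_task_prompt(slope, intercept, xs, query_x):
    """Build a few-shot prompt for y = slope * x + intercept.

    The "x = ..., y = ..." line format is an assumed, illustrative layout.
    The final line leaves y blank for the model to complete.
    """
    lines = [f"x = {x:.2f}, y = {slope * x + intercept:.2f}" for x in xs]
    lines.append(f"x = {query_x:.2f}, y =")
    return "\n".join(lines)


def sweep_slopes(start, stop, steps, intercept, xs, query_x):
    """Return (slope, prompt) pairs for evenly spaced slopes; needs steps >= 2."""
    step = (stop - start) / (steps - 1)
    slopes = [start + i * step for i in range(steps)]
    return [(s, linear_task_prompt(s, intercept, xs, query_x)) for s in slopes]
```

Feeding each prompt in such a sweep to the model and plotting its completion against the slope gives a response curve, which is the kind of fine-grained signal that a task which merely works or fails cannot provide.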
That makes me really optimistic about the possibility you hint at: reverse engineering whatever algorithm the model is running underneath, and then using interpretability tools to verify/falsify those findings. Conversely, interpretability tools could be used to make predictions about the algorithm, which can then be checked experimentally. Hence one can imagine a quite meaningful feedback loop between experimentation and interpretability!