That seems like a great idea, and induction heads do seem highly relevant!
What you describe is actually one of the key reasons why I’m so excited about this whole approach. I’ve seen many interesting metalearning tasks, and they mostly just either work or don’t work, or they fail sometimes, and you can try to study their failures to perhaps glean some insight into the underlying algorithm. But they just don’t have (m)any nontrivial “degrees of freedom” in which you can vary them. The class of numerical models, on the other hand, has a substantial number of nontrivial ways in which you can vary your input, and even more so, you can vary it not just discretely, but also ~continuously.
That makes me really optimistic about the possibility you hint at: reverse engineering whatever algorithm the model is running underneath, and then using interpretability tools to verify/falsify those findings. Conversely, interpretability tools could be used to make predictions about the algorithm, which can then be checked experimentally. Hence one can imagine a quite meaningful feedback loop between experimentation and interpretability!