Transformers, for example, seem to perform a step of gradient descent inside their Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime; the learned abstractions do most of the work during pretraining, and that cost is then amortized over all runtimes.
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you’re referencing this paper, which trains a shallow, attention-only transformer with the nonlinearity removed from the attention to perform linear regression. There are too many dissimilarities between that setting and LLMs to convince me that this is true of Llama or GPT-4.
Well, obviously not just that one (“Transformers learn in-context by gradient descent”, von Oswald et al 2022). There’s lots of related work examining it in various ways. (I haven’t read a lot of those myself, unfortunately—as always, too many things to read, especially if I ever want to write my own stuff.)
I don’t know why you have a hard time believing it, so I couldn’t say which of those you might find relevant—it makes plenty of sense to me, for the reasons I outlined here, and is what I expect from increasingly capable models. And you didn’t seem to disagree with these sorts of claims last time: “I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights.”
Broadly, I was also thinking of: “How Well Can Transformers Emulate In-context Newton’s Method?”, Giannou et al 2024; “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models”, Fu et al 2023; “CausalLM is not optimal for in-context learning”, Ding et al 2023; “One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention”, Mahankali et al 2023; “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers”, Dai et al 2023; “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022; “What learning algorithm is in-context learning? Investigations with linear models”, Akyürek et al 2022; & “An Explanation of In-context Learning as Implicit Bayesian Inference”, Xie et al 2021.
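For concreteness, here is a minimal numerical sketch (my own toy construction, not code from any of these papers) of the identity that the von Oswald-style results build on: for in-context linear regression, the prediction after one gradient-descent step from zero weights coincides with a softmax-free self-attention readout over the context, using the x’s as keys/queries and the y’s as values. The dimensions, step size, and zero initialization are arbitrary illustrative choices.

```python
# Toy sketch (my construction, not code from the cited papers): for in-context
# linear regression, one gradient-descent step from W = 0 gives the same query
# prediction as one layer of softmax-free ("linear") self-attention over the context.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_ctx = 4, 2, 32       # input dim, output dim, context length (arbitrary)
lr = 0.1                            # gradient-descent step size (arbitrary)

# One in-context linear-regression task: y_i = W_true @ x_i
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))          # context inputs
Y = X @ W_true.T                            # context targets
x_query = rng.normal(size=d_in)             # query input

# (1) One explicit gradient-descent step on L(W) = 1/(2N) * sum_i ||W x_i - y_i||^2,
#     starting from W_0 = 0, then predict on the query.
grad = -(Y.T @ X) / n_ctx                   # dL/dW evaluated at W = 0
W_1 = -lr * grad                            # W_1 = W_0 - lr * grad, with W_0 = 0
y_gd = W_1 @ x_query

# (2) The same prediction as a softmax-free attention readout: keys/queries are the
#     x's, values are the y's, identity projections, scale lr/N.
attn_scores = X @ x_query                   # <x_i, x_query>, no softmax
y_attn = (lr / n_ctx) * (Y.T @ attn_scores) # sum_i (lr/N) * <x_i, x_query> * y_i

print(np.allclose(y_gd, y_attn))            # True: the two computations coincide
```

The cited papers go further, constructing attention weight matrices that implement this (and multi-step or higher-order variants) inside actual Transformer layers, but this toy identity is the core of the “one step of gradient descent per attention layer” reading.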