There are two ways a large language model transformer learns: type 1, the gradient descent process, which is certainly not information-efficient, taking billions of examples; and type 2, the mysterious in-episode learning process, where a transformer learns to do a 'new' task from ~5 examples in an engineered prompt. I think the fundamental question is whether type 2 only works when the task to be learned is represented in the original training data, or whether it generalizes out of distribution. If it truly generalizes, then the obvious next step is to somehow skip straight to type 2 learning.
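As a concrete illustration of what type 2 looks like in practice, here is a minimal sketch of few-shot in-context learning: the model's weights stay frozen, and the only "learning" comes from a handful of examples packed into the prompt. The word-reversal task and the idea of handing the prompt to some completion endpoint are illustrative assumptions, not any particular model's API.

```python
def build_few_shot_prompt(examples, query):
    """Format ~5 input/output pairs plus a new query into one prompt string."""
    lines = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# ~5 demonstrations of a 'new' task (here, reversing a word)
examples = [
    ("cat", "tac"),
    ("house", "esuoh"),
    ("river", "revir"),
    ("planet", "tenalp"),
    ("guitar", "ratiug"),
]

prompt = build_few_shot_prompt(examples, "window")
print(prompt)
# A frozen model completing this prompt is expected to answer "wodniw" --
# whether it still manages this for tasks genuinely absent from its training
# data is exactly the out-of-distribution question raised above.
```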