I’m seeking some clarification. My reading of your post is that you see the following two concepts as intertwined:
1. Efficient representation of learned information
2. Efficient learning of information
As you point out (and I agree), transformer parameters live in a small space, and the realities of human biology seem to imply that we can do #1 better: that is, use a “lighter” algorithm with fewer free parameters to store our learned information.
If I understand you correctly, you believe that this “far more efficient architecture trying to get out” would also be better at #2 (require less data to reach this efficient representation). While I agree that an algorithm to do this better must exist, it is not obvious to me that a better compressed/sparse storage format for language models would necessarily require less data to train.
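To make the distinction concrete, here is a minimal sketch (my own illustration, not anything from the post) of how a trained dense weight matrix can be pruned into a much lighter sparse representation after the fact. The layer size and the magnitude-pruning scheme are purely hypothetical, and nothing about the compression changes how much data the original training run consumed.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Stand-in for one layer of a trained transformer (hypothetical size).
dense_weights = rng.normal(scale=0.02, size=(1024, 1024))

# Magnitude pruning: keep only the largest 5% of weights by absolute value.
threshold = np.quantile(np.abs(dense_weights), 0.95)
pruned = np.where(np.abs(dense_weights) >= threshold, dense_weights, 0.0)
sparse_weights = csr_matrix(pruned)

dense_mb = dense_weights.nbytes / 1e6
sparse_mb = (sparse_weights.data.nbytes
             + sparse_weights.indices.nbytes
             + sparse_weights.indptr.nbytes) / 1e6
print(f"dense storage: {dense_mb:.1f} MB, sparse storage: {sparse_mb:.1f} MB")

# The stored representation is roughly 10x lighter, but the gradient-descent
# run that produced dense_weights still needed just as many training examples.
```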
So, questions: Did I misunderstand you, and if so, where? Are there additional reasons you believe the two concepts to be correlated?
There are two ways a large language model transformer learns. Type 1 is the gradient descent process, which certainly does not learn information efficiently, taking billions of examples. Type 2 is the mysterious in-episode (in-context) learning process, where a transformer learns to do a ‘new’ task from ~5 examples in an engineered prompt. I think the fundamental question is whether type 2 only works if the task to be learned is represented in the original dataset, or whether it generalizes out of distribution. If it truly generalizes, then the obvious next step is to somehow skip straight to type 2 learning.
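For concreteness, here is a rough sketch of what I mean by type 2: the model’s weights never change, it is simply conditioned on a handful of worked examples plus a query. The toy task and the surrounding scaffolding are my own hypothetical illustration, not something from the post.

```python
# "Type 2" learning as a prompt: ~5 demonstrations of an invented string task,
# followed by a query the model must complete in-context, with no weight updates.
examples = [
    ("hello", "ellohay"),
    ("string", "ingstray"),
    ("another", "anotherway"),
    ("simple", "implesay"),
    ("transform", "ansformtray"),
]

prompt = "Rewrite each word:\n"
prompt += "\n".join(f"{x} -> {y}" for x, y in examples)
prompt += "\nweights ->"  # the query the model completes purely in the forward pass

print(prompt)

# Contrast with "type 1": acquiring the same capability by gradient descent
# would mean a large labeled dataset and many optimizer steps, schematically:
#   for batch in dataset: loss = model(batch); loss.backward(); opt.step()
```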