I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently.
Interesting, if you happen to have a link I’d be interested to learn more.
Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons?
I like the idea, but it seems hard to judge ‘more novel and [especially] more requiring of intelligence’ other than to sort completions in order of human error on each.
I wasn’t really talking about the training, I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?
I think there’s a lot of work to be done on this still, but there’s some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
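For a concrete toy version of that claim, here is a minimal sketch. It is my own illustration under simplifying assumptions, not code from either paper: the function names are made up, the model is plain linear regression started from zero weights, and the "attention" has no softmax. In that setting, the prediction after one gradient-descent step on the in-context examples is the same number as a linear attention readout with keys x_i, values y_i, and the query as the test input.

```python
import math
import random

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Prediction for x_query after ONE gradient-descent step from w = 0.

    Loss: L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2
    """
    n, d = len(xs), len(x_query)
    w = [0.0] * d
    # Gradient of the loss with respect to each weight w_j.
    grad = [sum((dot(w, x) - y) * x[j] for x, y in zip(xs, ys)) / n
            for j in range(d)]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: sum of values y_i weighted by (x_i . x_query)."""
    n = len(xs)
    return (lr / n) * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

# Hypothetical in-context examples from a random linear task.
random.seed(0)
true_w = [random.gauss(0, 1) for _ in range(3)]
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]
ys = [dot(true_w, x) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(3)]

gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(gd_pred, attn_pred, math.isclose(gd_pred, attn_pred, abs_tol=1e-12))
```

This only shows that the two computations coincide in the simplest case. Whether trained transformers actually do this is exactly what the criticism disputes.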
I’m glad you think this has been a valuable exchange.
Sorry, I don’t have a link for using actual compression algorithms; it was a while ago. I didn’t think it would come up, so I didn’t note anything down. My recent spate of commenting is unusual for me (and I don’t actually keep many notes on AI-related subjects).
I definitely agree that it is ‘hard to judge’ ‘more novel and more requiring of intelligence’. It is, after all, a major thing we don’t even know how to clearly solve for evaluating other humans (so we use tricks that often rely on other things, and these tricks likely do not generalize to other possible intelligences and thus couldn’t be used here). Intelligence has not been solved.
Still, there is a big difference between the level of intelligence required when discussing how great your favorite popstar is vs what in particular they are good at vs why they are good at it (and within each category there are much more or less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could think up good examples, but I couldn’t. You could possibly check things like how well it completes things like parts of this conversation (which is somewhere in the middle).
I wasn’t really able to properly evaluate your links. There’s just too much they assume that I don’t know.
I found your first link, ‘Transformers Learn In-Context by Gradient Descent’, a bit hard to follow (though I don’t particularly think it is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes based on training and just ‘reading’ the context, but if both mechanisms are simple, I suppose that makes sense.
Their claim about how in-context learning can ‘curve’ better also reminds me of the ODEs used for samplers in diffusion models (I’ve written a number of samplers for diffusion models as a hobby and to work on my programming). Higher-degree ODEs curve more too (though they have their own drawbacks, and a particularly high degree is generally a bad idea) by using extra samples, just like this can use extra layers. Gradient descent is effectively first degree by default, right? So it wouldn’t be a surprise if you can curve more than it. You would expect sufficiently general things to resemble each other, of course. I do find it a bit strange just how similar the loss for steps of gradient descent and transformer layers is. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it, though.)
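To make the ‘first degree vs. higher degree’ point concrete, here is a minimal sketch on a toy ODE. It is not an actual diffusion sampler, and the function names are just ones I picked. In standard terms it compares a first-order method (Euler, one evaluation per step, the same shape as a plain gradient-descent update) with a second-order one (Heun, which spends an extra evaluation per step to follow the curve better).

```python
import math

def euler_step(f, t, x, h):
    """First-order step: follow the slope at the current point."""
    return x + h * f(t, x)

def heun_step(f, t, x, h):
    """Second-order step: average the slope here and at the Euler guess."""
    k1 = f(t, x)
    k2 = f(t + h, x + h * k1)  # the 'extra sample'
    return x + 0.5 * h * (k1 + k2)

def integrate(step, f, x0, t0, t1, n_steps):
    """Apply `step` n_steps times with a fixed step size from t0 to t1."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        x = step(f, t, x, h)
        t += h
    return x

def decay(t, x):
    """Toy ODE dx/dt = -x, whose exact solution is x(t) = x0 * exp(-t)."""
    return -x

exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, decay, 1.0, 0.0, 1.0, 10) - exact)
heun_err = abs(integrate(heun_step, decay, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error: {euler_err:.6f}  Heun error: {heun_err:.6f}")
```

With the same number of steps, Heun lands closer to the exact value on this problem, at the cost of twice as many function evaluations. That is the trade-off I had in mind when comparing extra samples to extra layers.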
Even though I can’t critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
The graphs really are shockingly similar in the single-layer case, though, which raises the likelihood that there’s something to it. And the multi-layer ones really do seem like simply a higher-degree polynomial ODE.
The second link, ‘In-context Learning and Gradient Descent Revisited’, which was equally difficult, has this line: “Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence.” That sounds pretty damning to me, assuming they are correct (which I also can’t evaluate).
I could probably figure them out, but I expect it would take me a lot of time.
Even though I can’t critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
I continue to think so :). Thanks again!
Agreed, that’s definitely a general failure mode.