Note that I am, in general, reluctant to claim to know how I will react to evidence in the future. For cases extreme enough, I do know how I would react, but I like to allow myself to use all the evidence I have at that point rather than what I thought beforehand. I do not currently know enough about what would convince me of intelligence in an AI to say for sure. (In part because many people before me have been so obviously wrong.)
I wouldn’t say I see intelligence as a boolean, but as many-valued… though those values include a level below which there is no meaningful intelligence (i.e., not intelligent). This could be simplified to ternary rather than binary: not intelligent vs. sort of intelligent vs. genuinely intelligent. A rock… not intelligent. An abacus… not intelligent. A regular computer… not intelligent. Every program I have ever personally written… definitely not intelligent. Almost everyone agrees on those. There is much more disagreement about LLMs and other modern AI, but I’d still say they aren’t intelligent. (Sort of intelligent might include particularly advanced animals, but I am unsure; I’ve certainly heard plenty of claims to that effect.)
I do think some of them can be said to understand certain things to a shallow degree despite not being intelligent. For example, LLMs understand what I am asking them to do if I write something in Korean asking them to answer a particular question in English, or vice versa (I tested both directions when LLMs became a real thing, since I am learning Korean, and they often did it well even back then). Similarly, if I tell a well-trained image generation AI that I want a photo, most understand what set of features makes something photographic.
Perhaps I should note that, under my definition, counting as intelligent requires either a very deep understanding of something reasonably broad or notably general intelligence. I generally think people should use the same definitions as each other in these discussions, but the definition should also be the correct one, and that is hard in this case since people do not understand intelligence deeply enough to have a great definition, even when we are just talking about humans. (Sometimes I barely think I qualify as intelligent, especially when reading math or AI papers, but that is a completely different definition. How we define it matters.)
I am highly unlikely to consider a tool AI to be intelligent, especially since I know it doesn’t understand much about things in general. I am utterly convinced that LLMs are simple tool AI at present, as are the other AIs in general use. As far as intelligence goes, modern tool AI might as well be a very complicated program I wrote myself.
I actually see ‘neural nets’ as building a lossy compression scheme out of the data provided for their training; at inference you then supply a novel key that wasn’t actually part of that data and see what happens. I have heard of people getting similar results just using mechanistic pieces of ordinary lossless compression schemes as well, though even more inefficiently. (Basically, you are making a dictionary based on the training data.) Gradient descent seems to allow very limited movement near the real data to still make sense, and that seems to be what most of the other advancements involved are for as well.
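To make the dictionary picture concrete, here is a toy sketch (my own illustration, not the specific scheme I half-remember) of classification with an off-the-shelf lossless compressor: a new input gets the label of whichever stored training text it compresses best alongside.

```python
import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when gzip finds shared structure."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query: str, examples: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor over (text, label) pairs; the 'dictionary' here is
    just the raw training texts themselves."""
    _, label = min(examples, key=lambda ex: ncd(query, ex[0]))
    return label

# Toy usage (distances on strings this short are noisy; illustrative only).
examples = [
    ("the cat sat on the mat and purred at the window", "animals"),
    ("the stock market fell sharply on interest rate fears", "finance"),
]
print(classify("a kitten purred on the rug by the window", examples))
```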
Generally, when testing things like AI for intelligence, we seem to serve up either the easiest or the hardest questions, because we want them either to fail or to succeed based on our own beliefs. And I agree that the way something is obfuscated matters a lot to the difficulty of the question after obfuscation. The questioner is often at fault for how the results turn out, whether or not the thing being questioned is intelligent enough to answer in a neutral setting. (This is true when humans question humans as well.)
I don’t find arbitrary operations as compelling. The problem with arbitrary operations is the obvious fact that they don’t make sense. Under some definitions of intelligence that matters a lot. (I don’t personally know whether it does.) Plus, I don’t know how to judge them perfectly (I’m overly perfectionist in attitude, even though I’ve realized perfection is impossible) when they are arbitrary, except in trivial cases where I can just tell a computer the formula to check. That’s why I like the rediscovery stuff.
Can you make the arbitrary operations fit together perfectly in a sequence like numbers → succession → addition → multiplication, in a way that we can truly know works? And then explain why it works clearly in a few lines? If so, that is much better evidence. (That’s actually an interesting idea. LLMs clearly understand human language if they understand anything, so if they are intelligent and a human would get it, they should be able to do it based on your explanation to humans. Write up an article about the succession, explaining how it makes sense, and then ask questions that extend it in the obvious way.)
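For reference, this is what ‘fitting together in a sequence’ looks like for the standard chain; a minimal sketch of my own, purely illustrative, where each operation is defined only in terms of the previous one and can be checked mechanically:

```python
def succ(n: int) -> int:
    """Succession: the next natural number."""
    return n + 1

def add(a: int, b: int) -> int:
    """Addition as repeated succession."""
    result = a
    for _ in range(b):
        result = succ(result)
    return result

def mul(a: int, b: int) -> int:
    """Multiplication as repeated addition."""
    result = 0
    for _ in range(b):
        result = add(result, a)
    return result

assert add(2, 3) == 5 and mul(2, 3) == 6
# A chain of *arbitrary* operations would need to be checkable in the same
# mechanical way for me to trust the evaluation.
```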
There could be a way in which its wrong answer, or the right answer, was somehow included in the question without my knowing it, because I am not superhuman at next-word prediction (obviously, and I don’t even try). Modern AI has proven itself quite capable of reading into word choice (if it understands anything well, that would be it), and we could get it to answer correctly, like Clever Hans, by massaging the question even subconsciously. (I’m sure this has been pointed out by many people.)
I still do think that these arbitrary operations are a good start, just not definitive. Honestly, in some ways the problem with arbitrary operations is that they are too hard, so at a given difficulty they test human memory and knowledge more than intelligence. If an LLM were actually intelligent, it would be a different kind of intelligence, so we’d have a hard time gauging the results.
So, I think the test where you didn’t know what I was getting at was written in a somewhat unclear manner. Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons? (Possibly separately.) How does it perform on rote word completion? Compare that to how it performs on things requiring a little understanding, then a little more, up until you reach genuinely intellectually challenging and completely novel ideas. How does its ability to complete these sentences change as the task requires more understanding of the world of thoughts and ideas rather than just sentence completion? Obviously it will get worse, but how does the rate of decline compare to humans’? Since it is superhuman on plain sentence completion, if at any point it does worse than a human, that seems like good evidence that it is reaching its limit.
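In pseudocode terms, the comparison I have in mind is roughly the following sketch (the tier structure and the scoring function are hypothetical placeholders, not an existing benchmark):

```python
from typing import Callable, Dict, List

def degradation_profile(
    tiers: Dict[str, List[str]],        # tier name -> completion prompts
    score: Callable[[str], float],      # assumed black box: accuracy in [0, 1]
) -> Dict[str, float]:
    """Mean score per tier, ordered from rote completion to novel ideas."""
    return {name: sum(score(p) for p in prompts) / len(prompts)
            for name, prompts in tiers.items()}

def first_crossover(model_profile: Dict[str, float],
                    human_profile: Dict[str, float],
                    tier_order: List[str]):
    """The first tier where the model falls below humans, despite being
    superhuman on the rote tiers; None if it never does."""
    for tier in tier_order:
        if model_profile[tier] < human_profile[tier]:
            return tier
    return None
```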
One thing I think should be done more for AIs is to give them actual reference materials like dictionaries or the grammar manual in the paper you mentioned. In fact, I think the AI should be trained to write those itself. (I’m sure some people do that. It is not the same as what o1 is doing, because o1’s approach is far too primitive and short-term.)
I do have a major problem with taking the Gemini paper at face value, because every paper in AI makes claims that turn out to be overstated (this is probably normal in all fields, but I don’t read many outside of AI, and those are mostly just specific areas of math). They all sound good but turn out to be not quite what is claimed. (That said, LLMs really are good at translation, though Google Translate doesn’t actually work all that well when used for more than a short phrase, which is funny considering the claim in the paper is for a Google AI.) Google’s AI can’t do Korean well, for instance. (I haven’t tried Gemini, as I had gotten bored of trying LLMs by then.)
From reading their description, I am not entirely sure what their testing procedure was; the writeup seems unclear. But if I’m reading it right, the setup makes it harder to be sure whether the machine translation is correct. Reference translations are a proxy, so comparing the AI translation to them rather than to the actual meaning introduces a bunch of extra noise.
That said, translating Kalamang from a grammar book and dictionary is probably close enough to the kind of thing I was speculating on, assuming there really wasn’t any Kalamang in the training data. Now it needs to be done a bunch of times by neutral parties. (Not me; I’m lazy, very skeptical, and not a linguist.) The included table looks to me like the model is actually dramatically inferior to human performance according to human evaluations when translating from Kalamang (though relatively close on English to Kalamang). It is interesting.
Ignore sample efficiency (is that the term?) at your own peril. While you are thinking about the training, I wasn’t really talking about the training; I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related? This is sort of related to few-shot prompting. The fewer hints it needs to reach a high level of performance on something it can’t do from its initial training, the more likely it is to be intelligent. Most things in the world are still novel to the AI despite the insane amount it saw in training, which is why it makes so many mistakes. We know it can do the things it has seen a billion times (possibly literally) in its training, and that is uninteresting.
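As a rough sketch of how I would quantify that (hypothetical placeholder names; run_task is an assumed black box returning k-shot accuracy on a task absent from training):

```python
from typing import Callable

def shots_to_threshold(run_task: Callable[[int], float],
                       max_shots: int = 32,
                       threshold: float = 0.9) -> int:
    """Smallest number of in-context examples ('hints') needed to clear the
    threshold; -1 if it is never reached. Lower is more interesting to me."""
    for k in range(max_shots + 1):
        if run_task(k) >= threshold:
            return k
    return -1
```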
I’m glad you think this has been a valuable exchange, because I don’t think I’ve written my points very well. (Both too long and, for other reasons, unclear at the same time.) I have a feeling that everything I’ve said could be much clearer. (Also, given how much I wrote, a fair bit is probably wrong.) It has been interesting to respond to your posts and have to actually think through what I think. It’s easy to get lazy when thinking about things just by myself.
> I have heard of people getting similar results just using mechanistic pieces of ordinary lossless compression schemes as well, though even more inefficiently.
Interesting; if you happen to have a link, I’d be interested to learn more.
> Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons?
I like the idea, but it seems hard to judge ‘more novel and [especially] more requiring of intelligence’ other than to sort completions in order of human error on each.
> I wasn’t really talking about the training; I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?
I think there’s a lot of work to be done on this still, but there’s some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
> I’m glad you think this has been a valuable exchange
I continue to think so :). Thanks again!
Sorry, I don’t have a link for using actual compression algorithms; it was a while ago, and I didn’t think it would come up, so I didn’t note anything down. My recent spate of commenting is unusual for me (and I don’t actually keep many notes on AI-related subjects).
I definitely agree that ‘more novel and more requiring of intelligence’ is hard to judge. It is, after all, something we don’t even know how to solve cleanly when evaluating other humans (so we use tricks that often rely on other things, and those tricks likely do not generalize to other possible intelligences, so they couldn’t be used here). Intelligence has not been solved.
Still, there is a big difference between the level of intelligence required to discuss how great your favorite popstar is, vs. what in particular they are good at, vs. why they are good at it (and within each category there are more and less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could have thought up good examples, but I couldn’t. You could possibly check how well it completes, say, parts of this conversation (which is somewhere in the middle).
I wasn’t really able to properly evaluate your links. There’s just too much they assume that I don’t know.
I found your first link, ‘Transformers Learn In-Context by Gradient Descent’, a bit hard to follow (though I don’t particularly think that is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes from training and from just ‘reading’ the context, but if both mechanisms are simple, I suppose that makes sense.
Their claim about how in-context learning can ‘curve’ better also reminds me of the ODE solvers used as samplers in diffusion models (I’ve written a number of samplers for diffusion models as a hobby and to work on my programming). Higher-order solvers curve more too (though they have their own drawbacks, and going to particularly high order is generally a bad idea) by using extra samples, just as this can use extra layers. Gradient descent is effectively first-order by default, right? So it wouldn’t be a surprise if you can curve more than it does. You would expect sufficiently general things to resemble each other, of course. I do find it a bit strange just how similar the losses for steps of gradient descent and for transformer layers are. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it, though.)
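As a concrete version of what I mean by ‘curving more’, here is a minimal sketch of my own (not from the paper) of a first-order Euler step next to a second-order Heun step of the kind some diffusion samplers use, for an ODE dx/dt = f(x, t):

```python
def euler_step(f, x, t, dt):
    """First-order: follow the slope at the current point (a straight line)."""
    return x + dt * f(x, t)

def heun_step(f, x, t, dt):
    """Second-order: take a trial Euler step, then average the two slopes.
    The extra evaluation of f (the 'extra sample') lets the update bend
    with the trajectory instead of moving in a straight line."""
    k1 = f(x, t)
    x_pred = x + dt * k1
    k2 = f(x_pred, t + dt)
    return x + dt * 0.5 * (k1 + k2)

# Plain gradient descent is the Euler analogue: x <- x - lr * grad(x),
# a straight first-order step along the current slope.
```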
Even though I can’t critique the details, I think it is worth noting that, in areas I understand better, I often find claims of similarity like this not very illuminating, because people want to find similarities/analogies in order to understand things more easily.
The graphs really are shockingly similar in the single-layer case, though, which raises the likelihood that there’s something to it. And the multi-layer ones really do seem like simply a higher-order version of the same thing.
The second link, ‘In-context Learning and Gradient Descent Revisited’, which was equally difficult, has this line: “Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence.” That sounds pretty damning to me, assuming they are correct (which I also can’t evaluate).
I could probably figure them out, but I expect it would take me a lot of time.
> Even though I can’t critique the details, I think it is worth noting that, in areas I understand better, I often find claims of similarity like this not very illuminating, because people want to find similarities/analogies in order to understand things more easily.
Agreed, that’s definitely a general failure mode.
Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you’ve said. Thank you!
No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don’t necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn’t seem valuable.
Also, you’re welcome.