Interpolation vs extrapolation is obviously very simple in theory: are you going in between points the model has trained on, or extending outside of the training set? To just use math as an example…
Sorry, I should have been clearer. I agree it’s straightforward in cases like the ones you give; I’m really thinking of the case of large language models. It’s not at all clear to me that we even have a good way to identify in- vs out-of-distribution for a model trained against much of the internet. If we did, some of this stuff would be much easier to test.
The proposed experiment should be something of a test of this, though hardly definitive (not that we as a society are at the stage of doing definitive tests).
What would constitute a (minimal-ish) definitive test in your view?
And how do you expect the proposed experiment to go? Would you expect current-generation LLMs to fail completely, or to succeed for simple but not complex cases, or to have an easy time with it?
It seems important to keep in mind that we should probably build things like this from the end to the beginning, as mentioned, so that we know exactly what the correct answer is before we ask, rather than assuming it.
Absolutely; this is a huge weakness of much of the existing research trying to test the limitations of LLMs with respect to general reasoning ability, and a large motivation for the experiment (which has just been accepted for the next session of AI Safety Camp; if things go as expected I’ll be leading a research team on this experiment).
Perhaps one idea would be to create three varieties of each type of question:
1. Non-obfuscated but not in the training data (we do less of this than sometimes thought)
2. Obfuscated directly from known training data
3. Obfuscated and not in the training data
I’m not sure what it would mean for something not in the training data to be obfuscated. Obfuscated relative to what? In any case, my aim is very much to test something that’s definitively not in the training data, because it’s been randomly generated and uses novel words.
As to your disagreement where you say scale has always decreased error rate, this may be true when the scale increase is truly massive,
Sure, I only mean that there’s a strong correlation, not that there’s a perfect correspondence.
but I have seen scale not help on numerous things in image generation AI
I think it’s important to distinguish error rate on the loss function, which pretty reliably decreases with scale, from other measures like ‘Does it make better art?’, which a) quite plausibly don’t improve with scale since they’re not what the model’s being trained on, and b) are very much harder to judge. Even ‘Is the skin plasticky or unrealistic?’ seems tricky (though not impossible) to judge without a human labeler.
Of course, one of the main causes of confusion is that ‘Is it good at general reasoning?’ is also a hard-to-judge question, and although it certainly seems to have improved significantly with scale, it’s hard to show that in a principled way. The experiment I describe is designed to at least get at a subset of that in a somewhat more principled way: can the models develop hypotheses in novel domains, figure out experiments that will test those hypotheses, and come to conclusions that match the underlying ground truth?
What would be a minimal-ish definitive test for LLM-style AI? I don’t really know. I could most likely come up with tests for it, but I don’t really know how to make them fairly minimal. I can tell you that current AI isn’t intelligent, but as for what would prove intelligence, I’ve been thinking about it for a while and I really don’t have much. I wish I could be more helpful.
I do think your test of whether an AI can follow the scientific method in a novel area is intriguing.
Historically, a lot of people have come up with (in retrospect) really dumb tests (like chess playing) that they assumed would demonstrate intelligence, because they didn’t really understand how AI would work, and this doesn’t seem to have abated with the switch to deep learning. I don’t want to do that, and so I am reluctant to try (another problem with comparing human intelligence to machine intelligence). This is complicated in part because we really don’t even understand the nature of human intelligence, much less general intelligence in the abstract.
In theory it is simple, but no single test is necessarily robust to things like being in the training data because someone decided to include that particular test (which has happened many times after someone pointed out a particular flaw, though the test needn’t be included for that reason), so it would need to be tested across a number of different areas, and they all need to be genuinely hard if it doesn’t have the capability. Obviously the exact test items being held in reserve is useful, but I don’t think it can rule out being included since there are an awful lot of people making training data due to the way these are trained. Obfuscation does help, but I wouldn’t rule out it figuring out how to deobfuscate things without being generally intelligent (humans are not great generators of problems).
More limited, specific tests are easier to design. We can programmatically create effectively infinite math problems to test with, and as long as the generator produces a notably different distribution of problems from anything in training, we know it has learned math when it does well… but that only tests whether it can do math, and they can create effectively infinite math for the training as well.
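As a rough illustration of that kind of generator (just a sketch; `model_answer_fn` is a hypothetical stand-in for however the model would be queried and its reply parsed):

```python
import random

def make_problem(rng, digits, ops):
    """Generate one arithmetic problem as (question_text, answer)."""
    a = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
    op = rng.choice(ops)
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"What is {a} {op} {b}?", answer

def sample_suite(seed, n, digits, ops):
    rng = random.Random(seed)
    return [make_problem(rng, digits, ops) for _ in range(n)]

# A "training-like" distribution: small operands, addition/subtraction only.
train_like = sample_suite(seed=0, n=100, digits=2, ops=["+", "-"])
# A deliberately shifted test distribution: larger operands, multiplication added.
held_out = sample_suite(seed=1, n=100, digits=4, ops=["+", "-", "*"])

def accuracy(model_answer_fn, suite):
    """Fraction of problems the model answers exactly right."""
    return sum(model_answer_fn(q) == a for q, a in suite) / len(suite)

# A large gap between accuracy on train_like and held_out would suggest
# memorization of familiar problem shapes rather than learned arithmetic.
```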
Perhaps if you could genuinely exclude all data during training that in any way has to do with a certain scientific discovery, you could check how well it discerns the real rule from plausible alternative rules when asked, but the best way to do that takes a very long time (waiting for scientific discoveries that weren’t even theorized correctly at the time it was trained), and the other ways of doing it have been shown to be leaky.
The best non-minimal way is to introduce it to entirely new domains where it has not been trained at all, but that requires controlling the training very tightly and may not be useful as an external metric. For instance, train it on only numbers and addition (or for bonus points, only explain addition in terms of the succession of numbers on the number line) mathematically, then explain multiplication in terms of addition and ask it to do a lot of complicated multiplication. If it does that well, explain division in terms of multiplication, and so on. See just how deep it can go and maintain correctness when you explain things only in terms of other things with just that single link. This is not an especially different idea than the one proposed, of course, but I would find it more telling. If it was good at this, then I think it would be worth looking into the level of intelligence it has more closely, but doing well here isn’t proof. (In other words, I think your test is a good start, just not proof.)
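A sketch of what the ground-truth side of such a chained test could look like (purely illustrative; `ask_model` is a hypothetical wrapper for however the model would be prompted with only the textual definitions up to each level):

```python
import random

# Each operation is defined only in terms of the previous link in the chain.
def succ(n):                 # level 0: succession on the number line
    return n + 1

def add(a, b):               # level 1: addition as repeated succession
    for _ in range(b):
        a = succ(a)
    return a

def mul(a, b):               # level 2: multiplication as repeated addition
    total = 0
    for _ in range(b):
        total = add(total, a)
    return total

def div(a, b):               # level 3: (integer) division as the inverse of multiplication
    q = 0
    while mul(add(q, 1), b) <= a:
        q = add(q, 1)
    return q

CHAIN = [("succession", succ, 1), ("addition", add, 2),
         ("multiplication", mul, 2), ("division", div, 2)]

def deepest_reliable_level(ask_model, trials=20, threshold=0.9, seed=0):
    """Walk down the chain; report the deepest level where accuracy stays high."""
    rng = random.Random(seed)
    deepest = None
    for name, fn, arity in CHAIN:
        cases = [tuple(rng.randint(2, 9) for _ in range(arity)) for _ in range(trials)]
        correct = sum(ask_model(name, args) == fn(*args) for args in cases)
        if correct / trials < threshold:
            break
        deepest = name
    return deepest
```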
The problem is, all models are trained on math in general because of how the internet works, so it needs to be these less well-defined areas where we can’t be certain whether or not the answers are in some way correct or flawed, and crucially, just how hard the problems really are. Is it failing/acing extremely difficult/trivial problems? Our intuitions about what is easy/hard seem built specifically for humans. (We aren’t entirely general intelligences, as we appear to have many special-purpose capabilities bolted on, like judging other humans.) Also, giving it access to math tools would be cheating, but people have already started integrating tools for things like that into LLMs.
LLMs are supposedly superhuman at next word prediction, so an interesting (though not telling) test for an LLM might be varying the amount of information and intelligence-requiring content there is in a completely novel text by an author they have never seen before, and seeing how well the LLM continues to predict the next word. If it remains at a similar level, there’s probably something going on in terms of reasoning that’s worth looking at closely. (This can of course be gamed by making it worse at next word prediction on low-content stuff.) This is similar to validation-set testing though, so there is some selection for this in what gets released.
For bonus points, a linguist could make up a bunch of very different full-fledged languages it hasn’t been exposed to using arbitrary (and unusual) rules of grammar and see how well it does on those tests in the new languages compared to an average human with just the same key to the languages (but this can’t just be a cipher, as ciphers are reversible without intelligence once it has figured out how to deobfuscate things, and I believe that plausibly doesn’t require intelligence exactly, though it would for a human).
I forget what the term for this is (maybe ‘data-efficient’?), but the best single test of an area is to compare the total amount of training information given to the AI in training and prompt to the amount a human gets in that area to get to a certain level of ability across a variety of representative areas. LLMs currently do terribly at this, and no one is even vaguely suggesting that trying this with as little training data as humans use would make any sense at all (and again, humans have some special-purpose capabilities built in, so this isn’t even a great test). We also don’t even know how much training data humans actually get… (I’ve seen people trying to ballpark it, but it didn’t seem credible at the time.)
I suspect that in your proposed test, modern AI would likely be able to solve the very easy questions, but would do quite badly on difficult ones. The problem is, I don’t know how easy a question should have to be before we expect it to be solved. I am again reluctant to opine too strongly on this matter.
So, as you know, obfuscation is a method of hiding exactly what you are getting at. You can do this for things it already knows, obviously, but you can also use whatever methods you use for generating obfuscations of known data on the novel data you generated. I would strongly advise testing on known data as a comparison.
This is to test how much of the difficulty is based on the form of the question rather than the content. Or in other words, using the same exact words and setup, ask about completely unknown things and about completely known things. (You can check how well it knows an area using the non-obfuscated material.) For bonus points, see how well it does on things where it already struggles just a little in plain English too.
On another note, I do believe that image generation models are specifically being trained these days to be better at both aesthetics and realism, and are simply failing to move the needle sufficiently as they grow larger. I do agree that even the ‘skin test’ isn’t really very objective (since it is testing against perceptual machinery humans probably have built in, which likely has some skew; a human doesn’t want to judge thousands of pictures a day on such a matter, and using an AI to judge AI really is quite error-prone).
Thanks for the lengthy and thoughtful reply!
I’m planning to make a LW post soon asking for more input on this experiment—one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I’d love to get your input there as well if you’re so moved!
I can tell you that current AI isn’t intelligent, but as for what would prove intelligence, I’ve been thinking about it for a while and I really don’t have much.
I tend not to think of intelligence as a boolean property, but of an entity having some level of intelligence (like IQ, although we certainly can’t blithely give IQ tests to LLMs and treat the results as meaningful, not that that stops people from doing it). I don’t imagine you think of it as boolean either, but calling that out in case I’m mistaken.
Obviously the exact test items being held in reserve is useful, but I don’t think it can rule out being included since there are an awful lot of people making training data due to the way these are trained.
Agreed; at this point I assume that anything published before (or not long after) the knowledge cutoff may well be in the training data.
Obfuscation does help, but I wouldn’t rule out it figuring out how to deobfuscate things without being generally intelligent
The obfuscation method matters as well; eg I think the Kambhampati team’s approach to obfuscation made the problems much harder in ways that are irrelevant or counterproductive to testing LLM reasoning abilities (see Ryan’s comment here and my reply for details).
Perhaps if you could genuinely exclude all data during training that in any way has to do with a certain scientific discovery
I’d absolutely love that and agree it would help enormously to resolve these sorts of questions. But my guess is we won’t see deliberate exclusions on frontier LLMs anytime in the next couple of years; it’s difficult and labor-intensive to do at internet scale, and the leading companies haven’t shown any interest in doing so AFAIK (or even in releasing comprehensive data about what the training data was).
For instance, train it on only numbers and addition (or for bonus points, only explain addition in terms of the succession of numbers on the number line) mathematically, then explain multiplication in terms of addition and ask it to do a lot of complicated multiplication. If it does that well, explain division in terms of multiplication, and so on...This is not an especially different idea than the one proposed, of course, but I would find it more telling. If it was good at this, then I think it would be worth looking into the level of intelligence it has more closely, but doing well here isn’t proof.
Very interesting idea! I think I informally tested something similar at one point by introducing new mathematical operations (but can’t recall how it turned out). Two questions:
Since we can’t in practice train a frontier LLM without multiplication, would artificial new operations be equally convincing in your view (eg, I don’t know, x # y means sqrt(x - 2y)? Ideally something a bit less arbitrary than that, though mathematicians tend to already write about the non-arbitrary ones).
Would providing few-shot examples (eg several demonstrations of x # y for particular values of x and y) make it less compelling?
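To make the first question concrete, here is roughly how I imagine administering such a check (a sketch only; the operator, sampling ranges, and tolerance are arbitrary choices for illustration):

```python
import math
import random

def novel_op(x, y):
    """The made-up operator from the first question: x # y = sqrt(x - 2*y)."""
    return math.sqrt(x - 2 * y)

def sample_pair(rng):
    """Sample (x, y) with x - 2*y > 0 so the operator is defined."""
    y = rng.randint(1, 20)
    x = rng.randint(2 * y + 1, 2 * y + 50)
    return x, y

def build_prompt(n_shots, seed=0):
    """Return a prompt with n_shots worked examples plus one unanswered query."""
    rng = random.Random(seed)
    lines = ["The operator # is defined by x # y = sqrt(x - 2*y)."]
    for _ in range(n_shots):
        x, y = sample_pair(rng)
        lines.append(f"{x} # {y} = {novel_op(x, y):.4f}")
    x, y = sample_pair(rng)
    lines.append(f"{x} # {y} = ")
    return "\n".join(lines), novel_op(x, y)

prompt, target = build_prompt(n_shots=3)
print(prompt)
# The model's completion would then be parsed as a float and counted correct if it
# lands within some tolerance of `target`, e.g. abs(pred - target) < 0.01;
# sweeping n_shots from 0 upward gets at the second question.
```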
LLMs are supposedly superhuman at next word prediction
an interesting (though not telling) test for an LLM might be varying the amount of information and intelligence-requiring content there is in a completely novel text by an author they have never seen before, and seeing how well the LLM continues to predict the next word. If it remains at a similar level, there’s probably something going on in terms of reasoning that’s worth looking at closely.
Sorry, I’m failing to understand the test you’re proposing; can you spell it out a bit more?
For bonus points, a linguist could make up a bunch of very different full-fledged languages it hasn’t been exposed to using arbitrary (and unusual) rules of grammar and see how well it does on those tests in the new languages compared to an average human with just the same key to the languages
I found DeepMind’s experiment in teaching Gemini the Kalamang language (which it had never or barely encountered in the training data) really intriguing here, although not definitive evidence of anything (see section 5.2.2.1 of their Gemini paper for details).
I forget what the term for this is (maybe ‘data-efficient’?), but the best single test of an area is to compare the total amount of training information given to the AI in training and prompt to the amount a human gets in that area to get to a certain level of ability across a variety of representative areas. LLMs currently do terribly at this
From my point of view, sample efficiency is interesting but not that relevant; a model may have needed the equivalent of a thousand years of childhood to reach a certain level of intelligence, but the main thing I’m trying to investigate is what that level of intelligence is, regardless of how it got there.
I suspect that in your proposed test, modern AI would likely be able to solve the very easy questions, but would do quite badly on difficult ones. The problem is, I don’t know how easy a question should have to be before we expect it to be solved. I am again reluctant to opine too strongly on this matter.
My intuition is similar, that it should be able to solve them up to a certain level of difficulty (and I also expect that the difficulty level they can manage correlates pretty well with model size). But as I see it, that’s exactly the core point under debate—are LLM limitations along these lines a matter of scale or a fundamental flaw in the entire LLM approach?
So, as you know, obfuscation is a method of hiding exactly what you are getting at. You can do this for things it already knows, obviously, but you can also use whatever methods you use for generating obfuscations of known data on the novel data you generated. I would strongly advise testing on known data as a comparison.
This is to test how much of the difficulty is based on the form of the question rather than the content. Or in other words, using the same exact words and setup, ask about completely unknown things and about completely known things. (You can check how well it knows an area using the non-obfuscated material.)
Interesting point, thanks. I don’t think of the experiment as ultimately involving obfuscated data as much as novel data (certainly my aim is for it to be novel data, except insofar as it follows mathematical laws in a way that’s in-distribution for our universe), but I agree that it would be interesting and useful to see how the models do on a similar but known problem (maybe something like the gas laws). I’ll add that to the plan.
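For concreteness, here is a minimal sketch of the contrast I have in mind between novel-but-lawful data and a known control; the invented words, the constant, and the ranges are placeholders, not the actual experimental setup:

```python
import random

def simulate_world(n_obs, names, k=3.7, noise=0.0, seed=0):
    """Generate observations obeying names[2] = k * names[0] / names[1]."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        a = rng.uniform(1.0, 10.0)
        b = rng.uniform(1.0, 5.0)
        c = k * a / b + rng.gauss(0, noise)
        rows.append({names[0]: round(a, 2), names[1]: round(b, 2), names[2]: round(c, 2)})
    return rows

# Novel-vocabulary version: the same structure as a gas-law-style relation,
# but with invented words, so it can't be pattern-matched from training data.
novel = simulate_world(5, names=("blick", "vune", "zorp"))

# Known-problem control: identical structure, familiar wording.
control = simulate_world(5, names=("temperature", "volume", "pressure"))

for row in novel:
    print(row)
```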
Thanks again for your deep engagement on this question! It’s both helpful and interesting to get to go into detail on this issue with someone who holds your view (whereas it’s easy to find people to fully represent the other view, and since I lean somewhat toward that view myself I think I have a pretty easy time representing the arguments for it).
Note that I am, in general, reluctant to claim to know how I will react to evidence in the future. There are things so far out there that I do know how I would react, but I like to allow myself to use all the evidence I have at that point, and not what I thought beforehand. I do not currently know enough about what would convince me of intelligence in an AI to say for sure. (In part because many people before me have been so obviously wrong.)
I wouldn’t say I see intelligence as a boolean, but as many-valued… though those values include a level below which there is no meaningful intelligence (i.e., not intelligent). This could be simplified to trinary rather than binary: not intelligent vs sort of intelligent vs genuinely intelligent. A rock… not intelligent. An abacus… not intelligent. A regular computer… not intelligent. Every program I have ever personally written, definitely not intelligent. Almost everyone agrees on those. There is a lot more disagreement about LLMs and other modern AI, but I’d still say they aren’t intelligent. (Sort of intelligent might include particularly advanced animals, but I am unsure. I’ve certainly heard plenty of claims about it.)
I do think some of them can be said to understand certain things to a shallow degree despite not being intelligent, like how LLMs understand what I am asking them to do if I write something in Korean asking them to answer a particular question in English (or vice versa; I tested both when LLMs became a real thing because I am learning Korean, and LLMs do often do it well, even back when I tested it), or how, if I tell an image generation AI that I want a photo, most understand what set of features makes something photographic (if well trained).
Perhaps it should be noted that I think it requires either a very deep understanding of something reasonably broad or a notably general intelligence to count as intelligence? This is part of my definition. I generally think people should use the same definitions as each other in these discussions, but it should be the correct one, and that is hard in this case since people do not understand intelligence deeply enough to have a great definition, even when we are just talking about humans. (Sometimes I barely think I qualify as intelligent, especially when reading math or AI papers, but that is a completely different definition. How we are defining it matters.)
I am highly unlikely to consider a tool AI to be intelligent, especially since I know it doesn’t understand much about things in general. I am utterly convinced that LLMs are simple tool AI at present, as are the other AIs in general use. As far as intelligence goes, modern tool AI might as well just be a very complicated program I wrote, in my view.
I actually see ‘neural nets’ as creating a lossy compression scheme using the data provided for their training, but then you supply a novel key during inference that wasn’t actually part of the data and see what happens. I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently. (Basically, you are making a dictionary based on the training data.) Gradient descent seems to allow very limited movement near real data to still make sense and that is what most of the other advancements involved seem to be for as well.
Generally when testing things like AI for intelligence, we seem to either serve up the easiest or hardest questions, because we either want them to fail or succeed based on our own beliefs. And I agree that the way something is obfuscated matters a lot to the difficulty of the question post obfuscation. The questioner is often at fault for how results turn out whether or not the thing being questioned is intelligent enough to answer in a neutral setting. (This is true when humans question humans as well.)
I don’t find arbitrary operations as compelling. The problem with arbitrary operations is the obvious fact that they don’t make sense. Under some definitions of intelligence that matters a lot. (I don’t personally know if it does.) Plus, if they are arbitrary, I don’t know how to judge things perfectly (I’m overly perfectionist in attitude, even though I’ve realized perfection is impossible) except in trivial cases where I can just tell a computer the formula to check. That’s why I like the rediscovery stuff.
Can you make the arbitrary operations fit together perfectly in a sequence like numbers → succession → addition → multiplication in a way that we can truly know works? And then explain why it works clearly in a few lines? If so, that is much better evidence. (That’s actually an interesting idea. LLMs clearly understand human language if they understand anything, so they should be able to do it based on your explanation to humans if they are intelligent and a human would get it. Write up an article about the succession, with how it makes sense, and then ask questions that extend it in the obvious way.)
There could be a way in which its wrong answer, or the right answer, was somehow included in the question and I don’t know about it because I am not superhuman at next word prediction (obviously, and I don’t even try). Modern AI has proven itself quite capable at reading into word choice (if it understands anything well, that would be it), and we could get it to answer correctly like ‘clever Hans’ by massaging the question even subconsciously. (I’m sure this has been pointed out by many people.)
I still do think that these arbitrary operations are a good start, just not definitive. Honestly, in some ways the problem with arbitrary operations is that they are too hard, and thus at a given difficulty they test human memory and knowledge more than intelligence. If an LLM were actually intelligent, it would be a different kind of intelligence, so we’d have a hard time gauging the results.
So, I think the test where you didn’t know what I was getting at is written in a somewhat unclear manner. Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons? (Possibly separately.) How does it perform on rote word completion? Compare that to how it performs on things requiring a little understanding. Then a little more. Up until you reach genuinely intellectually challenging and completely novel ideas. How does its ability to complete these sentences change as it requires more understanding of the world of thoughts and ideas rather than just sentence completion? Obviously, it will get worse, but how does it compare to humans on the level of change? Since it is superhuman on sentence completion, if at any time it does worse than a human, it seems like good evidence that it is reaching its limit.
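Restated as a rough procedure (with placeholder numbers; real log-probabilities would have to come from the model on genuinely held-out texts, and humans would need to be scored on the same passages):

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood per token; lower means better prediction."""
    return -sum(token_logprobs) / len(token_logprobs)

# Buckets of completions ordered by how much understanding (rather than rote
# continuation) they seem to demand. These numbers are placeholders.
buckets = {
    "rote continuation":          [-0.3, -0.5, -0.2, -0.4],
    "light reasoning required":   [-0.9, -1.1, -0.7, -1.0],
    "novel, idea-heavy passage":  [-2.4, -1.9, -2.8, -2.2],
}

for name, logprobs in buckets.items():
    print(f"{name:27s} mean NLL = {mean_nll(logprobs):.2f}")

# The quantity of interest is how steeply this curve degrades compared with the
# human curve on the same passages, not the absolute values.
```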
One thing I think should be done more for AIs is to give them actual reference materials like dictionaries or the grammar manual in the paper you mentioned. In fact, I think that the AI should be trained to write those itself. (I’m sure some people do that. It is not the same as what o1 is doing, because o1’s approach is far too primitive and short-term.)
I do have a major problem with taking the Gemini paper at face value, because every paper in AI makes claims that turn out to be overstated (this is probably normal in all fields, but I don’t read many outside of AI, and those are mostly just specific math). They all sound good, but turn out to be not what is claimed. (That said, LLMs really are good at translation, though for some reason Google Translate doesn’t actually work all that well when used for more than a short phrase, which is funny considering the claim in the paper is for a Google AI.) For some reason Google’s AI can’t do Korean well, for instance. (I haven’t tried Gemini, as I got bored of trying LLMs by then.)
From reading their description, I am not entirely sure what their testing procedure was. The writeup seems unclear. But if I’m reading it right, the setup makes it harder to be sure whether the machine translation is correct. Reference translations are a proxy, so in comparing the AI translation to them rather than to the actual meaning there is a bunch of extra noise.
That said, the translation of Kalamang from a grammar book and dictionary is probably close enough to the kind of thing I was speculating on, assuming there really wasn’t any in the training. Now it needs to be done a bunch of times by neutral parties. (Not me; I’m lazy, very skeptical, and not a linguist.) The included table looks to me like the model is actually dramatically inferior to human performance according to human evaluations when translating from Kalamang (though relatively close on English to Kalamang). It is interesting.
Ignore sample-efficiency (is that the term?) at your own peril. While you are thinking about the training, I wasn’t really talking about the training, I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related? This is sort of related to the few shot prompting. The fewer hints it needs to get up to a high level of performance for something it can’t do from initial training, the more likely it is to be intelligent. Most things in the world are still novel to the AI despite the insane amount of things it saw in training, which is why it makes so many mistakes. We know it can do the things it has seen a billion times (possibly literally) in its training, and that is uninteresting.
I’m glad you think this has been a valuable exchange, because I don’t think I’ve written my points very well. (Both too long and unclear for other reasons at the same time.) I have a feeling that everything I’ve said could be much clearer. (Also, given how much I wrote, a fair bit is probably wrong.) It has been interesting to me responding to your posts and having to actually think through what I think. It’s easy to get lazy when thinking about things just by myself.
I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently.
Interesting, if you happen to have a link I’d be interested to learn more.
Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons?
I like the idea, but it seems hard to judge ‘more novel and [especially] more requiring of intelligence’ other than to sort completions in order of human error on each.
I wasn’t really talking about the training, I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?
I think there’s a lot of work to be done on this still, but there’s some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
I’m glad you think this has been a valuable exchange
I continue to think so :). Thanks again!
Sorry, I don’t have a link for using actual compression algorithms, it was a while ago. I didn’t think it would come up so I didn’t note anything down. My recent spate of commenting is unusual for me (and I don’t actually keep many notes on AI related subjects).
I definitely agree that it is ‘hard to judge’ ‘more novel and more requiring of intelligence’. It is, after all, a major thing we don’t even know how to clearly solve for evaluating other humans (so we use tricks that often rely on other things, and these tricks likely do not generalize to other possible intelligences and thus couldn’t be used here). Intelligence has not been solved.
Still, there is a big difference between the level of intelligence required when discussing how great your favorite popstar is vs what in particular they are good at vs why they are good at it (and within each category there are much more or less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could have thought up good examples, but I couldn’t. You could possibly check how well it completes things like parts of this conversation (which is somewhere in the middle).
I wasn’t really able to properly evaluate your links. There’s just too much they assume that I don’t know.
I found your first link, ‘Transformers Learn In-Context by Gradient Descent’, a bit hard to follow (though I don’t particularly think that is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that a model would come up with similar changes from training and from just ‘reading’ the context, but if both mechanisms are simple, I suppose that makes sense.
Their claim about how in-context learning can ‘curve’ better also reminds me of the ODE solvers used as samplers in diffusion models (I’ve written a number of samplers for diffusion models as a hobby / to work on my programming). Higher-order solvers curve more too (though they have their own drawbacks, and particularly high order is generally a bad idea) by using extra samples, just like this can use extra layers. Gradient descent is effectively first-order by default, right? So it wouldn’t be a surprise if you can curve more than it. You would expect sufficiently general things to resemble each other, of course. I do find it a bit strange just how similar the loss for steps of gradient descent and transformer layers is. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it though.)
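As a tiny, self-contained illustration of the ‘extra evaluations let each step curve more’ point (nothing specific to diffusion samplers; just Euler vs a second-order method on dy/dt = -y):

```python
import math

def euler_step(f, t, y, h):
    """First-order (Euler) step: one evaluation of f."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """Second-order (Heun) step: two evaluations, so each step can 'curve'."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + h * (k1 + k2) / 2

f = lambda t, y: -y  # dy/dt = -y, exact solution y(t) = exp(-t)
for step, name in [(euler_step, "Euler"), (heun_step, "Heun")]:
    t, y, h = 0.0, 1.0, 0.5
    while t < 2.0 - 1e-9:
        y = step(f, t, y, h)
        t += h
    print(f"{name:5s} y(2) = {y:.4f}   (exact {math.exp(-2.0):.4f})")
```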
Even though I can’t critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
The graphs really are shockingly similar in the single-layer case, though, which raises the likelihood that there’s something to it. And the multi-layer ones really do seem like simply a higher-order method.
The second link ‘In-context Learning and Gradient Descent Revisited’, which was equally difficult, has this line “Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence.” Which sounds pretty damning to me, assuming they are correct (which I also can’t evaluate).
I could probably figure them out, but I expect it would take me a lot of time.
Even though I can’t critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
Agreed, that’s definitely a general failure mode.
Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you’ve said. Thank you!
No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don’t necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn’t seem valuable.
Also, you’re welcome.