Note that I am, in general, reluctant to claim to know how I will react to evidence in the future. For cases extreme enough, I do know how I would react, but I like to allow myself to use all the evidence I have at that point rather than what I thought beforehand. I do not currently know enough about what would convince me of intelligence in an AI to say for sure. (In part because many people before me have been so obviously wrong.)
I wouldn’t say I see intelligence as a boolean, but as many-valued… though those values include a level below which there is no meaningful intelligence (i.e., not intelligent). This could be simplified to ternary rather than binary: not intelligent vs. sort of intelligent vs. genuinely intelligent. A rock… not intelligent. An abacus… not intelligent. A regular computer… not intelligent. Every program I have ever personally written… definitely not intelligent. Almost everyone agrees on those. There is much more disagreement about LLMs and other modern AI, but I’d still say they aren’t intelligent. (Sort of intelligent might include particularly advanced animals, but I am unsure; I’ve certainly heard plenty of claims to that effect.)
I do think some of them can be said to understand certain things to a shallow degree despite not being intelligent. For example, LLMs understand what I am asking them to do if I write something in Korean asking them to answer a particular question in English, or vice versa (I tested both directions when LLMs became a real thing, since I am learning Korean, and they often did it well even back then). Similarly, if I tell a well-trained image generation AI that I want a photo, most understand what set of features makes something photographic.
Perhaps I should note that, under my definition, counting as intelligent requires either a very deep understanding of something reasonably broad or notably general intelligence. I generally think people should use the same definitions as each other in these discussions, but the definition should also be the correct one, and that is hard in this case since people do not understand intelligence deeply enough to have a great definition, even when we are just talking about humans. (Sometimes I barely think I qualify as intelligent, especially when reading math or AI papers, but that is a completely different definition. How we define it matters.)
I am highly unlikely to consider a tool AI to be intelligent, especially since I know it doesn’t understand much about things in general. I am utterly convinced that LLMs are simple tool AI at present, as are the other AIs in general use. As far as intelligence goes, modern tool AI might as well be a very complicated program I wrote myself.
I actually see ‘neural nets’ as building a lossy compression scheme out of the data provided for their training; at inference you then supply a novel key that wasn’t actually part of that data and see what happens. I have heard of people getting similar results just using mechanistic pieces of ordinary lossless compression schemes as well, though even more inefficiently. (Basically, you are making a dictionary based on the training data.) Gradient descent seems to allow very limited movement near the real data to still make sense, and that seems to be what most of the other advancements involved are for as well.
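To make the dictionary picture concrete, here is a toy sketch (my own illustration, not the specific scheme I half-remember) of classification with an off-the-shelf lossless compressor: a new input gets the label of whichever stored training text it compresses best alongside.

```python
import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when gzip finds shared structure."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query: str, examples: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor over (text, label) pairs; the 'dictionary' here is
    just the raw training texts themselves."""
    _, label = min(examples, key=lambda ex: ncd(query, ex[0]))
    return label

# Toy usage (distances on strings this short are noisy; illustrative only).
examples = [
    ("the cat sat on the mat and purred at the window", "animals"),
    ("the stock market fell sharply on interest rate fears", "finance"),
]
print(classify("a kitten purred on the rug by the window", examples))
```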
Generally, when testing things like AI for intelligence, we seem to serve up either the easiest or the hardest questions, because we want them either to fail or to succeed based on our own beliefs. And I agree that the way something is obfuscated matters a lot to the difficulty of the question after obfuscation. The questioner is often at fault for how the results turn out, whether or not the thing being questioned is intelligent enough to answer in a neutral setting. (This is true when humans question humans as well.)
I don’t find arbitrary operations as compelling. The problem with arbitrary operations is the obvious fact that they don’t make sense. Under some definitions of intelligence that matters a lot. (I don’t personally know whether it does.) Plus, I don’t know how to judge them perfectly (I’m overly perfectionist in attitude, even though I’ve realized perfection is impossible) when they are arbitrary, except in trivial cases where I can just tell a computer the formula to check. That’s why I like the rediscovery stuff.
Can you make the arbitrary operations fit together perfectly in a sequence like numbers → succession → addition → multiplication, in a way that we can truly know works? And then explain why it works clearly in a few lines? If so, that is much better evidence. (That’s actually an interesting idea. LLMs clearly understand human language if they understand anything, so if they are intelligent and a human would get it, they should be able to do it based on your explanation to humans. Write up an article about the succession, explaining how it makes sense, and then ask questions that extend it in the obvious way.)
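For reference, this is what ‘fitting together in a sequence’ looks like for the standard chain; a minimal sketch of my own, purely illustrative, where each operation is defined only in terms of the previous one and can be checked mechanically:

```python
def succ(n: int) -> int:
    """Succession: the next natural number."""
    return n + 1

def add(a: int, b: int) -> int:
    """Addition as repeated succession."""
    result = a
    for _ in range(b):
        result = succ(result)
    return result

def mul(a: int, b: int) -> int:
    """Multiplication as repeated addition."""
    result = 0
    for _ in range(b):
        result = add(result, a)
    return result

assert add(2, 3) == 5 and mul(2, 3) == 6
# A chain of *arbitrary* operations would need to be checkable in the same
# mechanical way for me to trust the evaluation.
```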
There could be a way in which its wrong answer, or the right answer, was somehow included in the question without my knowing it, because I am not superhuman at next-word prediction (obviously, and I don’t even try). Modern AI has proven itself quite capable of reading into word choice (if it understands anything well, that would be it), and we could get it to answer correctly, like Clever Hans, by massaging the question even subconsciously. (I’m sure this has been pointed out by many people.)
I still do think that these arbitrary operations are a good start, just not definitive. Honestly, in some ways the problem with arbitrary operations is that they are too hard, so at a given difficulty they test human memory and knowledge more than intelligence. If an LLM were actually intelligent, it would be a different kind of intelligence, so we’d have a hard time gauging the results.
So, I think the test where you didn’t know what I was getting at was written in a somewhat unclear manner. Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons? (Possibly separately.) How does it perform on rote word completion? Compare that to how it performs on things requiring a little understanding, then a little more, up until you reach genuinely intellectually challenging and completely novel ideas. How does its ability to complete these sentences change as the task requires more understanding of the world of thoughts and ideas rather than just sentence completion? Obviously it will get worse, but how does the rate of decline compare to humans’? Since it is superhuman on plain sentence completion, if at any point it does worse than a human, that seems like good evidence that it is reaching its limit.
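In pseudocode terms, the comparison I have in mind is roughly the following sketch (the tier structure and the scoring function are hypothetical placeholders, not an existing benchmark):

```python
from typing import Callable, Dict, List

def degradation_profile(
    tiers: Dict[str, List[str]],        # tier name -> completion prompts
    score: Callable[[str], float],      # assumed black box: accuracy in [0, 1]
) -> Dict[str, float]:
    """Mean score per tier, ordered from rote completion to novel ideas."""
    return {name: sum(score(p) for p in prompts) / len(prompts)
            for name, prompts in tiers.items()}

def first_crossover(model_profile: Dict[str, float],
                    human_profile: Dict[str, float],
                    tier_order: List[str]):
    """The first tier where the model falls below humans, despite being
    superhuman on the rote tiers; None if it never does."""
    for tier in tier_order:
        if model_profile[tier] < human_profile[tier]:
            return tier
    return None
```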
One thing I think should be done more for AIs is to give them actual reference materials like dictionaries or the grammar manual in the paper you mentioned. In fact, I think the AI should be trained to write those itself. (I’m sure some people do that. It is not the same as what o1 is doing, because o1’s approach is far too primitive and short-term.)
I do have a major problem with taking the Gemini paper at face value, because every paper in AI makes claims that turn out to be overstated (this is probably normal in all fields, but I don’t read many outside of AI, and those are mostly just specific areas of math). They all sound good but turn out to be not quite what is claimed. (That said, LLMs really are good at translation, though Google Translate doesn’t actually work all that well when used for more than a short phrase, which is funny considering the claim in the paper is for a Google AI.) Google’s AI can’t do Korean well, for instance. (I haven’t tried Gemini, as I had gotten bored of trying LLMs by then.)
From reading their description, I am not entirely sure what their testing procedure was; the writeup seems unclear. But if I’m reading it right, the setup makes it harder to be sure whether the machine translation is correct. Reference translations are a proxy, so comparing the AI translation to them rather than to the actual meaning introduces a bunch of extra noise.
That said, translating Kalamang from a grammar book and dictionary is probably close enough to the kind of thing I was speculating on, assuming there really wasn’t any Kalamang in the training data. Now it needs to be done a bunch of times by neutral parties. (Not me; I’m lazy, very skeptical, and not a linguist.) The included table looks to me like the model is actually dramatically inferior to human performance according to human evaluations when translating from Kalamang (though relatively close on English to Kalamang). It is interesting.
Ignore sample efficiency (is that the term?) at your own peril. While you are thinking about the training, I wasn’t really talking about the training; I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related? This is sort of related to few-shot prompting. The fewer hints it needs to reach a high level of performance on something it can’t do from its initial training, the more likely it is to be intelligent. Most things in the world are still novel to the AI despite the insane amount it saw in training, which is why it makes so many mistakes. We know it can do the things it has seen a billion times (possibly literally) in its training, and that is uninteresting.
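As a rough sketch of how I would quantify that (hypothetical placeholder names; run_task is an assumed black box returning k-shot accuracy on a task absent from training):

```python
from typing import Callable

def shots_to_threshold(run_task: Callable[[int], float],
                       max_shots: int = 32,
                       threshold: float = 0.9) -> int:
    """Smallest number of in-context examples ('hints') needed to clear the
    threshold; -1 if it is never reached. Lower is more interesting to me."""
    for k in range(max_shots + 1):
        if run_task(k) >= threshold:
            return k
    return -1
```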
I’m glad you think this has been a valuable exchange, because I don’t think I’ve written my points very well. (Both too long and, for other reasons, unclear at the same time.) I have a feeling that everything I’ve said could be much clearer. (Also, given how much I wrote, a fair bit is probably wrong.) It has been interesting to respond to your posts and have to actually think through what I think. It’s easy to get lazy when thinking about things just by myself.
> I have heard of people getting similar results just using mechanistic pieces of ordinary lossless compression schemes as well, though even more inefficiently.
Interesting; if you happen to have a link, I’d be interested to learn more.
> Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons?
I like the idea, but it seems hard to judge ‘more novel and [especially] more requiring of intelligence’ other than to sort completions in order of human error on each.
> I wasn’t really talking about the training; I was talking about how well it does on things for which it isn’t trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?
I think there’s a lot of work to be done on this still, but there’s some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
> I’m glad you think this has been a valuable exchange
I continue to think so :). Thanks again!
Sorry, I don’t have a link for using actual compression algorithms; it was a while ago, and I didn’t think it would come up, so I didn’t note anything down. My recent spate of commenting is unusual for me (and I don’t actually keep many notes on AI-related subjects).
I definitely agree that ‘more novel and more requiring of intelligence’ is hard to judge. It is, after all, something we don’t even know how to solve cleanly when evaluating other humans (so we use tricks that often rely on other things, and those tricks likely do not generalize to other possible intelligences, so they couldn’t be used here). Intelligence has not been solved.
Still, there is a big difference between the level of intelligence required to discuss how great your favorite popstar is, vs. what in particular they are good at, vs. why they are good at it (and within each category there are more and less intellectual ways to write about it, though intellectual should not be confused with intelligent). It would have been nice if I could have thought up good examples, but I couldn’t. You could possibly check how well it completes, say, parts of this conversation (which is somewhere in the middle).
I wasn’t really able to properly evaluate your links. There’s just too much they assume that I don’t know.
I found your first link, ‘Transformers Learn In-Context by Gradient Descent’, a bit hard to follow (though I don’t particularly think that is a fault of the paper itself). Once they get down to the details, they lose me. It is interesting that it would come up with similar changes from training and from just ‘reading’ the context, but if both mechanisms are simple, I suppose that makes sense.
Their claim about how in-context learning can ‘curve’ better also reminds me of the ODE solvers used as samplers in diffusion models (I’ve written a number of samplers for diffusion models as a hobby and to work on my programming). Higher-order solvers curve more too (though they have their own drawbacks, and going to particularly high order is generally a bad idea) by using extra samples, just as this can use extra layers. Gradient descent is effectively first-order by default, right? So it wouldn’t be a surprise if you can curve more than it does. You would expect sufficiently general things to resemble each other, of course. I do find it a bit strange just how similar the losses for steps of gradient descent and for transformer layers are. (Random point: I find that loss is not a very good metric for how good the actual results are, at least in image generation/reconstruction. Not that I know of a good replacement. People do often come up with various different ways of measuring it, though.)
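As a concrete version of what I mean by ‘curving more’, here is a minimal sketch of my own (not from the paper) of a first-order Euler step next to a second-order Heun step of the kind some diffusion samplers use, for an ODE dx/dt = f(x, t):

```python
def euler_step(f, x, t, dt):
    """First-order: follow the slope at the current point (a straight line)."""
    return x + dt * f(x, t)

def heun_step(f, x, t, dt):
    """Second-order: take a trial Euler step, then average the two slopes.
    The extra evaluation of f (the 'extra sample') lets the update bend
    with the trajectory instead of moving in a straight line."""
    k1 = f(x, t)
    x_pred = x + dt * k1
    k2 = f(x_pred, t + dt)
    return x + dt * 0.5 * (k1 + k2)

# Plain gradient descent is the Euler analogue: x <- x - lr * grad(x),
# a straight first-order step along the current slope.
```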
Even though I can’t critique the details, I think it is worth noting that, in areas I understand better, I often find claims of similarity like this not very illuminating, because people want to find similarities/analogies in order to understand things more easily.
The graphs really are shockingly similar in the single-layer case, though, which raises the likelihood that there’s something to it. And the multi-layer ones really do seem like simply a higher-order version of the same thing.
The second link, ‘In-context Learning and Gradient Descent Revisited’, which was equally difficult, has this line: “Surprisingly, we find that untrained models achieve similarity scores at least as good as trained ones. This result provides strong evidence against the strong ICL-GD correspondence.” That sounds pretty damning to me, assuming they are correct (which I also can’t evaluate).
I could probably figure them out, but I expect it would take me a lot of time.
> Even though I can’t critique the details, I think it is worth noting that, in areas I understand better, I often find claims of similarity like this not very illuminating, because people want to find similarities/analogies in order to understand things more easily.
Agreed, that’s definitely a general failure mode.
Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you’ve said. Thank you!
No problem with the failure to respond. I appreciate that this way of communicating is asynchronous (and I don’t necessarily reply to things promptly either). And I think it would be reasonable to drop it at any point if it didn’t seem valuable.
Also, you’re welcome.