I’m a bit confused about this post. Are you saying it is theoretically impossible to create an LLM that can do 3*3 matrix multiplication without using chain of thought? That seems false.
The amount of computation an LLM has done so far is a function of both the size of the LLM (call it the s factor) and the number of tokens generated so far (n). Let’s say multiplication of n*n matrices requires cn^3 computation (there are more efficient algorithms, but that doesn’t matter here).
You can cover this cost either by using a small LLM with n^3 tokens, so that sn^3 > cn^3, or by using a bigger LLM, so that s_big*n > cn^3, in which case you only need n tokens.
In general, you can always get a bigger and bigger constant factor to solve problems with higher n.
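The tradeoff in the two cases above can be made concrete with a few lines of arithmetic. This is a minimal sketch with illustrative constants (the function name and the specific values of s are made up for the example, not measurements of any real model):

```python
import math

def tokens_needed(s: float, n: int, c: float = 1.0) -> int:
    """Tokens required so that s * tokens >= c * n^3,
    where s is per-token compute and c*n^3 is the matmul cost."""
    return math.ceil(c * n**3 / s)

# A small model with s = n needs ~n^2 tokens; a bigger model with
# s = n^2 needs only ~n tokens, matching the s_big*n > cn^3 case.
n = 100
print(tokens_needed(s=n, n=n))      # ~n^2 tokens for the small model
print(tokens_needed(s=n**2, n=n))   # ~n tokens for the big model
```

The point of the sketch is just that any fixed per-token budget s can be beaten by a larger s, at the cost of a bigger model.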
If your claim is that, for any LLM that works in the same way as GPT, there will exist a value of n for which it will stop being capable of doing n*n matrix multiplication without chain of thought/extra work, I’d cautiously agree.
An LLM takes the same amount of computation for each generated token, regardless of how hard it is to predict. This limits the complexity of any problem an LLM is trying to solve.
For a given LLM, yes, there will be a limit on the amount of computation it can do while generating a token. But this amount can be arbitrarily large. And you can always make a bigger/smarter LLM.
Consider two statements:
“The richest country in North America is the United States of ______”
“The SHA1 of ‘abc123’, iterated 500 times, is _______”
An LLM’s goal is to predict the best token to fill in the blank given its training and the previous context. Completing statement 1 requires knowledge about the world but is computationally trivial. Statement 2 requires a lot of computation. Regardless, the LLM performs the same amount of work for either statement.
Well, it might do the same amount of “computation” but for problem 1 that might mostly be filler work while for problem 2 it can do intelligent work. You can always use more compute than necessary, so why does statement 1 using the same amount of compute as statement 2 imply any sort of limitation on the LLM?
It cannot correctly solve computationally hard statements like #2. Period. If it could, that would imply that all problems can be solved in constant time, which is provably (and obviously) false.
Why does this matter? It puts some bounds on what an LLM can do.
But it’s easy to imagine a huge LLM capable of doing 500 iterations of SHA1 on small strings in one shot (even without memorization). Why do you think that’s impossible? (Just imagine a transformer circuit that computes SHA1, repeated 500 times.) This doesn’t imply that all problems can be solved in constant time. It just means the LLM will only be able to do this until the length of the string exceeds a certain limit. After that, you’ll need to make the LLM bigger/smarter.
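For concreteness, the computation being debated fits in a few lines. This is one plausible reading of the original statement, since “iterated 500 times” doesn’t specify whether each round hashes the raw digest or its hex encoding; the sketch below assumes raw digest bytes:

```python
import hashlib

def iterated_sha1(data: bytes, rounds: int) -> str:
    """Apply SHA1 repeatedly, feeding each raw digest back in."""
    for _ in range(rounds):
        data = hashlib.sha1(data).digest()
    return data.hex()

print(iterated_sha1(b"abc123", 500))
```

A fixed circuit that unrolls this loop 500 times is exactly the construction being imagined: large, but finite, for inputs up to some bounded length.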
Thanks for the deep and thoughtful comment. I hadn’t considered that “In general, you can always get a bigger and bigger constant factor to solve problems with higher n.”
I’m going to think carefully and then see how this changes my thinking. I’ll try to reply again soon.
This doesn’t matter much, as the constant factor needed still grows as fast as the asymptotic bound. GPT does not have a big enough constant factor. (This objection has always been true of asymptotic bounds.)
The solution is IMO just to consider the number of computations performed per generated token as some function of the model size, and once we’ve identified a suitable asymptotic order on the function, we can say intelligent things like “the smallest network capable of solving a problem in complexity class C of size N is X”.
Or if our asymptotic bounds are not tight enough:
“No economically feasible LLM can solve problems in complexity class C of size >= N”.
(Where economically feasible may be something defined by aggregate global economic resources or similar, depending on how tight you want the bound to be.)
Regardless, we can still obtain meaningful impossibility results.
Okay, you raise a very good point. To analogize to my own brain: it’s like noticing that I can multiply integers 1-20 in my head in one step, but for larger numbers I need to write it out. Does that mean that my neural net can do multiplication? Well, as you say, it depends on n.
it’s easy to imagine a huge LLM capable of doing 500 iterations of SHA1 on small strings in one shot
Nitpick: for SHA1 (and any other cryptographic hash function), I can’t fathom how an LLM could learn it through SGD, as opposed to having it hand-coded. To do SHA1 correctly you need to implement its internals exactly; being off by even a little produces a completely incorrect output. It’s all or nothing, so there’s no way to gradually approach getting it right, so there’s no gradient to descend.
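The “all or nothing” property is easy to demonstrate: it’s the avalanche effect of cryptographic hashes. In this sketch the two inputs differ in a single bit, yet roughly half of the 160 output bits flip, so “almost right” internals look no better to a loss function than completely wrong ones:

```python
import hashlib

def sha1_bits(data: bytes) -> int:
    """SHA1 digest interpreted as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

# 'abc123' and 'abc122' differ in exactly one bit (0x33 vs 0x32).
a = sha1_bits(b"abc123")
b = sha1_bits(b"abc122")
flipped = bin(a ^ b).count("1")
print(f"{flipped} of 160 output bits differ")
```

Any smooth interpolation between the two inputs gives outputs that are essentially uncorrelated with either target, which is why there is no useful gradient signal.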
But your overall point still stands. It is theoretically possible for a transformer to learn any function, so this is not a fundamental upper bound, and you are therefore correct that a large enough LLM could do any of these for small n. I wonder if this is going to be a real capability of SOTA LLMs, or will it be one of those “possible in theory, but many orders of magnitude off in practice” things.
Ultimately the question I’m thinking towards is whether an LLM could do the truly important/scary problems. I care less whether an LLM can multiply two 1Mx1M (or 3x3) matrices, and more whether it can devise & execute a 50-step plan for world domination, or make important new discoveries in nanotech, or make billions of dollars in financial markets, etc.
I don’t know how to evaluate the computational complexity of these hard problems. I also don’t know whether exploring that question would help the capabilities side more than the alignment side, so I need to think carefully before answering.
Agree. If GPT-4 can solve 3-dim matrix multiplication with chain-of-thought, then doesn’t that mean you could just take the last layer’s output (before generating a single token from it), feed it into other instances of GPT-4, and then chain together their outputs? That should by definition be enough “internal state-keeping” that you wouldn’t need the “note-keeping” of chain-of-thought. And that’s precisely bayesed’s point: from the outside, that kind of construct would just look like a bigger LLM. I think this is a clever post, but the bottlenecking created by token generation is too arbitrary a way to assess LLM complexity.
The LLM outputs are out of distribution for its input layer. There is some research happening in deep model communication, but it has not yielded fruit yet AFAIK.