“Is intelligence NP hard?” is a very important question with too little engagement from the LW/AI safety community. NP hardness:
Bounds attainable levels of intelligence (just how superhuman is superintelligence?)
Bounds physically and economically feasible takeoff speeds (i.e. exponentially growing resource investment is required for linear improvements in intelligence; see the toy sketch below)
I’d operationalise “is intelligence NP hard?” as:
Does there exist some subset of the computational problems underlying core cognitive tasks whose [expected] (time) complexity is NP-hard?
I want to write a LW question post on this sometime.
For now, I want to learn more about complexity theory and let my thoughts on the complexity of intelligence gestate further.
Gwern’s “Complexity no Bar to AI” argues convincingly that complexity theory doesn’t prohibit AGI/superhuman AI, but not that:
The attainable level of intelligence is as far above humans as humans are above ants (or other animal of choice)
Fast takeoff is feasible
As best I can tell (I didn’t finish it, but will reread it after I’m done with my current complexity paper), Gwern did not argue convincingly that intelligence is not NP-hard.
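To put the takeoff-speed bullet above in concrete terms, here’s a minimal toy sketch. The exponential cost model and the “largest solvable instance size” definition of intelligence are illustrative assumptions of mine, not claims I’m committing to; the point is just that if the underlying problems cost exponential time, then intelligence so defined grows only logarithmically in compute.

```python
import math

# Toy model (illustrative assumption): core cognitive problems cost ~2^n steps
# at "difficulty level" n, and an agent's "intelligence" is the largest level
# it can handle within its compute budget.
def attainable_level(compute_budget: float, cost_base: float = 2.0) -> float:
    """Largest difficulty level n such that cost_base**n <= compute_budget."""
    return math.log(compute_budget, cost_base)

for budget in [1e9, 1e12, 1e15, 1e18]:  # each step is a 1000x bigger investment
    print(f"compute {budget:.0e} -> attainable level {attainable_level(budget):.1f}")
# Each 1000x jump in compute buys the same ~10-level increment: exponentially
# growing resources for roughly linear gains in "intelligence".
```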
One question is: “Can a team of one hundred 10×-sped-up John von Neumann-level intelligent agents, running on computer chips and working together, wipe out humanity if they really wanted to?” It’s an open question, but I really think the answer is “yes” because (blah blah, pandemics, crop diseases, nuclear war, etc.; see here). I don’t think NP-hardness matters. You don’t need to solve any NP-hard problems to make and release 20 pandemics simultaneously; that’s a human-level problem, or at least in the ballpark of human-level.
And then another question is: “How many 10×-sped-up John von Neumann-level intelligent agents can you get from the existing stock of chips in the world?” That’s an open question too. I wrote this post recently on the topic. (Note the warning at the top; I can share a draft of the follow-up-post-in-progress, but it won’t be done for a while.) Anyway I’m currently expecting “hundreds of thousands, maybe much more”, but reasonable people can disagree. If I’m right, then that seems more than sufficient for a fast takeoff argument to go through, again without any speculation about what happens beyond human-level intelligence.
And then yet another question is: “Might we program an agent that’s much much more ‘insightful’ than John von Neumann, and if so, what real-world difference will that extra ‘insight’ make?” OK, now this is much more speculative. My hunch is “Yes we will, and it will make a very big real-world difference”, but I can’t prove that. I kinda think that if John von Neumann could hold even more complicated ideas in his head, then he would find lots of low-hanging-to-him fruit in developing powerful new science & technology. (See also brief discussion here.) But anyway, my point is, I’m not sure much hinges on this third question, because the previous two questions seem sufficient for practical planning / strategy purposes.
To be clear, I don’t think the complexity of intelligence matters for whether we should work on AI existential safety, and I don’t think it guarantees alignment by default.
I think it can confer longer timelines and/or slower takeoff, and both seem to reduce P(doom) but mostly by giving us more time to get our shit together/align AI.
I do think complexity of intelligence threatens Yudkowskian foom, but that’s not the only AI failure mode.
A chunk of why my timelines are short involves a complexity argument:
Current transformer-based LLMs, by virtue of always executing the same steps to predict the next token, run in constant time.
Our current uses of LLMs tend to demand a large amount of “intelligence” within the scope of a single step: sequences of English tokens are not perfectly natural representations of complex reasoning, and many prompts attempt to elicit immediate answers to computationally difficult questions (consider prompting an LLM with “1401749 * 23170802 = ” without any kind of chain of thought or fine-tuning).
LLMs are still remarkably capable under this extremely harsh limitation, at least within the scope of how we’re currently using them.
This seems like really strong empirical evidence that a lot of the kind of intelligence we care about is not just not NP-hard, but expressible in constant time.
In this framing, I’m basically quantifying “intelligence” as something like “how much progress is made during each step in the algorithm of problem solving.” There may exist problems that require non-constant numbers of reasoning steps, and the traditional transformer LLM is provably incapable of solving such problems in one token prediction (e.g. multiplying large integers), but this does not impose a limitation on capabilities over a longer simulation.
I suspect there are decent ways of measuring the complexity of the “intelligence” required for any particular prediction, but it’s adjacent to some stuff that I’m not 100% comfy signal-boosting publicly; feel free to DM me if interested.
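A minimal sketch of the constant-time point, using rough FLOP accounting (the breakdown below is an approximation of mine, with GPT-3-scale shape numbers plugged in purely for illustration): the per-token cost is fixed by architecture constants and the bounded context window, and has no dependence on what the prompt actually says.

```python
# Rough FLOP count for one forward pass of a decoder-only transformer.
# Cost depends only on architecture constants and the (bounded) context length,
# never on which tokens are in the prompt.
def flops_per_token(n_layers: int, d_model: int, d_ff: int,
                    vocab: int, context: int) -> int:
    attn = 4 * d_model * d_model + 2 * d_model * context  # QKV/out projections + score/mix
    mlp = 2 * d_model * d_ff                               # up- and down-projection
    per_layer = 2 * (attn + mlp)                           # ~2 FLOPs per multiply-add
    unembed = 2 * d_model * vocab
    return n_layers * per_layer + unembed

# Same budget whether the prompt is "the cat sat on the" or "1401749 * 23170802 = ":
budget = flops_per_token(n_layers=96, d_model=12288, d_ff=4 * 12288,
                         vocab=50257, context=2048)
print(f"~{budget:.3e} FLOPs per token, a constant fixed by the architecture")
```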
I’m suspicious of this. It seems obviously not true on priors, and like an artifact of:
Choosing a large enough constant
Fixed/bounded input sizes
But I don’t understand well how a transformer works, so I can’t engage with this on the object level.
You’re correct that it arises because we can choose a large enough constant (proportional to parameter count, which is a constant with respect to inference) and because we have bounded context windows. Not all large language models must be constant time, nor are they.
The concerning part is that all the big name ones I’m aware of are running in constant time (per token) and still manage to do extremely well. Every time we see some form of intelligence expressed within a single token prediction on these models, we get a proof that that kind of intelligence is just not very complicated.
I just don’t intuitively follow. It violates my intuitions about algorithms and complexity.
Does this generalise? Would it also be constant time per token if it was generating outputs a million tokens long?
Does the time per token vary with the difficulty of the prediction task? Not all prediction tasks should be equally difficult, so if cost doesn’t vary, that also warrants explanation.
I just don’t buy the constant time hypothesis/formulation. It’s like: “if you’re getting that result, you’re doing something illegitimate or abusing the notion of complexity”.
Constant time per token, generalised asymptotically, becomes linear complexity, and there are problems that we know are worse than linear. It’s like this result just isn’t plausible?
Does this generalise? Would it also be constant time per token if it was generating outputs a million tokens long?
Yes, if you modified the forward pass to output a million tokens at once, it would remain constant time so long as the forward pass’s execution remained bounded by a constant. Likewise, you could change the output distribution to cover tokens of extreme length. Realistically, though, such an architecture wouldn’t be practical: either it would be enormous and slow, or its predictions would suck.
Does the time per token vary with the difficulty of the prediction task? Not all prediction tasks should be equally difficult, so if cost doesn’t vary, that also warrants explanation.
No, a given GPT-like transformer always does exactly the same thing in the forward pass. GPT-3 does not have any kind of adaptive computation time within a forward pass. If a single token prediction requires more computation steps than fit in the (large) constant time available to the forward pass, the transformer cannot fully complete the prediction. This is near the core of the “wait, what?” response I had to GPT-3’s performance.
Note that when you prompt GPT-3 with something like “1401749 x 23170802 = ”, it will tend to give you a prediction which matches the shape of the answer (it’s a number, and a fairly big one), but beyond that rough approximation, it’s pretty much always going to be wrong. Even if you fine-tuned GPT-3 on arithmetic, you would still be able to find two integers of sufficient size that they cannot be multiplied in one step, because the number of internal steps required exceeds the number of steps the forward pass can express. (A rough op-count sketch of this is below.)
The output distribution will cover a wide smear of tokens corresponding to approximately-correct big numbers. It can’t compute which one is right, so the probability distribution can’t narrow any further.
(Raw GPT-3 also isn’t interested in being correct except to the extent that being correct corresponds to a good token prediction, so it won’t bother with trying to output intermediate tokens that could let it perform a longer non-constant-time computation. The prompt makes it look like the next token should be an answer, not incremental reasoning, so it’ll sample from its smear of answer-shaped tokens.)
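Here’s that rough op-count sketch (the fixed “budget” below is a made-up placeholder, and collapsing the forward pass’s depth/width limits into a single scalar is a deliberate caricature): schoolbook multiplication of two n-digit numbers needs on the order of n² digit operations, so for large enough n it must exceed any fixed per-forward-pass budget.

```python
def schoolbook_ops(a: int, b: int) -> int:
    """Digit-level partial products in schoolbook multiplication (ignoring carries)."""
    n = max(len(str(a)), len(str(b)))
    return n * n

FIXED_BUDGET = 10_000  # hypothetical per-forward-pass step budget for this sub-task

for digits in [7, 100, 1_000]:
    a = b = 10 ** digits - 1  # worst-case n-digit operands
    verdict = "fits" if schoolbook_ops(a, b) <= FIXED_BUDGET else "exceeds"
    print(f"{digits}-digit multiply: ~{schoolbook_ops(a, b)} ops, {verdict} the fixed budget")
```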
“if you’re getting that result, you’re doing something illegitimate or abusing the notion of complexity”
It can feel that way a bit due to the scale (like it’s technically true but meaningless, maybe), but it is directly visible in their behavior, and it has major implications for how these things could advance. The constants we’re talking about here aren’t actually so pathological that they make the analysis pointless; a quadrillion FLOPs isn’t that many compared to an H100’s throughput.
It’s like this result just isn’t plausible?
If I had to guess, that feeling probably arises from how I’m using the word “intelligence” in the context of a single forward pass, while it sounds like you’re using the word to describe something more like… the capability of the model over a full simulation that takes many forward passes to complete. The latter is not bounded by constant time even in the LLMs I’m talking about, but I think the capabilities exhibited in constant time by current LLMs are sufficient to think that huge chunks of “whatever is the thing we care about when we talk about intelligence” are actually pretty darn easy, computationally.
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M token context (attention is superlinear, but takes up a negligible proportion of FLOPs in current models at the context lengths they are trained for; also, while nobody has succeeded at it yet, linear attention does not seem implausible). No current SOTA model has adaptive compute. There has been some work in this direction (see Universal Transformers), but it doesn’t work well enough for people to use it in practice.
For all practical purposes, it takes O(N+M) compute to generate N tokens from an M token context
Yup. I suspect that’s close to the root of the confusion/apparent disagreement earlier: when I say constant time, I mean constant with respect to input, given a particular model and bounded context window, for a single token.
I think doing the analysis at this level is often more revealing than doing it across full trajectories or across arbitrary windows: a tight bound makes it easier to make claims about what’s possible by existence proof (and that turns out to be a lot).
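To make the O(N+M) claim concrete, here’s a rough cost model (an approximation of mine, with GPT-3-ish shape numbers for illustration): the dense matmul term grows linearly in total tokens, while the attention term is quadratic in sequence length but remains a small fraction of the total at the context lengths current models are trained for.

```python
def generation_flops(params: float, d_model: int, n_layers: int,
                     m_context: int, n_new: int) -> tuple[float, float]:
    """Rough split of generation cost into dense (linear) and attention (quadratic) parts."""
    total_len = m_context + n_new
    dense = 2 * params * total_len                   # ~2 FLOPs per parameter per token
    attn = 4 * n_layers * d_model * total_len ** 2   # scores + value mixing, 2 FLOPs per MAC
    return dense, attn

dense, attn = generation_flops(params=175e9, d_model=12288, n_layers=96,
                               m_context=1500, n_new=500)
print(f"dense ~{dense:.2e} FLOPs, attention ~{attn:.2e} FLOPs "
      f"(~{attn / dense:.1%} of the dense term)")
```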
Without knowing the actual optimal curve of computational cost per “unit” of intelligence (and WTF that even means), it’s not too useful to know whether it’s polynomial or not. There are LOTS of NP-hard problems that humans have “solved” numerically or partially for real-world uses, at scales that are economically important. They don’t scale perfectly, and they’re often not provably optimal, but they work.
It’s hard to figure out the right metrics for world-modeling-and-goal-optimization that would prevent AGI from taking over or destroying most value for biological agents, and even harder to have any clue whether the underlying constraint being NP-hard matters at all in the next millennium. It probably WILL matter at some point, but it could be 3 or 4 takeovers or alien-machine-intelligence-discoveries away.
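One concrete instance of this (a toy example of mine, not from the comment): the travelling salesman problem is NP-hard in general, yet a crude nearest-neighbour heuristic produces usable (not provably optimal) tours in O(n²) time, which is roughly the position industry is in for a lot of real routing and scheduling problems.

```python
import math
import random

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(200)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbour_tour(points):
    """Greedy O(n^2) heuristic: always hop to the closest unvisited city."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: dist(last, points[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

tour = nearest_neighbour_tour(cities)
length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
             for i in range(len(tour)))
print(f"200-city tour of length {length:.2f} found without any exact solver")
```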
There are LOTS of NP-hard problems that humans have “solved” numerically or partially for real-world uses, at scales that are economically important. They don’t scale perfectly, and they’re often not provably optimal, but they work.
These are among the considerations I would address when I get around to writing the post:
Empirical probability distributions over inputs
Weighting problems by their economic importance
Approximate solutions
Probabilistic solutions
Etc.
All of these complicate the analysis (you’d probably want a framework for determining complexity that natively handles probabilistic/approximate solutions, so maybe input size in bits and “work done by optimisation in bits”), but even with all these considerations you can still define a coherent notion of “time complexity”.
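One way “work done by optimisation in bits” could be cashed out (this particular measure is a suggestion of mine in the spirit of the LW “optimization power” idea, not something I’m committing the framework to): count how improbable it would be to do at least as well by drawing from some baseline distribution, and take the negative log.

```python
import math
import random

def optimisation_bits(found_score: float, baseline_scores: list[float]) -> float:
    """-log2 of the fraction of baseline draws scoring at least as well as the found solution."""
    at_least_as_good = sum(s >= found_score for s in baseline_scores)
    return -math.log2(max(at_least_as_good, 1) / len(baseline_scores))

random.seed(0)

def objective(x: float) -> float:
    return -(x - 0.3) ** 2  # toy objective, maximised at x = 0.3

baseline = [objective(random.random()) for _ in range(1_000_000)]  # random-search baseline
print(f"{optimisation_bits(objective(0.3001), baseline):.1f} bits of optimisation")
```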
Just messaged my lecturer, I’ll try and see if I can get permission for such a project.