Obvious bait is obvious bait, but here goes.
Transformers are not AGI because they will never be able to “figure something out” the way humans can.
If a human is given the rules for Sudoku, they first try filling in the squares randomly. After a while, they notice that certain things work and certain things don’t work. They begin to define heuristics for things that work (for example, if all but one number appears in the same row or column as a box, that number goes in the box). Eventually they work out a complete algorithm for solving Sudoku.
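To make the example heuristic concrete, here is a minimal Python sketch of it. The function names and the grid representation (a 9x9 list of lists with 0 for an empty cell) are assumptions for illustration, and the 3x3 box check is added from the standard Sudoku rules rather than from the comment above, which only mentions rows and columns.

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def fill_forced_cells(grid):
    """Fill every empty cell (0) that has exactly one candidate left.

    Repeats until a full pass fills nothing; returns how many cells were filled.
    """
    filled = 0
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    cand = candidates(grid, r, c)
                    if len(cand) == 1:
                        grid[r][c] = cand.pop()
                        filled += 1
                        changed = True
    return filled
```

This rule alone only finishes easy puzzles; harder ones need guessing and backtracking on top of it.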
A transformer will never do this (pretending Sudoku wasn’t in its training data). Because transformers are next-token predictors, they are fundamentally incapable of reasoning about things not in their training set. They are incapable of “noticing when they made a mistake” and then backtracking the way a human would.
Now it’s entirely possible that a very small wrapper around a Transformer could solve Sudoku. You could have the transformer suggest moves and then add a reasoning/planning layer around it to handle the back-tracking. This is effectively what Alpha-Geometry does.
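A rough sketch of what such a wrapper could look like, in Python. Everything here is an assumption for illustration: `propose_moves` is a hypothetical stand-in for the transformer (it just returns 1 through 9 in order), and the wrapper is plain depth-first backtracking, not a description of how AlphaGeometry actually works.

```python
def is_legal(grid, r, c, d):
    """True if digit d can go in cell (r, c) without breaking a Sudoku rule."""
    if d in grid[r] or any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d
               for i in range(br, br + 3)
               for j in range(bc, bc + 3))


def propose_moves(grid, r, c):
    """Hypothetical stand-in for the model: an ordered list of digits to try."""
    return list(range(1, 10))


def solve(grid, propose=propose_moves):
    """Fill the grid in place; the wrapper, not the proposer, does the backtracking."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in propose(grid, r, c):
                    if is_legal(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid, propose):
                            return True
                        grid[r][c] = 0  # the suggestion led to a dead end: undo it
                return False  # no proposed digit works here, so backtrack further
    return True  # no empty cells left
```

The proposer never has to notice its own mistakes in this setup. A better proposer only changes how many dead ends the wrapper has to undo.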
But a Transformer BY ITSELF will never be AGI.
Yeah, I didn’t do a very good job in this respect. I am not intending to talk about a transformer by itself. I am intending to talk about transformers with the sorts of bells and whistles that they are currently being wrapped with. So not just transformers, but also not some totally speculative wrapper.
It seems likely to me that you could create a prompt that would have a transformer do this.
In the technical sense that you can implement arbitrary programs by prompting an LLM (they are Turing complete), sure.
In a practical sense, no.
GPT-4 can’t even play tic-tac-toe. Manifold spent a year getting GPT-4 to implement (much less discover) the algorithm for Sudoku and failed.
Now imagine trying to implement a serious backtracking algorithm. Stockfish checks millions of positions per turn of play. The attention window for your “backtracking transformer” is going to have to be at least {size of chess board state}*{number of positions evaluated}.
And because of quadratic attention, training it is going to take on the order of {number of parameters}*({chess board state size}*{number of positions evaluated})^2.
Even with very generous assumptions for {number of parameters} and {chess board state}, there’s simply no way we could train such a model this century (and that’s assuming Moore’s law somehow continues that long).
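For anyone who wants to plug numbers into the expression above, here is a small Python sketch. The specific values (64 tokens per board state, one million positions, one billion parameters) are illustrative assumptions, not figures from the comment, and the comment does not say what unit the expression counts or how many training sequences would be needed, so the result is only a relative scale.

```python
def context_tokens(board_state_tokens, positions_evaluated):
    """Window size from the comment: board state size times positions evaluated."""
    return board_state_tokens * positions_evaluated


def training_cost_estimate(n_params, board_state_tokens, positions_evaluated):
    """The comment's order-of-magnitude expression: params * (window size)^2."""
    n = context_tokens(board_state_tokens, positions_evaluated)
    return n_params * n ** 2


# Illustrative assumptions only: 64 tokens per board, 1e6 positions, 1e9 parameters.
window = context_tokens(64, 1_000_000)
cost = training_cost_estimate(1_000_000_000, 64, 1_000_000)
print(f"window = {window:,} tokens, cost expression = {cost:.3e}")
```

Because the window size is squared, doubling the number of positions evaluated multiplies the expression by four.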
The question is: how far can we get with in-context learning? If we filled Gemini’s 10 million tokens with Sudoku rules and examples, showing where it went wrong each time, would it generalize? I’m not sure, but I think it’s possible.
It certainly wouldn’t generalize to e.g. Hidouku.
I agree that filling a context window with worked sudoku examples wouldn’t help for solving hidouku. But, there is a common element here to the games. Both look like math, but aren’t about numbers except that there’s an ordered sequence. The sequence of items could just as easily be an alphabetically ordered set of words. Both are much more about geometry, or topology, or graph theory, for how a set of points is connected. I would not be surprised to learn that there is a set of tokens, containing no examples of either game, combined with a checker (like your link has) that points out when a mistake has been made, that enables solving a wide range of similar games.
I think one of the things humans do better than current LLMs is that, as we learn a new task, we vary what counts as a token and how we nest tokens. How do we chunk things? In sudoku, each box is a chunk, each row and column are a chunk, the board is a chunk, “sudoku” is a chunk, “checking an answer” is a chunk, “playing a game” is a chunk, and there are probably lots of others I’m ignoring. I don’t think just prompting an LLM with the full text of “How to solve it” in its context window would get us to a solution, but at some level I do think it’s possible to make explicit, in words and diagrams, what it is humans do to solve things, in a way legible to it. I think it largely resembles repeatedly telescoping in and out, to lower and higher abstractions applying different concepts and contexts, locally sanity checking ourselves, correcting locally obvious insanity, and continuing until we hit some sort of reflective consistency. Different humans have different limits on what contexts they can successfully do this in.
Absolutely. I don’t think it’s impossible to build such a system. In fact, I think a transformer is probably about 90% there. Need to add trial and error, some kind of long-term memory/fine-tuning and a handful of default heuristics. Scale will help too, but no amount of scale alone will get us there.
GPT-4 can play tic-tac-toe
https://chat.openai.com/share/75758e5e-d228-420f-9138-7bff47f2e12d
Sure. 4000 words (~8000 tokens) to do a 9-state 9-turn game with the entire strategy written out by a human. Now extrapolate that to chess, go, or any serious game.
And this doesn’t address at all my actual point, which is that Transformers cannot teach themselves to play a game.
Ok? That’s how you teach anybody anything.
LLMs can play chess and poker just fine. gpt-3.5-turbo-instruct plays at about 1800 Elo, consistently making legal moves: https://github.com/adamkarvonen/chess_gpt_eval
Then there is this grandmaster-level chess transformer: https://arxiv.org/abs/2402.04494
Poker: https://arxiv.org/abs/2308.12466
Oh so you wrote/can provide a paper proving this or..?
This is kind of the problem with a lot of these discussions: wild confidence on ability estimation from what is ultimately just gut feeling. You said GPT-4 couldn’t play tic-tac-toe. Well, it can. You said it would be impossible to train a chess playing model this century. Already done.
Now you’re saying Transformers can’t “teach themselves to play a game”. There is 0 theoretical justification for that stance.
Have you never figured out something by yourself? The way I learned to do Sudoku was: I was given a book of Sudoku puzzles and told “have fun”.
I didn’t say it was impossible to train an LLM to play Chess. I said it was impossible for an LLM to teach itself to play a game of similar difficulty to chess if that game is not in its training data.
These are two wildly different things.
Obviously LLMs can learn things that are in their training data. That’s what they do. Obviously if you give LLMs detailed step-by-step instructions for a procedure that is small enough to fit in its attention window, LLMs can follow that procedure. Again, that is what LLMs do.
What they do not do is teach themselves things that aren’t in their training data via trial-and-error. Which is the primary way humans learn things.
It seems like this would be because the transformer weights are fixed and we have not built a mechanism for the model to record things it needs to learn to improve performance or an automated way to practice offline to do so.
It’s just missing all this, like a human patient with large sections of their brain surgically removed. Doesn’t seem difficult or long term to add this, does it? How many years before one of the competing AI labs adds some form of “performance enhancing fine tuning and self play”?
Less than a year. They probably already have toy models with periodically or continuously updating weights.
So few-shot + scratchpad?
More gut claims.
Setting up the architecture that would allow a pretrained LLM to trial and error whatever you want is relatively trivial. Current state of the art isn’t that competent, but the backbone for this sort of work is there. Solve rates for Sudoku and Game of 24 are much higher with Tree of Thoughts, for instance. There’s stuff for Minecraft too.
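As a rough illustration of the trial-and-error loop being described, here is a Python sketch for Game of 24. It is not the Tree of Thoughts method itself: `propose_steps` is a hypothetical stand-in for the model's proposals that simply enumerates every way to combine two numbers, and the search is plain exhaustive depth-first search with no learned evaluation or pruning.

```python
from fractions import Fraction
from itertools import combinations


def propose_steps(nums):
    """Hypothetical stand-in for model proposals: combine any two numbers into one."""
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        results = {a + b, a * b, a - b, b - a}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        for r in results:
            yield rest + [r]


def solvable(nums, target=24):
    """Try proposed steps depth-first; a failed branch is abandoned and the next is tried."""
    def search(state):
        if len(state) == 1:
            return state[0] == target  # the checker: did this branch hit the target?
        return any(search(nxt) for nxt in propose_steps(state))

    # Exact fractions avoid floating-point error in cases like 8 / (3 - 8/3).
    return search([Fraction(n) for n in nums])
```

For example, `solvable([4, 7, 8, 8])` succeeds via (7 - 8/8) * 4. Swapping the enumerator for a model that proposes only a few promising steps is where the trial and error would actually get interesting.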
I agree. Or at least, I don’t see any reason why not.
My point was not that “a relatively simple architecture that contains a Transformer as the core” cannot solve problems via trial and error (in fact I think it’s likely such an architecture exists). My point was that transformers alone cannot do so.
You can call it a “gut claim” if that makes you feel better. But the actual reason is I did some very simple math (about the window size required and given quadratic scaling for transformers) and concluded that practically speaking it was impossible.
Also, importantly, we don’t know what that “relatively simple” architecture looks like. If you look at the various efforts to “extend” transformers to general learning machines, there are a bunch of different approaches: alpha-geometry, diffusion transformers, baby-agi, voyager, dreamer, chain-of-thought, RAG, continuous fine-tuning, V-JEPA. Practically speaking, we have no idea which of these techniques is the “correct” one (if any of them are).
In my opinion saying “Transformers are AGI” is a bit like saying “Deep learning is AGI”. While it is extremely possible that an architecture that heavily relies on Transformers and is AGI exists, we don’t actually know what that architecture is.
Personally, my bet is either on a sort of generalized alpha-geometry approach (where the transformer generates hypotheses and then GOFAI is used to evaluate them) or Diffusion Transformers (where we iteratively de-noise a solution to a problem). But I wouldn’t be at all surprised if a few years from now it is universally agreed that some key insight we’re currently missing marks the dividing line between Transformers and AGI.
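If you’re talking about this:
“Now imagine trying to implement a serious backtracking algorithm. Stockfish checks millions of positions per turn of play. The attention window for your ‘backtracking transformer’ is going to have to be at least {size of chess board state}*{number of positions evaluated}. And because of quadratic attention, training it is going to take on the order of {number of parameters}*({chess board state size}*{number of positions evaluated})^2”
then that’s just irrelevant. You don’t need to evaluate millions of positions to backtrack (unless you think humans don’t backtrack) or play chess.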
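“My point was not that ‘a relatively simple architecture that contains a Transformer as the core’ cannot solve problems via trial and error (in fact I think it’s likely such an architecture exists). My point was that transformers alone cannot do so.”
There’s nothing the former can do that the latter can’t. “Architecture” is really overselling it, but I couldn’t think of a better word. It’s just function calling.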
Humans are not transformers. The “context window” for a human is literally their entire life.
Not really. The majority of your experiences and interactions are forgotten and discarded, the few that aren’t are recalled and triggered by the right input when necessary and not just sitting there in your awareness at all times. Those memories are also modified at every recall.
And that’s really just beside the point. However you want to spin it, evaluating that many positions is not necessary for backtracking or playing chess. If that’s the base of your “impossible” rhetoric then it’s a poor one.