Setting up the architecture that would allow a pretrained LLM to trial and error whatever you want is relatively trivial.
I agree. Or at least, I don’t see any reason why not.
My point was not that “a relatively simple architecture that contains a Transformer as the core” cannot solve problems via trial and error (in fact I think it’s likely such an architecture exists). My point was that transformers alone cannot do so.
You can call it a “gut claim” if that makes you feel better. But the actual reason is that I did some very simple math (about the context window size required, given the quadratic scaling of transformer attention) and concluded that, practically speaking, it was impossible.
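To spell that math out, here is a rough back-of-the-envelope sketch. The specific numbers (roughly one token per square of the board, a Stockfish-like million positions per move) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: a transformer that keeps every evaluated position
# in its context window (illustrative numbers, not measurements).

tokens_per_position = 64          # assume roughly one token per square
positions_per_move = 1_000_000    # Stockfish-like search breadth per move

# Context window needed to hold the whole search trace for one move:
window = tokens_per_position * positions_per_move
print(f"required context window: {window:,} tokens")              # 64,000,000 tokens

# Self-attention cost grows quadratically with sequence length,
# so one forward pass over that window costs on the order of:
attention_ops_per_layer = window ** 2
print(f"attention ops per layer: {attention_ops_per_layer:.2e}")  # ~4.1e15

# For comparison, a typical 128k-token context:
typical_window = 128_000
print(f"quadratic blow-up vs a 128k window: {(window / typical_window) ** 2:,.0f}x")  # 250,000x
```

Even with generous constants, that window is far beyond typical context lengths today, and the quadratic term compounds the gap.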
Also, importantly, we don’t know what that “relatively simple” architecture looks like. If you look at the various efforts to “extend” transformers into general learning machines, there are a bunch of different approaches: AlphaGeometry, diffusion transformers, BabyAGI, Voyager, Dreamer, chain-of-thought, RAG, continuous fine-tuning, V-JEPA. Practically speaking, we have no idea which of these techniques is the “correct” one (if any of them are).
In my opinion, saying “Transformers are AGI” is a bit like saying “Deep learning is AGI”. While it is entirely possible that an architecture that relies heavily on Transformers and constitutes AGI exists, we don’t actually know what that architecture is.
Personally, my bet is either on a sort of generalized AlphaGeometry approach (where the transformer generates hypotheses and then GOFAI is used to evaluate them) or on Diffusion Transformers (where we iteratively de-noise a solution to a problem). But I wouldn’t be at all surprised if, a few years from now, it is universally agreed that some key insight we’re currently missing marks the dividing line between Transformers and AGI.
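To make the first of those bets concrete, here is a minimal sketch of the generate-and-verify loop I have in mind. The functions propose_candidates and symbolic_check are hypothetical stand-ins for a transformer sampler and a symbolic (GOFAI) verifier, not real APIs:

```python
# Minimal sketch of a generate-and-verify loop: a transformer proposes
# candidate solutions and a symbolic (GOFAI) verifier accepts or rejects them.
# propose_candidates and symbolic_check are hypothetical placeholders.
from typing import Callable, Iterable, Optional

def solve(problem: str,
          propose_candidates: Callable[[str, list], Iterable[str]],
          symbolic_check: Callable[[str, str], bool],
          max_rounds: int = 10) -> Optional[str]:
    rejected: list = []                    # failed hypotheses, fed back as context
    for _ in range(max_rounds):
        for candidate in propose_candidates(problem, rejected):
            if symbolic_check(problem, candidate):
                return candidate           # the verifier confirms this hypothesis
            rejected.append(candidate)     # trial and error: remember the failure
    return None                            # give up after max_rounds
```

The point is that the trial-and-error loop and the memory of failed attempts live outside the transformer; the model itself only ever sees a bounded prompt.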
You can call it a “gut claim” if that makes you feel better. But the actual reason is that I did some very simple math (about the context window size required, given the quadratic scaling of transformer attention) and concluded that, practically speaking, it was impossible.
If you’re talking about this:
Now imagine trying to implement a serious backtracking algorithm. Stockfish checks millions of positions per turn of play. The attention window for your “backtracking transformer” is going to have to be at least {size of chess board state}*{number of positions evaluated}.
And because of quadratic attention, training it is going to take on the order of {number of parameters}*({chess board state size}*{number of positions evaluated})^2
then that’s just irrelevant. You don’t need to evaluate millions of positions to backtrack or to play chess (unless you think humans don’t backtrack).
My point was not that “a relatively simple architecture that contains a Transformer as the core” cannot solve problems via trial and error (in fact I think it’s likely such an architecture exists). My point was that transformers alone cannot do so.
There’s nothing the former can do that the latter can’t. “Architecture” is really overselling it, but I couldn’t think of a better word. It’s just function calling.
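To be concrete about what I mean by “just function calling”, here is a minimal sketch of that kind of harness. The model interface (model.generate) and the TOOL line convention are made up for illustration:

```python
# Minimal sketch of the "it's just function calling" harness: the model emits
# a tool call, the outer loop executes it and appends the result, and the loop
# repeats until the model stops asking for tools. The model interface
# (model.generate) and the TOOL line convention are made up for illustration.
import json
from typing import Optional

def parse_tool_call(reply: str) -> Optional[dict]:
    # Toy convention: a tool call is a line like
    #   TOOL {"name": "search", "args": {"query": "..."}}
    for line in reply.splitlines():
        if line.startswith("TOOL "):
            return json.loads(line[len("TOOL "):])
    return None

def run_agent(model, tools: dict, prompt: str, max_steps: int = 20) -> str:
    transcript = prompt
    reply = ""
    for _ in range(max_steps):
        reply = model.generate(transcript)            # one ordinary transformer call
        call = parse_tool_call(reply)
        if call is None:                              # no tool requested: done
            return reply
        result = tools[call["name"]](**call["args"])  # a plain Python function call
        transcript += f"\n{reply}\n[tool result] {result}\n"
    return reply
```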
Humans are not transformers. The “context window” for a human is literally their entire life.
Not really. The majority of your experiences and interactions are forgotten and discarded; the few that aren’t are recalled and triggered by the right input when necessary, not just sitting there in your awareness at all times. Those memories are also modified at every recall.
And that’s really just beside the point. However you want to spin it, evaluating that many positions is not necessary for backtracking or for playing chess. If that’s the basis of your “impossible” rhetoric, then it’s a poor one.
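For illustration, here is the kind of backtracking I mean, which explores only a handful of candidate lines rather than millions. The move generator and evaluator are hypothetical placeholders:

```python
# Toy backtracking search that considers only a handful of candidate moves per
# position rather than millions: it explores a line, and if it runs out of
# depth it scores the position, backtracks, and tries the next candidate.
# candidate_moves, apply_move, and evaluate are hypothetical placeholders.

def search(state, candidate_moves, apply_move, evaluate,
           depth: int = 3, branching: int = 3) -> float:
    if depth == 0:
        return evaluate(state)                          # score from the mover's perspective
    best = float("-inf")
    for move in candidate_moves(state)[:branching]:     # only a few candidates per position
        child = apply_move(state, move)
        score = -search(child, candidate_moves, apply_move, evaluate,
                        depth - 1, branching)           # opponent's best reply, negated
        best = max(best, score)                         # backtrack and try the next move
    return best

# With depth 3 and 3 candidates per position this visits at most 3**3 = 27
# leaf positions, not millions.
```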