For dumb subsystems, yes. But the picture changes when one of the subsystems is general intelligence. Putting an LLM in charge of controlling a robot seems like it should be hard, since robotics is always hard… and yet there has been a rash of recent successes as LLMs have become just barely general enough to do a decent job of it.
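To make the pattern I have in mind concrete, here is a minimal sketch of the "LLM as robot controller" setup. The llm_complete() call and the skill names are hypothetical stand-ins, not any particular lab's API; the point is just that the general-purpose reasoning happens in text, and a thin layer maps it onto primitives the robot already has.

```python
# Hedged sketch: the LLM plans in text; a thin parser maps its output onto
# robot primitives. `llm_complete` and the skills below are made-up stand-ins.

def llm_complete(prompt: str) -> str:
    # In reality this would call a language model; a canned plan keeps the
    # sketch self-contained and runnable.
    return "move_to(table)\npick_up(mug)\nmove_to(shelf)\nplace_on(shelf)"

# A small library of low-level skills the robot already knows how to execute.
SKILLS = {
    "pick_up": lambda obj: print(f"picking up {obj}"),
    "move_to": lambda loc: print(f"moving to {loc}"),
    "place_on": lambda surf: print(f"placing on {surf}"),
}

def run_task(task: str) -> None:
    prompt = (
        "You control a robot with these skills: pick_up(x), move_to(x), place_on(x).\n"
        f"Task: {task}\n"
        "Respond with one skill call per line."
    )
    plan = llm_complete(prompt)
    for line in plan.strip().splitlines():
        name, _, arg = line.partition("(")
        skill = SKILLS.get(name.strip())
        if skill is None:
            continue  # ignore anything that isn't a known primitive
        skill(arg.rstrip(")").strip())

run_task("put the mug on the shelf")
```

The design choice doing the work here is that the hard "general intelligence" part is outsourced to the model, while the robot-specific part shrinks to a small, dumb dispatcher.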
So my prediction is that as we make smarter and more generally capable models, a lot of the other specific barriers (such as embodiment, or emulated keyboard/mouse use) will fall away faster than you'd predict from past trends.
So then the question is: how much difficulty will there be in hooking up the subsystems of the general intelligence module, i.e. memory, recursive reasoning, multi-modal sensory input handling, and so on? A couple of years ago I was arguing with people that the jump from language-only to multi-modal would be quick, and that soon after one group did it, many others would follow suit and it would become a new standard. This was met with skepticism at the time: people argued it would take longer and be harder than I was predicting, and that we should expect the change to arrive further in the future (e.g. more than 5 years out) and to happen gradually. Now vision+language is common in the frontier models.
So yeah, it’s hard to do such things, but like… it’s a challenge I expect teams of brilliant engineers with big research budgets to conquer. Not hard in the sense that I expect them to try their best, fail, and be completely blocked for many years, leading to a general halt of progress across all existing teams.
For what it’s worth, though I can’t point to specific predictions, I was not at all surprised by multi-modality. It’s still a token prediction problem; there are no fundamental theoretical differences. I do think modestly more new insights are necessary for these other problems.
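To gesture at why multi-modality didn’t require a theoretical leap, here is a minimal sketch (module choices and shapes are my own, not any particular frontier model’s recipe): image patches get projected into the same embedding space as text tokens, and one decoder does ordinary next-token prediction over the joint sequence.

```python
# Hedged sketch: image patches become embeddings in the same sequence as text
# tokens, and a single causal transformer does ordinary next-token prediction.
import torch
import torch.nn as nn

VOCAB, D_MODEL, PATCH_DIM = 32_000, 512, 3 * 16 * 16  # 16x16 RGB patches

text_embed = nn.Embedding(VOCAB, D_MODEL)       # text tokens -> vectors
patch_embed = nn.Linear(PATCH_DIM, D_MODEL)     # flattened patches -> vectors
decoder = nn.TransformerEncoder(                # causal mask makes it a decoder
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
    num_layers=4,
)
lm_head = nn.Linear(D_MODEL, VOCAB)             # predict the next text token

text_ids = torch.randint(0, VOCAB, (1, 20))     # 20 text tokens
patches = torch.randn(1, 64, PATCH_DIM)         # 64 image patches

# One sequence: image tokens first, then text tokens.
seq = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)
mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
logits = lm_head(decoder(seq, mask=mask))       # shape: (1, 84, VOCAB)

# Training is just cross-entropy against the next token, same as text-only.
```

The sketch leaves out everything that makes real systems good (tokenizer details, pretraining data, scale), but it shows why adding a modality is an engineering extension of the same objective rather than a new paradigm.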