In general I’m not that interested in these sorts of “generic agents” that do all the things with one neural net, and I don’t think they affect the relevant timelines very much. It seems like it will be far more economically useful to have separate neural nets doing each of the things and using each other as tools to accomplish particular tasks, so that’s what I expect to see.
Aren’t you worried about agents that can leverage, through a robotic controller, the extremely complex knowledge of the world (like Flamingo has) that they gained via text, picture, video, etc. inputs? Think of an RL agent that can learn to play Montezuma’s Revenge extremely quickly because it consumed so much internet data that it knows what a “key” and a “rope” are, and that these in-game objects are analogous to the images it saw in pretraining. Something like that receiving a malicious command in real life, on a physical robot, seems terrifying: it would be able to form extremely complex plans to achieve a malicious goal, given its environment. And at least from what I can tell from the Gato paper, the only missing ingredient at this point might be “more parameters/TPUs”.
I agree that will happen eventually, and the more nuanced version of my position is the one I outlined in my comment on CAIS:
Now I would say that there is some level of data, model capacity, and compute at which an end-to-end / monolithic approach outperforms a structured approach on the training distribution (this is related to, but not the same as, the bitter lesson). However, at low levels of these three, the structured approach will typically perform better. The required levels at which the end-to-end approach works better depend on the particular task, and increase with task difficulty.
Since we expect all three of these factors to grow over time, I then expect an expanding Pareto frontier: at any given point the most complex tasks are performed by structured approaches, but as time progresses these are replaced by end-to-end / monolithic systems (while at the same time new, even more complex tasks are found that can be done in a structured way).
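To make that picture concrete, here is a toy sketch of the claim. It is entirely my own illustration: the `best_approach` threshold rule and all of the numbers are made up, and only exist to show the shape of an expanding frontier.

```python
# Toy illustration only (not from the CAIS comment): assume each task has a
# "difficulty" d, and the end-to-end approach only overtakes the structured
# approach once total resources R (data, model capacity, and compute lumped
# into one number) exceed a threshold that grows with d.

def best_approach(task_difficulty: float, resources: float) -> str:
    """Return which approach wins under the assumed threshold model."""
    crossover = 10 * task_difficulty  # assumed: harder tasks need more resources
    return "end-to-end" if resources >= crossover else "structured"

# As resources grow over time, the hardest task best done end-to-end keeps
# moving outward, while even harder tasks are still done in a structured way.
for year, resources in [(2022, 50), (2026, 200), (2030, 800)]:
    frontier = max(
        (d for d in range(1, 200) if best_approach(d, resources) == "end-to-end"),
        default=0,
    )
    print(year, "end-to-end handles tasks up to difficulty", frontier)
```

The real thresholds would of course depend on the particular task and on how data, capacity, and compute trade off against each other; the sketch only illustrates the qualitative shape of the claim.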
I think when we are first in the situation where AI systems are sufficiently competent to wrest control away from humanity if they wanted to, we would plausibly have robots that take in audiovisual input and can flexibly perform tasks that a human says to them (think of e.g. a household robot butler). So in that sense I agree that eventually we’ll have agents that link together language, vision, and robotics.
The thing I’m not that interested in (from a “how scared should we be” or “timelines” perspective) is when you take a bunch of different tasks, shove them into a single “generic agent”, and the resulting agent is worse on most of the tasks and isn’t correspondingly better at some new task that none of the previous systems could do.
So if for example you could draw an arrow on an image showing what you wanted a robot to do, and the robot then did that, that would be a novel capability that couldn’t be done by previous specialized systems (probably), and I’d be interested in that. It doesn’t look like this agent does that.
Does that mean the Socratic Models result from a few weeks ago, which does involve connecting more specialised models together, is a better example of progress?
Yes