I also find it odd that Bio Anchors does not talk much about data requirements, and I’m glad you pointed that out.
Thus, to get timelines, we’d also need to estimate what dataset/environments are necessary for training AGI. But I’m not sure we know what these datasets/environments look like.
I suspect this could be easier to answer than we think. After all, if you consider a typical human, they only have a certain number of skills, and they only have a certain number of experiences. The skills and experiences may be numerous, but they are finite. If we can enumerate and analyze all of them, we may be able to get a lot of insight into what is “necessary for training AGI”.
If I were to try to come up with an estimate, here is one way I might approach it:
What are all the tasks that a typical human (from a given background) can do?
This could be a very long list, so it might make sense to enumerate the tasks/skills at only a fairly high level at first.
For each task, why are humans able to do it? What experiences have humans learned from, such that they are able to do the task? What is the minimal set of experiences, such that if a human were not able to experience and learn from them, they would not be able to do the task?
The developmental psychology literature could be very helpful here.
For each task that humans can do, what is currently preventing AI systems from learning to do the task?
Maybe AI systems aren’t yet being trained with all the experiences that humans rely on for the task.
Maybe all the relevant experiences are already available for use in training, but our current model architectures and training paradigms aren’t good enough.
Though I suspect that once people know exactly what training data humans require for a skill, it won’t be too hard to come up with a working architecture.
Maybe all the relevant experiences are available, and there is an architecture that is highly likely to work, but we just don’t yet have the resources to collect enough data or train a sufficiently high-capacity model.
A couple more thoughts on “what dataset/environments are necessary for training AGI”:
In your subfield of NLP, even if evaluation is difficult and NLP practitioners find that they need to develop a bunch of application-specific evaluation methods, multi-task training may still yield a model that performs at a human level on most tasks.
Moving beyond NLP, it might turn out that most interesting tasks can be learned from a very simple, easy-to-collect dataset format. For example, it might be the case that if you train a model on a large enough subset of narrated videos from YouTube, the model can learn how to make a robot perform any given task in simulation, given natural language instructions. Techniques like LORL are a very small-scale version of this, and LORL-like techniques might turn out to be easy to scale up, since LORL only requires imperfect YouTube-like data (imperfect demonstrations + natural language supervision).
Daniel points out that humans don’t need that much data, and I would point out that AI might not either! We haven’t really tried. There’s no AI system today that’s actually been trained with a human-equivalent set of experiences. Maybe once we actually try, it will turn out to be easy. I think that’s a real possibility.