Caveating that I did a lot of skimming on both Bio Anchors and Eliezer’s response, the part of Bio Anchors that seemed weakest to me was this:
To be maximally precise, we would need to adjust this probability downward by some amount to account for the possibility that other key resources such as datasets and environments are not available by the time the computation is available
I think the existence of proper datasets/environments is a huge issue for current ML approaches, and you have to assign some nontrivial weight to it being a much bigger bottleneck than computational resources. Like, we’re lucky that GPT-3 is trained with the LM objective (predict the next word) for which there is a lot of naturally-occurring training data (written text). Lucky, because that puts us in a position where there’s something obvious to do with additional compute. But if we hit a limit following that approach (and I think it’s plausible that the signal is too weak in otherwise-unlabeled text) then we’re rather stuck. Thus, to get timelines, we’d also need to estimate what dataset/environments are necessary for training AGI. But I’m not sure we know what these datasets/environments look like. An upper bound is “the complete history of earth since life emerged”, or something… not sure we know any better.
I think parts of Eliezer’s response intersects with this concern, e.g. the energy use analogy. It is the same sort of question, how well do we know what the missing ingredients are? Do we know that compute doesn’t occupy enough of the surface area of possible bottlenecks for a compute-based analysis to be worth much? And I’m specifically suggesting that environments/datasets occupy enough of that surface area to seriously undermine the analysis.
Does Bio Anchors deal with this concern beyond the brief mention above (and I missed it, very possible)? Or are there other arguments out there that suggest compute really is all that’s missing?
I occasionally hear people make this point but it really seems wrong to me, so I’d like to hear more! Here are the reasons it seems wrong to me:
1. Data generally seems to be the sort of thing you can get more of by throwing money at it. It’s not universally true, but it’s true in most cases, and it only needs to be true for at least one transformative or dangerous task. Moreover, investment in AI is increasing; when a tech company is spending $10,000,000,000 on compute for a single AI training run, it can spend 10% as much money to hire 2,000 trained professionals at $100k/yr salaries for 5 years to build training environments and collect/curate data. Not to mention, you are probably doing multiple big AI training runs, and you are probably a big tech company that has already set up a huge workforce and data-gathering pipeline with loads of products and so on.
2. Bitter lesson etc. etc. Most people who I trust on this topic seem to think compute is the main bottleneck, not data.
3. Humans don’t need much data. One way we could get transformative or dangerous AI is by creating “human-level AGI,” and there are plausible versions of that scenario that involve creating something that needs as little data as humans, or even less. And while it’s possible that the only realistic way to create such a thing is via some giant evolution-like training process that itself requires lots of data… I wouldn’t bet on it.
Q. Can you give an example of what progress in data/environments looks like? What progress has been made in this metric over the past 10 years? It’ll help me get a sense of what you count as data/environment and what you count as “algorithms.” The Bio Anchors report already tracks progress in algorithms.
Final point: I totally agree we should assign nontrivial weight to data being a huge bottleneck, one that won’t be overcome by decades and billions of dollars in spending. But everyone agrees on this already; for example, Ajeya assigns 10% of her credence to “basically no amount of compute would be enough with 2020’s ideas, not even 10^70 flops.” How could recapitulating evolution a billion times over not be enough??? Well, two reasons: maybe there’s some special-sauce architecture that can’t be found by brute evolutionary search, and maybe data/training environments are an incredibly strong bottleneck. I think the line between these two is blurry. Anyhow, the point is, on my interpretation of Ajeya, her framework already assigns something like 5% credence to data/environments being a strong bottleneck. If you think it should be more than that, no problem! Just go into her spreadsheet and make it 25% or whatever you think it is.
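To make the “just change the weight in the spreadsheet” suggestion concrete, here is a toy sketch of how a mixture-of-hypotheses forecast shifts when you move credence onto a “data is a hard blocker” bucket. All the numbers here are invented for illustration; they are not Ajeya’s actual parameters:

```python
def p_tai_by_year(weights, p_by_hypothesis):
    """P(transformative AI by some year) under a mixture of hypotheses:
    sum over hypotheses of weight * P(TAI by year | hypothesis)."""
    assert abs(sum(weights.values()) - 1) < 1e-9, "weights must sum to 1"
    return sum(w * p_by_hypothesis[h] for h, w in weights.items())

# Invented illustrative numbers, not from the actual Bio Anchors spreadsheet:
# under the "data is a hard blocker" hypothesis, no amount of compute helps.
p_by_2050 = {"compute_anchor": 0.8, "data_blocked": 0.0}

baseline = p_tai_by_year({"compute_anchor": 0.9, "data_blocked": 0.1}, p_by_2050)
skeptical = p_tai_by_year({"compute_anchor": 0.75, "data_blocked": 0.25}, p_by_2050)
print(round(baseline, 2), round(skeptical, 2))  # 0.72 0.6
```

Moving 15 points of credence onto the blocker bucket just rescales the forecast by the surviving weight; the substantive modelling work is in what P(TAI | hypothesis) curves you feed in, which is where the report’s machinery lives.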
Yes, good questions, but I think there are convincing answers. Here’s a shot:
1. Some kinds of data can be created this way, like parallel corpora for translation or video annotated with text. But I think it’s selection bias that it seems like most cases are like this. Most of the cases we’re familiar with seem like this because this is what’s easy to do! But transformative tasks are hard, and creating data that really contains latent in it the general structures necessary for task performance, that is also hard. I’m not saying research can’t solve it, but that if you want to estimate a timeline, you can’t consign this part of the puzzle to a footnote of the form “lots of research resources will solve it”. Or, if you do, you might as well relax the whole project and bring only that level of precision across the board.
2. At least in NLP (the AI subfield with which I’m most familiar), my sense of the field’s zeitgeist is quite contrary to “compute is the issue”. I think there’s a large, maybe majority current of thought that our current benchmarks are crap, that performance on them doesn’t relate to any interesting real-world task, that optimizing on them is of really unclear value, and that the field as a whole is unfortunately rudderless right now. I think this current holds true among many young DL researchers, not just the Gary Marcuses of the world. That’s not a formal survey or anything, just my sense from reading NLP papers and twitter. But similarly, I think the notion that compute is the bottleneck is overrepresented in the LessWrong sphere, vs. what others think.
3. Humans not needing much data is misleading IMO because the human brain comes highly optimized out of the box at birth, and indeed that’s the result of a big evolutionary process. To be clear, I agree achieving human-level AI is enough to count as transformative and may well be a second-long epoch on the way to much more powerful AI. But anyway, you have basically the same question to answer there. Namely, I’d still object that Bio Anchors doesn’t address the datasets/environments issue regarding making even just human-level AI. Changing the scope to “merely” human doesn’t answer the objection.
Q/A. As for recent progress: no, I think there has been very little! I’m only really familiar with NLP, so there might be more in RL environments. (My very vague sense of RL is that it’s still just “video games you can put an agent in,” and basically always has been, but don’t take it from me.) As for NLP, there is basically nothing new in the last 10 years. We have lots of unlabeled text for language models, we have parallel corpora for translation, and we have labeled datasets for things like question answering (see here for a larger list of supervised tasks). I think it’s really unclear whether any of these have latent in them the structures necessary for general language understanding. GPT is the biggest glimmer of hope recently, but part of the problem even there is that we can’t really quantify how close it is to general language understanding. We don’t have a good way of measuring this! Without one, we certainly can’t train for it, as we can’t compute a loss function. I think there are maybe some arguments that, in the limit, unlabeled text with the LM objective is enough: but that limit might really be more text than can fit on earth, and we’d need to get a handle on that for any estimates.
Final point: I’m more looking for a qualitative acknowledgement that this problem of datasets/environments is hard and unsolved (or arguments for why it isn’t), is as important as compute, and, building on that, serious attention paid to an analysis of what it would take to make the right datasets/environments. Rather than consign it to an “everything else” parameter, analyze what it might take to make better datasets/environments, including trying to get a handle on whether we even know how. I think this would make for a much better analysis, and would address some of Eliezer’s concerns because it would cover more of the specific, mechanistic story about the path to creating transformative AI.
(Full disclosure: I’ve personally done work on making better NLP benchmarks, which I guess has given me an appreciation for how hard and unsolved this problem feels. So, discount appropriately.)
I agree it would be good to extend the bio anchors framework to include more explicit modelling of data requirements and the like instead of just having it be implicit in the existing variables. I’m generally a fan of making more detailed, realistic models and this seems reasonably high on the priority list of extensions to make. I’d also want to extend the model to include multiple different kinds of transformative task and dangerous task, and perhaps to include interaction effects between them (e.g. once we get R&D acceleration then everything else may come soon...) and perhaps to make willingness-to-spend not increase monotonically but rather follow a boom and bust cycle with a big boom probably preceding the final takeoff.
I still don’t think the problem of datasets/environments is hard and unsolved, but I would like to learn more.
What’s so bad about text prediction / entropy (GPT-3’s main metric) as a metric for NLP? I’ve heard good things about it, e.g. this.
Re: Humans not needing much data: They are still proof of concept that you don’t need much data (and/or not much special, hard-to-acquire data, just ordinary life experience, access to books and conversations with people, etc.) to be really generally intelligent. Maybe we’ll figure out how to make AIs like this too. MAYBE the only way to do this is to recapitulate a costly evolutionary process that itself is really tricky to do and requires lots of data that’s hard to get… but I highly doubt it. For example, we might study how the brain works and copy evolution’s homework, so to speak. Or it may turn out that most of the complexity and optimization of the brain just isn’t necessary. To your “Humans not needing much data is misleading IMO because the human brain comes highly optimized out of the box at birth, and indeed that’s the result of a big evolutionary process” I reply with this post.
I still don’t think we have good reason to think ALL transformative tasks or dangerous tasks will require data that’s hard to get. You say this stuff is hard, but all we know is that it’s too hard for the amount of effort that’s been expended so far. For all we know it’s not actually that much harder than that.
If there hasn’t been much progress in the last ten years on the data problem, then… do you also expect there to be not much progress in the next ten years either? And do you thus have, like, 50-year timelines?
Re: humans/brains, I think what humans are a proof of concept of is that, if you start with an infant brain, and expose it to an ordinary life experience (a la training / fine-tuning), then you can get general intelligence. But I think this just doesn’t bear on the topic of Bio Anchors, because Bio Anchors doesn’t presume we have a brain, it presumes we have transformers. And transformers don’t know what to do with a lifetime of experience, at least nowhere near as well as an infant brain does. I agree we might learn more about AI from examining humans! But that’s leaving the Bio Anchors framing of “we just need compute” and getting into the framing of algorithmic improvements etc. I don’t disagree broadly that some approaches to AI might not have as big a (pre-)training phase as current models do, if, for instance, they figure out a way to “start with” infant brains. But I don’t see the connection to the Bio Anchors framing.
What’s so bad about perplexity? I’m not saying perplexity is bad per se, just that it’s unclear how much data you need, with perplexity as your objective, to achieve general-purpose language facility. It’s unclear both because the connection between perplexity and extrinsic linguistic tasks is unclear, and because we don’t have great ways of measuring extrinsic linguistic tasks. For instance, the essay you cite itself cites two very small experiments showing correlation between perplexity and extrinsic tasks. One of them is a regression on 8 data points, the other has 24 data points. So I just wouldn’t put too much stake in extrapolations there. Furthermore, and this isn’t against perplexity, but I’d be skeptical of the other variable, i.e. the linguistic task perplexity is regressed against: in both cases, a vague human judgement of whether model output is “human-like”. I think there’s not much reason to think that is correlated to some general-purpose language facility. Attempts like this to (roughly speaking) operationalize the Turing test have generally been disappointing; getting humans to say “sure, that sounds human” seems to leave a lot to be desired; I think most researchers find them to be disappointingly game-able, vague a critique though that may be. The Google Meena paper acknowledges this, and reading between the lines I get the sense they don’t think too much of their extrinsic, human-evaluation metric either. E.g., the best they say is, “[inter-annotator] agreement is reasonable considering the questions are subjective and the final results are always aggregated labels”.
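For concreteness about the metric under discussion: perplexity is just the exponentiated average negative log-likelihood the model assigns to held-out tokens, so lower is better; nothing in the number itself says anything about extrinsic tasks. A minimal sketch, with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-likelihood the model
    assigned to each observed token (lower is better)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities a model assigned to a short passage:
print(round(perplexity([0.25, 0.5, 0.1, 0.8]), 3))  # 3.162
```

Whether the corpus contains enough signal to learn, say, physics never shows up in this number; that is the gap the small regressions above are trying to bridge.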
This is sort of my point in a nutshell: we have put very little effort into telling whether the datasets we have contain adequate signal to learn the functions we want to learn, in part because we aren’t even sure how to evaluate those functions. It’s not surprising that perplexity correlates with extrinsic tasks to a degree. For instance, it’s pretty clear that, to get state-of-the-art low perplexity on existing corpora, transformers can learn the latent rules of grammar, and, naturally doing so correlates with better human judgements of model output. So, grammar is latent in the corpora. But is physics latent in the corpora? It would improve a model’s perplexity at least a bit to learn physics: some of these corpora contain physics textbooks with answers to exercises way at the back of the books, so to predict the answers at the back you would have to be able to learn how to do the exercises. But it’s unclear whether current corpora contain enough signal to do that. Would we even know how to tell if the model was or wasn’t learning physics? I’m personally skeptical that it’s happening at all, but I admit that’s just based in my subjective assessment of GPT-3 output… again, part of the problem of not having a good way to judge performance outside of perplexity.
As for why all transformative tasks might have hard-to-get-data… well this is certainly speculative, but people sometimes talk about AI-complete tasks, analogizing to the concept of completeness for complexity classes in complexity theory (e.g., NP-complete). I think that’s the relevant idea here. The goal being general intelligence, I think it’s plausible that most (all? I don’t know) transformative tasks are reducible to each other. And I think you also get a hint of this in NLP tasks, where they are weirdly reducible to each other, given the amazing flexibility of language. Like, for a dumb example, the task of question answering entails the task of translation, because you can ask, “How do you say [passage] in French?” So I think the sheer number of tasks, as initially categorized by humans, can be misleading. Tasks aren’t as independent as they may appear. Anyway, that’s far from a tight argument, but hopefully it provides some intuition.
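The question-answering-to-translation reduction can be written down literally: wrap the translation request as a question and hand it to a general QA system. A toy sketch, with a canned lookup standing in for the hypothetical QA system:

```python
def answer_question(question):
    """Stand-in for a general question-answering system; here just a
    canned lookup so the example runs."""
    canned = {
        'How do you say "good morning" in French?': "bonjour",
    }
    return canned.get(question, "unknown")

def translate_to_french(passage):
    # Translation reduced to question answering, as in the example above:
    # the "translation task" is just a particular family of questions.
    return answer_question(f'How do you say "{passage}" in French?')

print(translate_to_french("good morning"))  # bonjour
```

The wrapper itself is trivial; that is the point. If a system really answers arbitrary questions, task boundaries like “translation” stop being independent lines of difficulty.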
Honestly I haven’t thought about how to incorporate the dataset bottleneck into a timeline. But, I suppose, I could wind up with even longer timelines if I think that we haven’t made progress because we don’t have the faintest idea how and the lack of progress isn’t for lack of trying. Missing fundamental ideas, etc. How do you forecast when a line of zero slope eventually turns up? If I really think we have shown ourselves to be stumped (not sure), I guess I’d have to fall back on general-purpose tools for forecasting big breakthroughs, and that’s the sort of vague modeling that Bio Anchors seems to be trying to avoid.
BioAnchors is poorly named; the part you are critiquing should be called GPT-3_Anchors.
A better actual BioAnchor would be based on trying to model/predict how key params like data efficiency and energy efficiency are improving over time, and when they will match/surpass the brain.
GPT-3 could also obviously be improved for example by multi-modal training, active learning, curriculum learning, etc. It’s not like it even represents the best of what’s possible for a serious AGI attempt today.
Fwiw, I think nostalgebraist’s recent post hit on some of the same things I was trying to get at, especially around not having adequate testing to know how smart the systems are getting—see the section on what he calls (non-)ecological evaluation.
With apologies for the belated response: I think greghb makes a lot of good points here, and I agree with him on most of the specific disagreements with Daniel. In particular:
I agree that “Bio Anchors doesn’t presume we have a brain, it presumes we have transformers. And transformers don’t know what to do with a lifetime of experience, at least nowhere near as well as an infant brain does.” My guess is that we should not expect human-like sample efficiency from a simple randomly initialized network; instead, we should expect to extensively train a network to the point where it can do this human-like learning. (That said, this is far from obvious, and some AI scientists take the opposite point of view.)
I’m not super sympathetic to Daniel’s implied position that there are lots of possible transformative tasks and we “only need one” of them. I think there’s something to this (in particular, we don’t need to replicate everything humans can do), but I think once we start claiming that there are 5+ independent tasks such that automating them would be transformative, we have to ask ourselves why transformative events are as historically rare as they are. (More at my discussion of persuasion on another thread.)
Overall, I think that datasets/environments are plausible as a major blocker to transformative AI, and I think Bio Anchors would be a lot stronger if it had more to say about this.
I am sympathetic to Bio Anchors’s bottom-line quantitative estimates despite this, though (and to be clear, I held all of these positions at the time Bio Anchors was drafted). It’s not easy for me to explain all of where I’m coming from, but a few intuitions:
We’re still in a regime where compute is an important bottleneck to AI development, and funding and interest are going up. If we get into a regime where compute is plentiful and data/environments are the big blocker, I expect efforts to become heavily focused there.
Combining the first two points leaves me guessing that “if there’s a not-prohibitively-difficult way to do this, people are going to find it on the time frames indicated.” And I think there probably is:
The Internet contains a truly massive amount of information at this point about many different dimensions of the human world. I expect this information source to keep growing, especially as AI advances and interacts more productively and richly with humans, and as AI can potentially be used as an increasingly large part of the process of finding data, cleaning it, etc.
AI developers will also—especially as funding and interest grow—have the ability to collect data by (a) monitoring researchers, contractors, volunteers, etc.; (b) designing products with data collection in mind (e.g., Assistant and Siri).
The above two points seem especially strong to me when considering that automating science and engineering might be sufficient for transformative AI—these seem particularly conducive to learning from digitally captured information.
On a totally separate note, it seems to me that fairly simple ingredients have made the historical human “environment” sufficiently sophisticated to train transformative capabilities. It seems to me that most of what’s “interesting and challenging” about our environment comes from competing with each other, and I’d guess it’s possible to set up some sort of natural-selection-driven environment in which AIs compete with each other; I wouldn’t expect such a thing to be highly sensitive to whether we’re able to capture all of the details of e.g. how to get food that occurred in our past. (I would expect it to be compute-intensive, though.)
Hopefully that gives a sense of where I’m coming from. Overall, I think this is one of the most compelling objections to Bio Anchors; I find it stronger than the points Eliezer focuses on above (unless you are pretty determined to steel-man any argument along the lines of “Brains and AIs are different” into a specific argument about the most important difference).
I also find it odd that Bio Anchors does not talk much about data requirements, and I’m glad you pointed that out.
Thus, to get timelines, we’d also need to estimate what dataset/environments are necessary for training AGI. But I’m not sure we know what these datasets/environments look like.
I suspect this could be easier to answer than we think. After all, if you consider a typical human, they only have a certain number of skills, and they only have a certain number of experiences. The skills and experiences may be numerous, but they are finite. If we can enumerate and analyze all of them, we may be able to get a lot of insight into what is “necessary for training AGI”.
If I were to try to come up with an estimate, here is one way I might approach it:
What are all the tasks that a typical human (from a given background) can do?
This could be a very long list, so it might make sense to enumerate the tasks/skills at only a fairly high level at first
For each task, why are humans able to do it? What experiences have humans learned from, such that they are able to do the task? What is the minimal set of experiences, such that if a human was not able to experience and learn from them, they would not be able to do the task?
The developmental psychology literature could be very helpful here
For each task that humans can do, what is currently preventing AI systems from learning to do the task?
Maybe AI systems aren’t yet being trained with all the experiences that humans rely on for the task.
Maybe all the relevant experiences are already available for use in training, but our current model architectures and training paradigms aren’t good enough
Though I suspect that once people know exactly what training data humans require for a skill, it won’t be too hard to come up with a working architecture
Maybe all the relevant experiences are available, and there is an architecture that is highly likely to work, but we just don’t yet have the resources to collect enough data or train a sufficiently high-capacity model
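The enumeration sketched above could be organized as a simple table mapping each skill to its hypothesized minimal experiences and its current blocker. A sketch of the shape such an analysis might take; every entry is a placeholder, not a real claim:

```python
# Placeholder analysis: skill -> (experiences hypothesized to be minimally
# required to learn it, current blocker for AI on that skill).
# Every entry here is illustrative, not a real claim.
skills = {
    "reading comprehension": (["written text exposure", "conversation"], "architecture"),
    "cooking a meal": (["watching demonstrations", "trial and error"], "missing experiences"),
    "algebra": (["instruction", "worked examples", "practice"], "resources"),
}

def skills_blocked_by(blocker):
    """List the skills whose hypothesized blocker matches the given one."""
    return sorted(skill for skill, (_, b) in skills.items() if b == blocker)

print(skills_blocked_by("missing experiences"))  # ['cooking a meal']
```

Filling in a real version of this table, presumably with help from the developmental psychology literature, is the hard part; but even a rough one would show how much of the remaining gap is data versus architecture versus resources.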
A couple more thoughts on “what dataset/environments are necessary for training AGI”:
In your subfield of NLP, even if evaluation is difficult and NLP practitioners find that they need to develop a bunch of application-specific evaluation methods, multi-task training may still yield a model that performs at a human level on most tasks.
Moving beyond NLP, it might turn out that most interesting tasks can be learned from a very simple and easy-to-collect format of dataset. For example, it might be the case that if you train a model on a large enough subset of narrated videos from YouTube, the model can learn how to make a robot perform any given task in simulation, given natural language instructions. Techniques like LORL are a very small-scale version of this, and LORL-like techniques might turn out to be easy to scale up, since LORL only requires imperfect YouTube-like data (imperfect demonstrations + natural language supervision).
Daniel points out that humans don’t need that much data, and I would point out that AI might not either! We haven’t really tried. There’s no AI system today that’s actually been trained with a human-equivalent set of experiences. Maybe once we actually try, it will turn out to be easy. I think that’s a real possibility.
Caveating that I did a lot of skimming on both Bio Anchors and Eliezer’s response, the part of Bio Anchors that seemed weakest to me was this:
I think the existence of proper datasets/environments is a huge issue for current ML approaches, and you have to assign some nontrivial weight to it being a much bigger bottleneck than computational resources. Like, we’re lucky that GPT-3 is trained with the LM objective (predict the next word) for which there is a lot of naturally-occurring training data (written text). Lucky, because that puts us in a position where there’s something obvious to do with additional compute. But if we hit a limit following that approach (and I think it’s plausible that the signal is too weak in otherwise-unlabeled text) then we’re rather stuck. Thus, to get timelines, we’d also need to estimate what dataset/environments are necessary for training AGI. But I’m not sure we know what these datasets/environments look like. An upper bound is “the complete history of earth since life emerged”, or something… not sure we know any better.
I think parts of Eliezer’s response intersects with this concern, e.g. the energy use analogy. It is the same sort of question, how well do we know what the missing ingredients are? Do we know that compute doesn’t occupy enough of the surface area of possible bottlenecks for a compute-based analysis to be worth much? And I’m specifically suggesting that environments/datasets occupy enough of that surface area to seriously undermine the analysis.
Does Bio Anchors deal with this concern beyond the brief mention above (and I missed it, very possible)? Or are there other arguments out there that suggest compute really is all that’s missing?
I occasionally hear people make this point but it really seems wrong to me, so I’d like to hear more! Here are the reasons it seems wrong to me:
1. Data generally seems to be the sort of thing you can get more of by throwing money at. It’s not universally true but it’s true in most cases, and it only needs to be true for at least one transformative or dangerous task. Moreover, investment in AI is increasing; when a tech company is spending $10,000,000,000 on compute for a single AI training run, they can spend 10% as much money to hire 2,000 trained professionals and pay them $100k/yr salaries for 5 years to build training environments and collect/curate data. Not to mention, you are probably doing multiple big AI training runs and you are probably a big tech company that has already set up a huge workforce and data-gathering pipeline with loads of products and stuff.
2. Bitter lesson etc. etc. Most people who I trust on this topic seem to think compute is the main bottleneck, not data.
3. Humans don’t need much data; one way we could get transformative or dangerous AI is if we get “human-level AGI,” there are plausible versions of this scenario that involve creating something that needs as little or even less data than humans, and while it’s possible that the only realistic way to create such a thing is via some giant evolution-like training process that itself requires lots of data… I wouldn’t bet on it.
Q. Can you give an example of what progress in data/environments looks like? What progress has been made in this metric over the past 10 years? It’ll help me get a sense of what you count as data/environment and what you count as “algorithms.” The Bio Anchors report already tracks progress in algorithms.
Final point: I totally agree we should assign nontrivial weight to data being a huge bottleneck, one that won’t be overcome by decades and billions of dollars in spending. But everyone agrees on this already; for example, Ajeya assigns 10% of her credence to “basically no amount of compute would be enough with 2020′s ideas, not even 10^70 flops.” How could recapitulating evolution a billion times over not be enough??? Well, two reasons: Maybe there’s some special sauce architecture that can’t be found by brute evolutionary search, and maybe data/training environments is an incredibly strong bottleneck. I think the line between these two is blurry. Anyhow, the point is, on my interpretation of Ajeya her framework already assigns something like 5% credence to data/environments being a strong bottleneck. If you think it should be more than that, no problem! Just go into her spreadsheet and make it 25% or whatever you think it is.
Yes, good questions, but I think there are convincing answers. Here’s a shot:
1. Some kinds of data can be created this way, like parallel corpora for translation or video annotated with text. But I think it’s selection bias that it seems like most cases are like this. Most of the cases we’re familiar with seem like this because this is what’s easy to do! But transformative tasks are hard, and creating data that really contains latent in it the general structures necessary for task performance, that is also hard. I’m not saying research can’t solve it, but that if you want to estimate a timeline, you can’t consign this part of the puzzle to a footnote of the form “lots of research resources will solve it”. Or, if you do, you might as well relax the whole project and bring only that level of precision across the board.
2. At least in NLP (the AI subfield with which I’m most familiar), my sense of the field’s zeitgeist is quite contrary to “compute is the issue”. I think there’s a large, maybe majority current of thought that our current benchmarks are crap, that performance on them doesn’t relate to any interesting real-world task, that optimizing on them is of really unclear value, and that the field as a whole is unfortunately rudderless right now. I think this current holds true among many young DL researchers, not just the Gary Marcuses of the world. That’s not a formal survey or anything, just my sense from reading NLP papers and twitter. But similarly, I think the notion that compute is the bottleneck is overrepresented in the LessWrong sphere, vs. what others think.
3. Humans not needing much data is misleading IMO because the human brain comes highly optimized out of the box at birth, and indeed that’s the result of a big evolutionary process. To be clear, I agree achieving human-level AI is enough to count as transformative and may well be a second-long epoch on the way to much more powerful AI. But anyway, you have basically the same question to answer there. Namely, I’d still object that Bio Anchors doesn’t address the datasets/environments issue regarding making even just human-level AI. Changing the scope to “merely” human doesn’t answer the objection.
Q/A. As for recent progress: no, I think there has been very little! I’m only really familiar with NLP, so there might be more in the RL environments. (My very vague sense of RL is that it’s still just “video games you can put an agent is” and basically always has been, but don’t take it from me.) As for NLP, there is basically nothing new in the last 10 years. We have lots of unlabeled text for language models, we have parallel corpora for translation, and we have labeled datasets for things like question-answering (see here for a larger list of supervised tasks). I think it’s really unclear whether any of these have latent in them the structures necessary for general language understanding. GPT is the biggest glimmer of hope recently, but part of the problem even there is we can’t even really quantify how close it is to general language understanding. We don’t have a good way of measuring this! Without it, we certainly can’t train, as we can’t compute a loss function. I think there are maybe some arguments that, in the limit, unlabeled text with the LM objective is enough: but that limit might really be more text than can fit on earth, and we’d need to get a handle on that for any estimates.
Final point: I’m mostly looking for a qualitative acknowledgement that this problem of datasets/environments is hard and unsolved (or arguments for why it isn’t) and is as important as compute, and, building on that, serious attention paid to analyzing what it would take to make the right datasets/environments. Rather than consigning it to an “everything else” parameter, analyze what it might take to make better datasets/environments, including trying to get a handle on whether we even know how. I think this would make for a much better analysis, and it would address some of Eliezer’s concerns because it would cover more of the specific, mechanistic story about the path to creating transformative AI.
(Full disclosure: I’ve personally done work on making better NLP benchmarks, which I guess has given me an appreciation for how hard and unsolved this problem feels. So, discount appropriately.)
Thanks, nice answers!
I agree it would be good to extend the bio anchors framework to include more explicit modelling of data requirements and the like, instead of just having it be implicit in the existing variables. I’m generally a fan of making more detailed, realistic models, and this seems reasonably high on the priority list of extensions to make. I’d also want to extend the model to include multiple different kinds of transformative and dangerous tasks, and perhaps interaction effects between them (e.g. once we get R&D acceleration, everything else may come soon...). Perhaps willingness-to-spend should also not increase monotonically but rather follow a boom-and-bust cycle, with a big boom probably preceding the final takeoff.
I still don’t think the problem of datasets/environments is hard and unsolved, but I would like to learn more.
What’s so bad about text prediction / entropy (GPT-3’s main metric) as a metric for NLP? I’ve heard good things about it, e.g. this.
Re: Humans not needing much data: They are still a proof of concept that you don’t need much data (and/or not much special, hard-to-acquire data, just ordinary life experience, access to books, conversations with people, etc.) to be really generally intelligent. Maybe we’ll figure out how to make AIs like this too. MAYBE the only way to do this is to recapitulate a costly evolutionary process that is itself really tricky to do and requires lots of data that’s hard to get… but I highly doubt it. For example, we might study how the brain works and copy evolution’s homework, so to speak. Or it may turn out that most of the complexity and optimization of the brain just isn’t necessary. To your “Humans not needing much data is misleading IMO because the human brain comes highly optimized out of the box at birth, and indeed that’s the result of a big evolutionary process” I reply with this post.
I still don’t think we have good reason to think ALL transformative tasks or dangerous tasks will require data that’s hard to get. You say this stuff is hard, but all we know is that it’s too hard for the amount of effort that’s been expended so far. For all we know it’s not actually that much harder than that.
If there hasn’t been much progress in the last ten years on the data problem, then… do you also expect there to be not much progress in the next ten years either? And do you thus have, like, 50-year timelines?
Re: humans/brains, I think what humans are a proof of concept of is that, if you start with an infant brain, and expose it to an ordinary life experience (a la training / fine-tuning), then you can get general intelligence. But I think this just doesn’t bear on the topic of Bio Anchors, because Bio Anchors doesn’t presume we have a brain, it presumes we have transformers. And transformers don’t know what to do with a lifetime of experience, at least nowhere near as well as an infant brain does. I agree we might learn more about AI from examining humans! But that’s leaving the Bio Anchors framing of “we just need compute” and getting into the framing of algorithmic improvements etc. I don’t disagree broadly that some approaches to AI might not have as big a (pre-)training phase as current models do, if, for instance, they figure out a way to “start with” infant brains. But I don’t see the connection to the Bio Anchors framing.
What’s so bad about perplexity? I’m not saying perplexity is bad per se, just that it’s unclear how much data you need, with perplexity as your objective, to achieve general-purpose language facility. It’s unclear both because the connection between perplexity and extrinsic linguistic tasks is unclear, and because we don’t have great ways of measuring extrinsic linguistic tasks. For instance, the essay you cite itself cites two very small experiments showing correlation between perplexity and extrinsic tasks. One of them is a regression on 8 data points, the other has 24 data points. So I just wouldn’t put too much stake in extrapolations there. Furthermore, and this isn’t against perplexity, but I’d be skeptical of the other variable, i.e. the extrinsic task that perplexity is regressed against: in both cases, a vague human judgement of whether model output is “human-like”. I think there’s not much reason to think that is correlated with some general-purpose language facility. Attempts like this to (roughly speaking) operationalize the Turing test have generally been disappointing; getting humans to say “sure, that sounds human” seems to leave a lot to be desired; I think most researchers find them to be disappointingly game-able, vague a critique though that may be. The Google Meena paper acknowledges this, and reading between the lines I get the sense they don’t think too much of their extrinsic, human-evaluation metric either. E.g., the best they say is, “[inter-annotator] agreement is reasonable considering the questions are subjective and the final results are always aggregated labels”.
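For reference, perplexity is just the exponential of the average per-token negative log-likelihood, so “perplexity as a metric” and “the LM training loss” are the same quantity in different units. A minimal sketch (the log-probabilities fed in below are invented for illustration):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).
    A model spreading probability uniformly over V choices scores V."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# E.g. assigning every token probability 1/100 gives perplexity 100,
# regardless of how many tokens are evaluated.
uniform_ppl = perplexity([math.log(1 / 100)] * 50)
```

This is part of why perplexity is so easy to compute and compare across models, while the extrinsic human judgements it gets regressed against are not.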
This is sort of my point in a nutshell: we have put very little effort into telling whether the datasets we have contain adequate signal to learn the functions we want to learn, in part because we aren’t even sure how to evaluate those functions. It’s not surprising that perplexity correlates with extrinsic tasks to a degree. For instance, it’s pretty clear that, to get state-of-the-art low perplexity on existing corpora, transformers can learn the latent rules of grammar, and doing so naturally correlates with better human judgements of model output. So, grammar is latent in the corpora. But is physics latent in the corpora? It would improve a model’s perplexity at least a bit to learn physics: some of these corpora contain physics textbooks with answers to exercises way at the back of the books, so to predict those answers a model would have to learn how to do the exercises. But it’s unclear whether current corpora contain enough signal to do that. Would we even know how to tell if the model was or wasn’t learning physics? I’m personally skeptical that it’s happening at all, but I admit that’s just based in my subjective assessment of GPT-3 output… again, part of the problem of not having a good way to judge performance outside of perplexity.
As for why all transformative tasks might have hard-to-get data… well this is certainly speculative, but people sometimes talk about AI-complete tasks, analogizing to the concept of completeness for complexity classes in complexity theory (e.g., NP-complete). I think that’s the relevant idea here. The goal being general intelligence, I think it’s plausible that most (all? I don’t know) transformative tasks are reducible to each other. And I think you also get a hint of this in NLP tasks, where they are weirdly reducible to each other, given the amazing flexibility of language. Like, for a dumb example, the task of question answering entails the task of translation, because you can ask, “How do you say [passage] in French?” So I think the sheer number of tasks, as initially categorized by humans, can be misleading. Tasks aren’t as independent as they may appear. Anyway, that’s far from a tight argument, but hopefully it provides some intuition.
Honestly I haven’t thought about how to incorporate the dataset bottleneck into a timeline. But, I suppose, I could wind up with even longer timelines if I think that we haven’t made progress because we don’t have the faintest idea how and the lack of progress isn’t for lack of trying. Missing fundamental ideas, etc. How do you forecast when a line of zero slope eventually turns up? If I really think we have shown ourselves to be stumped (not sure), I guess I’d have to fall back on general-purpose tools for forecasting big breakthroughs, and that’s the sort of vague modeling that Bio Anchors seems to be trying to avoid.
BioAnchors is poorly named, the part you are critiquing should be called GPT-3_Anchors.
A better actual BioAnchor would be based on trying to model/predict how key params like data efficiency and energy efficiency are improving over time, and when they will match/surpass the brain.
GPT-3 could also obviously be improved for example by multi-modal training, active learning, curriculum learning, etc. It’s not like it even represents the best of what’s possible for a serious AGI attempt today.
Fwiw, I think nostalgebraist’s recent post hit on some of the same things I was trying to get at, especially around not having adequate testing to know how smart the systems are getting—see the section on what he calls (non-)ecological evaluation.
With apologies for the belated response: I think greghb makes a lot of good points here, and I agree with him on most of the specific disagreements with Daniel. In particular:
I agree that “Bio Anchors doesn’t presume we have a brain, it presumes we have transformers. And transformers don’t know what to do with a lifetime of experience, at least nowhere near as well as an infant brain does.” My guess is that we should not expect human-like sample efficiency from a simple randomly initialized network; instead, we should expect to extensively train a network to the point where it can do this human-like learning. (That said, this is far from obvious, and some AI scientists take the opposite point of view.)
I’m not super sympathetic to Daniel’s implied position that there are lots of possible transformative tasks and we “only need one” of them. I think there’s something to this (in particular, we don’t need to replicate everything humans can do), but I think once we start claiming that there are 5+ independent tasks such that automating them would be transformative, we have to ask ourselves why transformative events are as historically rare as they are. (More at my discussion of persuasion on another thread.)
Overall, I think that datasets/environments are plausible as a major blocker to transformative AI, and I think Bio Anchors would be a lot stronger if it had more to say about this.
I am sympathetic to Bio Anchors’s bottom-line quantitative estimates despite this, though (and to be clear, I held all of these positions at the time Bio Anchors was drafted). It’s not easy for me to explain all of where I’m coming from, but a few intuitions:
We’re still in a regime where compute is an important bottleneck to AI development, and funding and interest are going up. If we get into a regime where compute is plentiful and data/environments are the big blocker, I expect efforts to become heavily focused there.
Several decades is just a very long time. (This relates to the overall burden of proof on arguments like these, particularly the fact that this century is likely to see most of the effort that has gone into transformative AI development to date.)
Combining the first two points leaves me guessing that “if there’s a not-prohibitively-difficult way to do this, people are going to find it on the time frames indicated.” And I think there probably is:
The Internet contains a truly massive amount of information at this point about many different dimensions of the human world. I expect this information source to keep growing, especially as AI advances and interacts more productively and richly with humans, and as AI can potentially be used as an increasingly large part of the process of finding data, cleaning it, etc.
AI developers will also—especially as funding and interest grow—have the ability to collect data by (a) monitoring researchers, contractors, volunteers, etc.; (b) designing products with data collection in mind (e.g., Assistant and Siri).
The above two points seem especially strong to me when considering that automating science and engineering might be sufficient for transformative AI—these seem particularly conducive to learning from digitally captured information.
On a totally separate note, it seems to me that fairly simple ingredients have made the historical human “environment” sufficiently sophisticated to train transformative capabilities. It seems to me that most of what’s “interesting and challenging” about our environment comes from competing with each other, and I’d guess it’s possible to set up some sort of natural-selection-driven environment in which AIs compete with each other; I wouldn’t expect such a thing to be highly sensitive to whether we’re able to capture all of the details of e.g. how to get food that occurred in our past. (I would expect it to be compute-intensive, though.)
Hopefully that gives a sense of where I’m coming from. Overall, I think this is one of the most compelling objections to Bio Anchors; I find it stronger than the points Eliezer focuses on above (unless you are pretty determined to steel-man any argument along the lines of “Brains and AIs are different” into a specific argument about the most important difference.)
I also find it odd that Bio Anchors does not talk much about data requirements, and I’m glad you pointed that out.
I suspect this could be easier to answer than we think. After all, if you consider a typical human, they only have a certain number of skills, and they only have a certain number of experiences. The skills and experiences may be numerous, but they are finite. If we can enumerate and analyze all of them, we may be able to get a lot of insight into what is “necessary for training AGI”.
If I were to try to come up with an estimate, here is one way I might approach it:
What are all the tasks that a typical human (from a given background) can do?
This could be a very long list, so it might make sense to enumerate the tasks/skills at only a fairly high level at first
For each task, why are humans able to do it? What experiences have humans learned from, such that they are able to do the task? What is the minimal set of experiences, such that if a human was not able to experience and learn from them, they would not be able to do the task?
The developmental psychology literature could be very helpful here
For each task that humans can do, what is currently preventing AI systems from learning to do the task?
Maybe AI systems aren’t yet being trained with all the experiences that humans rely on for the task.
Maybe all the relevant experiences are already available for use in training, but our current model architectures and training paradigms aren’t good enough
Though I suspect that once people know exactly what training data humans require for a skill, it won’t be too hard to come up with a working architecture
Maybe all the relevant experiences are available, and there is an architecture that is highly likely to work, but we just don’t yet have the resources to collect enough data or train a sufficiently high-capacity model
A couple more thoughts on “what dataset/environments are necessary for training AGI”:
In your subfield of NLP, even if evaluation is difficult and NLP practitioners find that they need to develop a bunch of application-specific evaluation methods, multi-task training may still yield a model that performs at a human level on most tasks.
Moving beyond NLP, it might turn out that most interesting tasks can be learned from a very simple and easy-to-collect format of dataset. For example, it might be the case that if you train a model on a large enough subset of narrated videos from YouTube, the model can learn how to make a robot perform any given task in simulation, given natural language instructions. Techniques like LORL are a very small-scale version of this, and LORL-like techniques might turn out to be easy to scale up, since LORL only requires imperfect YouTube-like data (imperfect demonstrations + natural language supervision).
Daniel points out that humans don’t need that much data, and I would point out that AI might not either! We haven’t really tried. There’s no AI system today that’s actually been trained with a human-equivalent set of experiences. Maybe once we actually try, it will turn out to be easy. I think that’s a real possibility.