My guess is that the issue of sample efficiency comes from equivocation between the datasets used to train a model and the datasets provided externally. What is the sample efficiency of AlphaZero? It’s as bad as anything else if we divide by the data generated by amplification, but it’s infinite if we divide by externally provided data, since there is none. The sample efficiency relevant to the cost of training includes the data generated by amplification, but in informal comparisons with human performance, the estimate is about how much a human observed externally before attaining some level of performance; hence the equivocation.
Similarly, if someone figures out amplification for language models (something like debate, but that actually works), the model can then train on the vastly larger (and better) datasets it generates itself, bootstrapped only from the external dataset, so its sample efficiency with respect to the external dataset will skyrocket. (One issue is that the external dataset is already large, so it’s more about quality than quantity; alternatively, this form of training might be able to bootstrap from a much smaller external dataset.) So the usual measure of sample efficiency doesn’t seem very informative about what’s possible with exactly the same learning algorithm once the amplification loop is closed.
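To make the accounting concrete, here is a minimal sketch of a closed amplification loop, with `train` and `amplify` passed in as hypothetical callables (nothing here names a real API): the only externally provided samples appear in the bootstrap step, while most of the training signal comes from model-generated data.

```python
# Minimal sketch of a closed amplification loop (illustrative assumptions
# only; `train` and `amplify` are hypothetical callables, not a real API).

def bootstrap_with_amplification(model, external_data, train, amplify, rounds=10):
    # Bootstrap: the only externally provided samples the model ever sees.
    model = train(model, external_data)
    for _ in range(rounds):
        # The amplified model generates a much larger (and hopefully better)
        # dataset, seeded from the external one...
        generated = amplify(model, seed=external_data)
        # ...and the same learning algorithm trains on it.
        model = train(model, generated)
    return model

# "Sample efficiency" divided by external data counts only len(external_data),
# even though almost all gradient updates come from `generated`.
```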
For context, I’m interested in questions like “If we had a big transformer that was being fine-tuned as a chatbot with millions of daily conversations, would it be up-to-date on the latest news of the day? What about local news? What about e.g. subculture drama? How often would people have to talk about something for it to be impressed into long-term memory?”
This sounds like something an appropriate amplification may well be able to help the model memorize, even when it’s mentioned only once, without changing the learning algorithm, at the cost of more training on the auxiliary data the amplification generates (in this case, probably with prompts drawn from the external datasets, which would need to be combed for rare details).
(I do understand that the question you’re asking is about what happens without auxiliary data. I’m commenting on how accounting for prompt engineering breaks estimates of the potential performance of the same learning algorithm: it becomes an issue of cost, not of limitations in the algorithm, in a way that’s different from scaling laws.)
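As a purely illustrative sketch of the auxiliary-data idea above (the function names and prompts are assumptions, not anything from this thread): the model itself combs the external conversations for once-mentioned details and expands each into many training examples, so ordinary fine-tuning can impress them into the weights.

```python
# Illustrative sketch only. `generate` is a hypothetical prompt -> text
# callable standing in for whatever interface the model exposes.

def build_auxiliary_data(generate, conversations, copies_per_detail=50):
    auxiliary = []
    for convo in conversations:
        # Comb the external data for rare details worth memorizing.
        listing = generate(
            "List any specific, rarely mentioned facts in this conversation, "
            "one per line:\n" + convo
        )
        for detail in (line.strip() for line in listing.splitlines()):
            if not detail:
                continue
            # One external mention becomes many auxiliary training examples
            # (paraphrases, Q&A pairs, etc.) for the same learning algorithm.
            for _ in range(copies_per_detail):
                auxiliary.append(generate("Write a Q&A pair about: " + detail))
    return auxiliary
```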
Right on. That’s a good point. So really I guess the conclusion is: Compute is the bottleneck; an AI chatbot or whatever could totally learn random facts the very first time it encounters them, if you had things set up to amplify that data into some auxiliary dataset and then train on it. Costs a few orders of magnitude more compute perhaps, but gets the job done. Right? (And this could be automated & “smart” in the sense that the AI could decide what stuff to memorize/internalize, what stuff to forget, and what stuff to add to some software database.)
Right. Of course, if the sample efficiency of learning improves, the cost goes down, but that’s not really crucial for anything. The learning part of AGI is already essentially solved; it just needs to be put in a place where it’s being fed the right data.
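For concreteness, here is a hedged sketch of the automated routing mentioned above (all names are hypothetical): each incoming fact is either expanded into auxiliary fine-tuning data, written to an external database, or dropped.

```python
# Illustrative routing of incoming facts (hypothetical helpers throughout).

def route_fact(fact, classify, expand_for_training, db, auxiliary):
    decision = classify(fact)  # expected: "memorize", "store", or "forget"
    if decision == "memorize":
        # Internalize: turn the fact into auxiliary training examples.
        auxiliary.extend(expand_for_training(fact))
    elif decision == "store":
        # Keep it in ordinary software storage instead of the weights.
        db[fact] = True
    # "forget": drop it entirely.
    return decision
```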