For context, I’m interested in questions like “If we had a big transformer that was being fine-tuned as a chatbot with millions of daily conversations, would it be up-to-date on the latest news of the day? What about local news? What about e.g. subculture drama? How often would people have to talk about something for it to be imprinted in long-term memory?”
This sounds like something an appropriate amplification scheme could help the model memorize, even for things mentioned only once, without changing the learning algorithm, at the cost of extra training on the auxiliary data generated by the amplification (in this case, probably with prompts drawn from the external datasets that need to be combed for rare details).
(I do understand that the question you are asking is about what happens without auxiliary data. I’m commenting on how accounting for prompt engineering breaks estimates of the potential performance of the same learning algorithm: it becomes an issue of cost, not a limitation of the algorithm, in a way that’s distinct from scaling laws.)
Right on. That’s a good point. So really I guess the conclusion is: compute is the bottleneck. An AI chatbot or whatever could totally learn random facts the very first time it encounters them, if you had things set up to amplify that data into an auxiliary dataset and then train on it. That costs a few orders of magnitude more compute, perhaps, but it gets the job done. Right? (And this could be automated and “smart” in the sense that the AI could decide what stuff to memorize/internalize, what stuff to forget, and what stuff to add to some software database.)
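To make that concrete, here’s a rough sketch of what such an amplify-then-train loop might look like. Every name in it (generate_paraphrases, make_qa_pairs, finetune, the example fact) is a hypothetical stand-in for illustration, not any real library’s API:

```python
# Sketch of the amplification loop described above: a fact the chatbot sees
# once gets expanded into many synthetic restatements and QA pairs, and the
# model is then fine-tuned on that auxiliary dataset so a single mention is
# enough to stick. All helpers here are hypothetical stand-ins.

from typing import List, Tuple


def generate_paraphrases(fact: str, n: int) -> List[str]:
    # Hypothetical: a real system would sample the model itself with a
    # "rephrase the following" prompt to get n distinct restatements.
    return [f"Restatement {i}: {fact}" for i in range(n)]


def make_qa_pairs(fact: str, n: int) -> List[Tuple[str, str]]:
    # Hypothetical: prompt the model to turn the fact into question/answer
    # pairs, which plausibly generalize better than raw repetition.
    return [(f"Question {i} about this fact?", fact) for i in range(n)]


def amplify(fact: str, n_paraphrases: int = 50, n_qa: int = 50) -> List[str]:
    """Turn one observed fact into an auxiliary training set."""
    examples = generate_paraphrases(fact, n_paraphrases)
    examples += [f"Q: {q}\nA: {a}" for q, a in make_qa_pairs(fact, n_qa)]
    return examples


def finetune(model, examples: List[str]) -> None:
    # Hypothetical: a few gradient steps on the auxiliary examples. This is
    # where the "few orders of magnitude more compute" gets spent.
    raise NotImplementedError


if __name__ == "__main__":
    rare_fact = "The Elm Street bakery closed on 2024-03-01."  # mentioned once
    aux_dataset = amplify(rare_fact)
    print(f"{len(aux_dataset)} auxiliary examples from a single mention")
    # finetune(chatbot_model, aux_dataset)  # omitted: no actual model here
```

In a real system the stubs would call the chatbot itself, and the “smart” part (deciding what to amplify, what to forget, and what to offload to a database) would just be another model call gating the loop.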
Right. Of course the cost goes down if the sample efficiency of learning improves, but that’s not really crucial for anything. The learning part of AGI is already essentially solved; it just needs to be put in a place where it’s getting fed the right data.