Thanks for the comment! I’ll add a sentence or a footnote for both loss and weights in the sections you mentioned. As for the forecasting in section 5.2, that claim is imagining something like this:
Different copies of a model can share parameter updates. For instance, ChatGPT could be deployed to millions of users, learn something from each interaction, and then propagate gradient updates to a central server where they are averaged together and applied to all copies of the model. (source)
This is slightly different from what is happening currently. As far as I know, models are not undergoing online learning: they are trained, they are deployed, and occasionally they are fine-tuned, but that’s it. There is no continuous learning through interaction. By learning here I mean weight updates; the weights are not updated after deployment, though more information can still be acquired through context injection.
It is not yet clear whether parallel learning would be efficient, or whether the data gathered through interactions could be efficiently utilized in parallel. But given that we already use batch learning during training, it does seem possible.
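To make the analogy concrete, here is a minimal sketch (in the spirit of federated averaging, not any lab's actual pipeline) of the update-sharing scheme the quoted passage imagines: each deployed copy computes a local update from its own interactions, a central server averages them, and the averaged update is applied to every copy, structurally the same as averaging per-example gradients within a batch. The functions `local_update` and `federated_round` and the toy squared-error loss are hypothetical stand-ins.

```python
import numpy as np

def local_update(weights: np.ndarray, interaction_data: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Hypothetical per-copy update from one deployment's interactions.

    Stand-in gradient of a toy squared-error loss; in reality this would
    come from backprop on whatever objective the interactions define.
    """
    grad = weights - interaction_data.mean(axis=0)
    return -lr * grad

def federated_round(weights: np.ndarray, per_copy_data: list[np.ndarray]) -> np.ndarray:
    """One round: gather updates from all copies, average them centrally, apply to all."""
    updates = [local_update(weights, data) for data in per_copy_data]
    averaged = np.mean(updates, axis=0)   # central server averages the updates
    return weights + averaged             # broadcast the new weights back to every copy

# Toy usage: three "copies" of the model each see different interaction data.
rng = np.random.default_rng(0)
shared_weights = rng.normal(size=8)
copies_data = [rng.normal(size=(16, 8)) for _ in range(3)]
shared_weights = federated_round(shared_weights, copies_data)
```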
Alternatively, shared knowledge could take the form of a central, ever-growing vector database like Pinecone, in which case compound AI systems could learn to efficiently store knowledge in and query that database, injecting the retrieved results into expanding context windows to simulate improved world models.
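A toy, in-memory sketch of that store/query/inject loop (Pinecone has its own client API; this is just a stand-in to show the mechanism, and the class and function names here are made up for illustration):

```python
import numpy as np

class SharedVectorStore:
    """Toy stand-in for a central vector database shared by all copies of the system."""

    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def add(self, text: str, embedding: np.ndarray) -> None:
        """Any copy can write new knowledge here as (text, embedding) pairs."""
        self.vectors.append(embedding / np.linalg.norm(embedding))
        self.texts.append(text)

    def query(self, embedding: np.ndarray, k: int = 3) -> list[str]:
        """Return the k most similar stored texts by cosine similarity."""
        q = embedding / np.linalg.norm(embedding)
        scores = [float(q @ v) for v in self.vectors]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.texts[i] for i in top]

def build_prompt(store: SharedVectorStore, question: str, q_embedding: np.ndarray) -> str:
    """Inject retrieved knowledge into the context window instead of updating weights."""
    retrieved = store.query(q_embedding)
    return "Relevant facts:\n" + "\n".join(retrieved) + f"\n\nQuestion: {question}"
```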