training on whichever distribution does give human-level reasoning might have substantially different scaling regularities.
I agree again. I talked a little bit about this at the end of my post, but overall I just don’t have any data for scaling laws on better distributions than the one in the Chinchilla paper. I’d love to know the scaling properties of training on scientific tasks and incorporate that into the model, but I just don’t have anything like that right now.
Also, this post is more about the method than about any conclusions I may have drawn. I hope this model can be updated with better data some day.