The primary thing I’m aiming to predict using this model is when LLMs will be capable of performing human-level reasoning/thinking reliably over long sequences.
Yeah, and I agree this model seems to be aiming at that. What I was trying to get at in the later part of my comment is that I’m not sure you can get human-level reasoning on text as it exists now (perhaps because it fails to capture certain patterns), that it might require more engagement with the real world (because maybe that’s how you capture those patterns), and that training on whichever distribution does give human-level reasoning might have substantially different scaling regularities. But I don’t think I made this very clear and it should be read as “Rick’s wild speculation”, not “Rick’s critique of the model’s assumptions”.
training on whichever distribution does give human-level reasoning might have substantially different scaling regularities.
I agree again. I talked a little bit about this at the end of my post, but overall I just don’t have any data for scaling laws on better distributions than the one in the Chinchilla paper. I’d love to know the scaling properties of training on scientific tasks and incorporate that into the model, but I just don’t have anything like that right now.
Also, this post is more about the method than about any conclusions I may have drawn. I hope this model can be updated with better data some day.