I think your critiques are great since you’re thinking clearly about how this approach is supposed to work. At a high level my reply to your comment is something like, “I basically agree, but don’t think that anything you mentioned is devastating. I’m trying to build something that is better than Bio Anchors, and I think I probably succeeded even with all these flaws.”
That said, I’ll address your points more directly.
My understanding is that the irreducible part of the loss has nothing (necessarily) to do with “entropy of natural text” and even less with “roughly human-level”—it is the loss this particular architecture for this particular training regime can reach in the limit on this particular training data distribution.
That’s correct, but if the hypothesis space is sufficiently large, then the term E in the Hoffmann et al. loss equation should actually correspond to the entropy of natural language. The reason comes down to how entropy is defined. Roughly speaking, the entropy of language can be defined as the limiting value of a model-based approximation to the entropy as the capacity of the approximating model goes to infinity. This is strikingly similar to what I’m doing with the Hoffmann et al. equation.
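For concreteness, the parametric form fit by Hoffmann et al. is

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where N is the parameter count and D is the number of training tokens; their fitted constants were roughly E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28 (quoting their reported values loosely from memory). The relevant feature is just that as N and D go to infinity, L(N, D) approaches E, which is what makes it tempting to read E as an estimate of the entropy of the training distribution.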
The main difference is that, in the case of Hoffmann et al., we can’t be sure that a sufficiently large model trained on a sufficiently large amount of data would actually converge to the entropy of the English language. In fact, we have reason to believe that it wouldn’t, due to constraints like the limited size of the context window.
However, I don’t see any fundamental issue with treating E as the entropy of whatever distribution we’re talking about, so long as our hypothesis space is vast enough.
In this post, I’m not trying to lean on the Hoffmann et al. results per se, except to the extent that they provide the best currently available data on scaling language models. Indeed, I initially didn’t present any results at all, until someone persuaded me to put a preliminary CDF over TAI timelines in the post to make it more interesting. I’m mostly trying to explain the approach, which can certainly be improved upon if we get better data about the scaling properties of models trained on a more relevant distribution, like scientific papers.
Human level of token prediction is way worse than probably any GPT, so why would that loss indicate human level of reasoning?
I think this question might rest on a misconception that also came up a few times when I was sharing this draft with people, so it’s probably important to put this caveat into the post more directly. This approach has almost nothing to do with human-level ability to predict tokens. It’s based on something quite different: roughly, whether humans can distinguish long outputs written by a language model from outputs written by humans. There are good reasons to believe that these two criteria should yield very different results.
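To give a very rough sketch of why per-token prediction ability and the ability to tell long outputs apart come apart (this is a simplified illustration, not the exact formulation in the post): suppose a model’s reducible loss is δ nats per token above the entropy of the distribution, i.e. a per-token KL divergence of δ from the true distribution. Over a k-token output, the expected log-likelihood ratio between human text and model text is then roughly kδ, so even an ideal judge only starts to reliably tell them apart once

$$k \gtrsim \frac{c}{\delta},$$

where c is a small constant (a few nats) that I’m introducing purely for illustration. The point is that what matters for this approach is how a small per-token gap compounds over long outputs, not whether the model’s per-token loss beats a human’s per-token loss.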
Also, new (significant) papers are not sampled from the distribution of papers. They are out of distribution because they go beyond all previous papers. So I’m not sure your formula doesn’t just measure the ability to coherently rehash stuff that is already known.
This is a good point. Ultimately, my response is that I’m trying to measure something like the hardness of training a model to think reliably over long sequences, rather than the hardness of training a model to copy the exact distribution it’s trained on. We can already see that current models, like GPT-3.5, can often produce novel outputs (e.g. their poetry) despite the intuition that they shouldn’t be able to “go beyond” their training distribution. I think this points to something important and true, which is that language models seem to be learning the skills that in practice allow them to write content on par with what humans write, rather than merely learning to emulate their training distribution.