Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn't 1 token = 13-17 bits be a more accurate equivalence?
My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space, so it makes more sense to define the units of its context in terms of its fundamental units, i.e. tokens. Of course, each token carries a lot more information-theoretic content than a single bit, but this is also why a single NLOP is much more powerful than a single FLOP. I agree that tokens are probably not the correct measure, since they are too object-level, and there is likely some kind of 'semantic bit' idealisation which needs to be worked out.
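For reference, the 13-17 bit figure above is just log₂ of the vocabulary size. A minimal sketch (assuming a uniform distribution over tokens, which overstates the true average information content):

```python
import math

# Naive information content of one token, assuming a uniform distribution
# over the vocabulary. Real token distributions are far from uniform, so
# this is an upper bound on the average information per token.
for vocab_size in (10_000, 50_000, 100_000):
    print(f"vocab {vocab_size:>7,}: ~{math.log2(vocab_size):.1f} bits/token")
# vocab  10,000: ~13.3 bits/token
# vocab  50,000: ~15.6 bits/token
# vocab 100,000: ~16.6 bits/token
```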
Processor registers as a better analog for the context window
One caveat I’d like to discuss: in the post, you describe the context window of the NLPU as the analog of a computer’s RAM. I think a more accurate analog might be processor registers.
Like the context window, registers are the memory bits directly connected to the computing unit, whereas it takes an explicit instruction to load information from RAM before the CPU can use it. RAM sits in the middle of the memory hierarchy, while registers sit at its top.
I think I discuss this in the memory hierarchy section of the post. I agree that it is unclear what the best conceptualisation of the context window is, and that it is not necessarily directly comparable to RAM; it may be more like processor registers. The main point is that current scaffolded LLM systems have a two-level memory hierarchy, whereas computers have evolved a fairly complex and highly optimised multi-level system. It may be that we eventually develop such a system, or its equivalent, for LLMs as well. I actually do not know how the memory hierarchy of the earliest computers worked: did they already have a register → RAM → disk distinction?
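To make the two-level hierarchy concrete, here is a minimal sketch, not tied to any particular framework, of a scaffold where the context window plays the register role and an external store plays the RAM/disk role; `call_llm`, `remember`, and `recall` are hypothetical placeholders rather than a real API:

```python
# Minimal sketch of a two-level memory hierarchy for a scaffolded LLM:
# the context window as the small, fast level and an external store as
# the large, slow level. All names here are hypothetical placeholders.

MAX_CONTEXT_CHARS = 8_000          # stand-in for the context-window budget
external_store: list[str] = []     # second level of the hierarchy

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"<completion for a {len(prompt)}-char prompt>"

def remember(text: str) -> None:
    """Spill information out of the context window into external memory."""
    external_store.append(text)

def recall(query: str, k: int = 3) -> list[str]:
    """Load relevant items back in before the next call (naive keyword
    match standing in for embedding-based retrieval)."""
    return [t for t in external_store if query.lower() in t.lower()][:k]

def step(task: str, scratchpad: str) -> str:
    """One NLOP: pack retrieved memories plus the scratchpad into the
    context window, then call the model."""
    context = ("\n".join(recall(task)) + "\n" + scratchpad)[-MAX_CONTEXT_CHARS:]
    return call_llm(f"{context}\n\nTask: {task}")
```

A multi-level equivalent would presumably add further tiers (summarised long-term stores, retrieval indices, archival storage), mirroring the register → RAM → disk chain.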
I think this might be an additional factor, on top of the increased power and reliability of LLMs, that made us wait so long after GPT-3 before beginning to design complicated chains of LLM calls. A single LLM can store enough data in its context window to do many useful tasks: as you describe, there are many NLPU primitives to discover and exploit. A CPU with no RAM, on the other hand, is basically an over-engineered calculator; it only becomes truly useful once embedded in a von Neumann architecture.
This is an interesting hypothesis. My alternative hypothesis is essentially a combination of a.) reliability and instruction following with GPT-3 being just too poor for this to work appreciably, with some kind of barrier only broken through with GPT-4, and b.) there simply not having been that much time: the GPT-3 API only became widely usable in mid-2021 IIRC, so there was only about a year and a bit between that and the ChatGPT release, which is hardly any time to start iterating on this stuff.
Multimodal models
If the natural type signature of a CPU is bits → bits, the natural type of the natural language processing unit (NLPU) is strings → strings.
With the rise of multimodal (image + text) models, the NLPU could be required to deal with data types other than “string”, such as image embeddings, since images cannot be efficiently converted into natural text.
Indeed. It should be interesting to see whether or not we converge to some canonical datatype. The reason strings are so nice is that they compose easily and are incredibly flexible. The alternative is directly chained architectures which communicate in embeddings, and which can then be arbitrarily multimodal. Whether this works or not depends on how ‘internalised’ the cognition of the system is. The current agentic LLM trend is to externalise, which is, imho, good from an interpretability and steerability perspective. It may reverse.
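As a purely illustrative sketch of the two signatures being contrasted here, with hypothetical type names:

```python
from typing import Callable, Sequence, Union

Embedding = Sequence[float]  # hypothetical stand-in for an image/audio embedding

# The canonical "strings -> strings" NLPU: composes like an ordinary function.
TextNLPU = Callable[[str], str]

# A multimodal alternative that passes embeddings (or a mix of strings and
# embeddings) between stages instead of plain text.
Message = Union[str, Embedding]
MultimodalNLPU = Callable[[Sequence[Message]], Sequence[Message]]

def compose(f: TextNLPU, g: TextNLPU) -> TextNLPU:
    """String-typed NLPUs chain trivially; embedding-typed ones would need
    every stage to agree on a shared embedding space."""
    return lambda s: g(f(s))
```

The string-typed version chains trivially and stays human-readable at every step; the embedding-typed version is roughly the internalised end of the trade-off described above.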