A general way to look at it is that we are just upgraded primates: the real AI software was our language the whole time. Language is what makes it possible for us to think, plan, and develop abilities beyond our inherited ‘primitive tool-using pack primate’ set. Language itself is full of patterns and built-in ‘algorithms’ that let us think, albeit with rather terrible error when we leave the rails of the situations it covers, and with many hidden biases.
But on a more abstract level, what we did when we trained the LLM was ask our black box to learn as many text completions as possible with the weights it has.
Meaning that we fed [1 trillion tokens] → [at most 65 billion weights] and we asked for it to predict the next token from any arbitrary position in that 1 trillion.
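To make “predict the next token from any arbitrary position” concrete, here is a minimal sketch of how every position in the token stream becomes its own (context, next token) training example. The whitespace “tokenizer” and the tiny context window are stand-ins for illustration, not any real model’s values:

```python
# Minimal sketch: every position in the corpus is its own prediction problem.
# Whitespace "tokenization" and the tiny context window are illustrative only.

def training_examples(tokens, context_len=4):
    """Yield (context, target) pairs for next-token prediction."""
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_len):i]
        yield context, tokens[i]

corpus = "write a python3 program to print hello world".split()
for ctx, nxt in training_examples(corpus):
    print(ctx, "->", nxt)
```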
There is not enough space to do this directly, because the task says: from any [context_buffer_length] window, always produce the next token. The brute-force lookup table would be [context length] x [number of bytes per token] x [training set size]. Or, if I didn’t make an error, about 8192 terabytes, while we have at most 260 gigabytes of weights. Text is generally only compressible about 2:1, so the best a naive approach could possibly do is predict a small fraction of the tokens, with huge error on the rest. Learning a huge hash table won’t work either, because you would essentially need 1 trillion * [hash size] hashes of the prior tokens in the string just to remember each next token. And the problem with hashing is that two similar strings that share the same next word end up with completely different hash representations (not to mention all the collisions...).
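Here is the arithmetic I believe sits behind those figures, assuming a 2048-token context and 4 bytes per token and per weight (my guesses, since the numbers aren’t stated), plus a quick demonstration of why hashing the context buffer can’t generalize:

```python
import hashlib

# Back-of-the-envelope size of a brute-force lookup table.
# Assumed numbers: 2048-token context, 4 bytes per token, 1T training tokens.
context_length = 2048
bytes_per_token = 4
training_tokens = 1_000_000_000_000

table_bytes = context_length * bytes_per_token * training_tokens
print(table_bytes / 1e12, "TB")   # 8192.0 TB

# Versus the model itself: 65B weights at 4 bytes each.
weight_bytes = 65_000_000_000 * 4
print(weight_bytes / 1e9, "GB")   # 260.0 GB

# Why hashed contexts can't generalize: two near-identical prompts that
# should share the same next token hash to completely unrelated keys.
a = "Write a python3 program to print"
b = "Write a Python 3 program that prints"
print(hashlib.sha256(a.encode()).hexdigest()[:16])
print(hashlib.sha256(b.encode()).hexdigest()[:16])
```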
Also, the weights are not allowed to just cache [context_buffer_hash, next_token] pairs; they are the parameters of mathematical functions arranged in a way that is inherently good at learning these kinds of patterns.
So the smallest representation (the most compression) ends up being for the machine to represent, say, all the training set examples of “Write a python3 program to print ‘Hello world’”, or “Write a program to print ‘Bob is your uncle’”, or 500 other examples that all have the common pattern of “write a program that causes $string_given to print”.
So it compresses all those examples it saw into this general solution. And if you think about it, there are many different ways the prompt could be written to ask for this simple print program, and it compresses those as well. Then there are multiple output languages, but most programming languages express the simple task of “print” with compressible commonalities...
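As a loose analogy only (the model does not literally store Python functions), the compressed form of those hundreds of examples is something like one parameterized template, with the string and the output language as the free slots; the template table below is invented for illustration:

```python
# Analogy: one parameterized rule standing in for hundreds of memorized
# "write a program that prints X" completions. Templates are illustrative.

TEMPLATES = {
    "python3": 'print("{s}")',
    "bash":    'echo "{s}"',
    "c":       '#include <stdio.h>\nint main(void) {{ printf("{s}\\n"); return 0; }}',
}

def print_program(string_given, language="python3"):
    """One rule covers every phrasing of 'write a program to print $string_given'."""
    return TEMPLATES[language].format(s=string_given)

print(print_program("Hello world"))
print(print_program("Bob is your uncle", language="bash"))
```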
Also, notice something above? It has seen “Hello, World!” so many times that it falsely matched my query onto the canonical “Hello, World!”, with that exact syntax, instead of the literal string the prompt asked for. This is consistent with compression. It’s also why this exists. The machine remembers both Monty Hall and Monty Fall, but it often pattern-matches one onto the other and gives the wrong answer.
It didn’t “want” to develop these compressions; that’s just where the training gradient pushed it: greater compression leaves more weights free for other things, which scores better on the objective, and so on. The transformer architecture also encourages compressed representations, though I don’t work on that part of the stack so I don’t know why.
Think of it this way, as just “compressing text”, and you realize that developing “characters” is more compact. The machine doesn’t care that Sherlock Holmes isn’t real; the key thing is that Sherlock has a compressible character. That is, rather than remembering all the public domain text of the detective stories, it’s more compressed to have a function that emits the patterns of tokens that “Sherlock” would emit.
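Purely as an analogy again, a character stored as a generating rule takes far less space than the stories it came from; the traits below are invented for illustration:

```python
# Analogy: store a style function instead of every Sherlock story verbatim.
# The listed traits are made up for illustration.

SHERLOCK_TRAITS = [
    "notices a tiny physical detail",
    "chains it into a confident deduction",
    "explains the chain to Watson with a flourish",
]

def sherlock_passage(observation):
    """Emit the *pattern* of a Sherlock scene instead of recalling one verbatim."""
    return (f"Holmes glances at the {observation}: he "
            f"{SHERLOCK_TRAITS[0]}, {SHERLOCK_TRAITS[1]}, and "
            f"{SHERLOCK_TRAITS[2]}.")

print(sherlock_passage("mud on the visitor's boots"))
```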
Unfortunately, since Sherlock isn’t a real detective, if you ask an LLM to solve a problem “like Sherlock Holmes”, you’ll get a bunch of prose full of unlikely leaps of logic and specious conclusions delivered with far too much certainty.
Viewed this way, as a machine that really does think one token at a time, always taking the greedy approach, there are problems this won’t solve and a lot of problems it will.
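In code, that one-token-at-a-time greedy loop looks roughly like the sketch below. The lookup-table “model” is a placeholder for the trained network, and real systems often sample from the predicted distribution rather than always taking the argmax:

```python
# Sketch of greedy autoregressive decoding. `next_token_probs` stands in
# for the trained model; real decoders frequently sample instead of
# always taking the single most likely token.

def next_token_probs(context):
    # Toy "model": looks only at the last token.
    table = {
        "hello": {"world": 0.7, "there": 0.3},
        "world": {"!": 0.9, ".": 0.1},
        "!": {"<eos>": 1.0},
    }
    return table.get(context[-1], {"<eos>": 1.0})

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)   # greedy: commit to the argmax, then repeat
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(generate(["hello"]))   # ['hello', 'world', '!']
```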