Obviously LLMs memorize some things. The easy example is that GPT-4's pretraining dataset probably contained lots of cryptographically hashed strings, which are impossible to infer from the overall patterns of language. Predicting those accurately absolutely requires memorization; there's no other way short of the LLM breaking the hash function itself. Then there are in-between things like Barack Obama's age, which you can partly infer from other language (a president is probably not 10 years old or 230), but to get it right within the plausible range you still just have to memorize it.
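To make the hash point concrete, here's a minimal sketch (plain Python hashlib, nothing LLM-specific): two nearly identical strings hash to completely unrelated digests, so there is no linguistic pattern connecting text to its hash that a model could generalize from.

```python
import hashlib

# A one-character change in the input produces a digest that shares no structure
# with the previous one, so the text->hash mapping can only be memorized, not inferred.
for s in ["Barack Obama", "Barack Obamb"]:
    print(s, "->", hashlib.sha256(s.encode()).hexdigest())
```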
Where it gets interesting is when you leave the space of token strings the machine has seen but land somewhere in the input space "in between" strings it has seen. That's why this works at all and exhibits any intelligence.
For example, if it has seen a whole bunch of patterns like "A->B" and "C->D", then given the input "G" it will complete with "->H".
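A rough sketch of probing this kind of "in between" completion, assuming the OpenAI Python client (the model name is illustrative; any chat model would do):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The literal continuation "G->H" need not appear anywhere in the training data;
# the model produces it by interpolating over the pattern, not by recall.
prompt = "Continue the pattern:\nA->B\nC->D\nE->F\nG"
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```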
Or, for presidents' ages, what if the president isn't real? https://chat.openai.com/share/3ccdc340-ada5-4471-b114-0b936d1396ad
There are fake/fictional presidents in the training data.