I assume OP thought that there was some specific place in the training data the LLM was replicating.
Indeed, and my point is that that seems entirely probable. He asked for a dictionary definition of words like ‘cat’ for children, and those absolutely exist online and are easy to find, and I gave an example of one for ‘cat’.
(And my secondary point was that ironically, you might argue that GPT is generalizing and not memorizing… because its definition is so bad compared to an actual Internet-corpus definition for children, and is bad in that instantly-recognizable ChatGPTese condescending talking-down bureaucrat smarm way. No human would ever define ‘cat’ for 11yos like that. If it was ‘just memorizing’, the definitions would be better.)
Whatever one means by “memorize” is by no means self-evident. If you prompt ChatGPT with “To be, or not to be,” it will return the whole soliloquy. Sometimes. Other times it will give you an opening chunk and then an explanation that that’s the well-known soliloquy, etc. By poking around I discovered that I could elicit the soliloquy by giving it prompts consisting of syntactically coherent phrases, but if I gave it prompts that were not syntactically coherent, it didn’t recognize the source until I did a bit more prompting. I’ve never found the idea that LLMs were just memorizing to be very plausible.
In any event, here’s a bunch of experiments explicitly aimed at memorization, including the Hamlet soliloquy stuff: https://www.academia.edu/107318793/Discursive_Competence_in_ChatGPT_Part_2_Memory_for_Texts_Version_3
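(For anyone who wants to try this kind of probe themselves, here is a minimal sketch of the coherent-vs-incoherent prompting described above, assuming the OpenAI chat completions API. The model name, the prompt fragments, and the crude recognition check are illustrative assumptions on my part, not the exact setup used in the linked experiments.)

```python
# Rough sketch of the memorization probe described above: send the model a
# prompt fragment that is either a syntactically coherent phrase from the
# soliloquy or a scrambled (incoherent) version, then check whether the reply
# continues or identifies the source text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

probes = {
    "coherent": "To be, or not to be, that is the question:",
    "incoherent": "be not that to is, or the question be, to:",
}

for label, fragment in probes.items():
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; the thread used ChatGPT
        messages=[{"role": "user", "content": fragment}],
    )
    text = reply.choices[0].message.content
    # Crude check: did the model continue the soliloquy or name its source?
    recognized = "Hamlet" in text or "slings and arrows" in text
    print(f"{label}: recognized source = {recognized}")
```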
I was assuming it was drawn from lots of places, widely spread. What I was curious about was a specific connection in the available data between the terms I used in my prompts and the levels of language. gwern’s comment satisfies that concern.