For context, I have moderate experience working with LLMs, and I think this is a great summary for laypeople. I’ve observed humans have a general tendency to anthropomorphize behavior that seems “intelligent”, and it seems more productive to resist that tendency and seek better explanations.
At the risk of becoming too technical, one topic that could help bridge the “From predictor to generator” and “A better guessing machine” sections is a bit more detail on how outputs are chosen in modern models. The greedy (choose the most likely next word) and random (sample from the distribution of next words) strategies are mentioned, but a widely used alternative is some form of beam search or multi-token sampling, which explores multiple candidate continuations and chooses its next words based on some scoring metric.
Metaphorically, this behavior seems “human”: one can imagine a writer beginning a sentence, then hastily deleting it in favor of something more coherent. But the metric for the human writer is “does this clearly communicate the idea I’m trying to convey?”, while the metric for the LLM is generally some variant of “is this output statistically likely to match the training data?”
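To make the three strategies concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `TOY_MODEL` is a made-up lookup table standing in for a real language model, and the function names are hypothetical rather than any library's API. The probabilities are chosen so that greedy decoding commits to “the” and ends up with a less likely sentence than beam search, which keeps “a” alive long enough to find the better continuation.

```python
import math
import random

# Hypothetical toy "model": next-token probabilities that depend only on
# the previous token. A real LLM conditions on the whole prefix.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.40, "dog": 0.35, "bird": 0.25},
    "a": {"cat": 0.90, "dog": 0.10},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def next_token_probs(prefix):
    """Return the distribution over the next token given the prefix."""
    return TOY_MODEL[prefix[-1]]


def greedy_decode(max_len=5):
    """Always append the single most likely next token."""
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        probs = next_token_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq


def sample_decode(rng, max_len=5):
    """Draw each next token at random, weighted by its probability."""
    seq = ["<s>"]
    while seq[-1] != "</s>" and len(seq) < max_len:
        probs = next_token_probs(seq)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        seq.append(rng.choices(tokens, weights=weights, k=1)[0])
    return seq


def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(max_len - 1):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((logp, seq))  # finished; carry forward
                continue
            for tok, p in next_token_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams[0][1]


if __name__ == "__main__":
    print("greedy:", greedy_decode())
    print("sample:", sample_decode(random.Random(0)))
    print("beam  :", beam_search(beam_width=2))
```

In this toy setup the greedy path has probability 0.55 × 0.40 = 0.22, while the beam search path has 0.45 × 0.90 = 0.405. With `beam_width=1` the beam search reduces to greedy decoding.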
Thanks! I was not aware of beam search. Any good references to learn about it?
Hugging Face has a nice guide that covers popular approaches to generation circa 2020. I recently read about tail-free sampling as well. I’m sure other techniques have been developed since then, though I’m not immersed enough in the NLP state of the art to be aware of them.
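As a rough illustration of the kind of technique such guides cover, here is a minimal sketch of top-k and nucleus (top-p) truncation, two sampling variants that were popular around that time. The function names and the example distribution are my own assumptions for illustration, not code from the guide or from any library. Both work by discarding the unlikely tail of the next-word distribution and renormalizing before sampling.

```python
import random


def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p (the "nucleus"), then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    # Made-up next-word distribution for illustration only.
    dist = {"cat": 0.50, "dog": 0.30, "bird": 0.15, "xylophone": 0.05}
    rng = random.Random(0)
    print("top-k (k=2):  ", top_k_filter(dist, 2))
    print("top-p (p=0.9):", top_p_filter(dist, 0.9))
    print("sampled:      ", sample(top_p_filter(dist, 0.9), rng))
```

With this distribution, `k=2` keeps only “cat” and “dog”, while `p=0.9` also keeps “bird”; in both cases “xylophone” can never be sampled.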
If you’re curious, the most interesting pure stochastic sampling variant I’ve seen lately is “Contrastive Search Is What You Need For Neural Text Generation”, Su & Collier 2022. (Unfortunately, it was only benchmarked on very small models, and AFAIK no one has generated samples from large GPT-3-scale models or provided a quantitative or qualitative description.)
Thanks! I had actually skimmed this recently but forgot to add it to my reading list. The cherry-picked examples for text generation seem a bit low-information, but it would be interesting to see their technique applied to a larger model.
Thanks, I added a parenthetical sentence to indicate this possibility.