Gallabytes: … This is a pretty big factor in why I expect some kind of diffusion to eventually overtake AR on language modeling too. We don’t actually care about the exact words anywhere near as much as we care about the ideas they code for, and if we can work at that level diffusion will win.
It will probably keep being worse on perplexity while being noticeably smarter, less glitchy, and easier to control.
Sherjil Ozair: Counterpoint: one-word change can completely change the meaning of a sentence, unlike one pixel in an image.
I think this is an interesting point made by Gallabytes, and that Sherjil misses the heart of it here. Adding noise to a sentence must be done in semantic space, not token space. You shift the meaning and implications of the sentence subtly by swapping words for synonyms or near-synonyms, preserving most of the meaning while changing its ‘flavor’. I expect this subtle noising of existing text would greatly increase the value we can extract from existing datasets, by making meaning more salient to the algorithm than irrelevant statistical specifics of tokens. The hard part is ensuring that the word substitution in fact moves only a small distance in semantic space. (A rough sketch of what such a noising step might look like follows the exchange below.)
yeah I basically think you need to construct the semantic space for this to work, and haven’t seen much work on that front from language modeling researchers.
drives me kinda nuts because I don’t think it would actually be that hard to do, and the benefits might be pretty substantial.
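To make the idea concrete, here is a minimal sketch of one way to do semantic-space noising: propose a synonym substitution, then keep it only if the sentence embedding stays close to the original. Everything here is my own illustrative assumption, not anything proposed in the exchange above: the sentence-transformers encoder and model name, the toy SYNONYMS table (standing in for a real lexical-substitution source), and the 0.9 similarity threshold.

```python
# Sketch: noise a sentence in semantic space via constrained synonym swaps.
# Assumes the sentence-transformers package is installed; SYNONYMS is a toy
# stand-in for whatever lexical-substitution source you would actually use.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

SYNONYMS = {  # illustrative only
    "big": ["large", "sizable"],
    "smart": ["clever", "intelligent"],
    "quickly": ["rapidly", "swiftly"],
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_noise(sentence: str, max_swaps: int = 2, min_sim: float = 0.9) -> str:
    """Swap a few words for near-synonyms, rejecting any swap that moves the
    sentence embedding too far from the original (the 'small step' constraint)."""
    original_emb = encoder.encode(sentence)
    words = sentence.split()
    swaps = 0
    for i in random.sample(range(len(words)), k=len(words)):
        if swaps >= max_swaps:
            break
        candidates = SYNONYMS.get(words[i].lower())
        if not candidates:
            continue
        trial = words.copy()
        trial[i] = random.choice(candidates)
        # Accept the substitution only if the result stays close in embedding space.
        if cosine(original_emb, encoder.encode(" ".join(trial))) >= min_sim:
            words, swaps = trial, swaps + 1
    return " ".join(words)

print(semantic_noise("the big model learns quickly"))
```

The embedding check is what distinguishes this from naive synonym replacement: it is a crude proxy for the constructed semantic space Gallabytes is pointing at, in which you could add noise directly rather than filtering token-level edits after the fact.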