I’ve been continuing to think about this. I eventually remembered that the place I remembered the idea from was pervade conversations with other AI researchers! Sorry to have sent you on a wild goose chase!
It would be interesting to try an experiment with this. Perhaps doing hierarchical clustering on Open Web Text (an early not-too-huge dataset from GPT-2 days). Then get an LLM worth a large context window to review a random subset of each cluster and write a description of it (including an estimate of factual validity). Then, when training, those descriptions would be non-predicted context given to the model. If you do use hierarchical clustering, this will result in a general description and some specific subtype descriptions for every datapoint.
I’ve been continuing to think about this. I eventually remembered that the place I remembered the idea from was pervade conversations with other AI researchers! Sorry to have sent you on a wild goose chase!
It would be interesting to try an experiment with this. Perhaps doing hierarchical clustering on Open Web Text (an early not-too-huge dataset from GPT-2 days). Then get an LLM worth a large context window to review a random subset of each cluster and write a description of it (including an estimate of factual validity). Then, when training, those descriptions would be non-predicted context given to the model. If you do use hierarchical clustering, this will result in a general description and some specific subtype descriptions for every datapoint.