Update: we asked GPT-3 to create some tags, and it suggested, among other things: “Robotic Autonomy”
For which article? Was it an article about Robotic Autonomy? Or did you just give it a list of LW tags and have it create more?
And here is me asking it to extend the list of all Rationality tags:
Long outputs tend to deteriorate naturally, as the model tries to reproduce the existing deterioration and accidentally adds some more. Better: sample one tag at a time, and shuffle the inputs every time to access different subdistributions. (I wonder how much the subdistributions differ between two random shuffles...) If you output the tag whose minimum probability across a hundred subdistributions is highest, I bet that’ll produce a tag that’s not in the inputs.
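Concretely, something like this max-min rule (the `next_tag_distribution` helper and the prompt format are stand-ins for whatever LM call you'd actually use, not an existing API):

```python
import random

# Hypothetical helper: given a prompt ending in "New tag:", return the model's
# probability for each candidate tag as the next completion. This stands in
# for a real LM call (e.g. a completions endpoint with logprobs); the name and
# signature are assumptions.
def next_tag_distribution(prompt: str) -> dict[str, float]:
    raise NotImplementedError

def propose_tag(existing_tags: list[str], candidates: list[str], n_shuffles: int = 100) -> str:
    """Max-min rule: pick the candidate whose minimum probability across
    shuffled prompts is highest."""
    worst_case = {c: float("inf") for c in candidates}
    for _ in range(n_shuffles):
        shuffled = existing_tags[:]
        random.shuffle(shuffled)  # each shuffle exposes a different subdistribution
        prompt = "Existing tags: " + ", ".join(shuffled) + "\nNew tag:"
        probs = next_tag_distribution(prompt)
        for c in candidates:
            worst_case[c] = min(worst_case[c], probs.get(c, 0.0))
    return max(worst_case, key=worst_case.get)
```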
Shuffling would also be good to combat the alphabetic order, which has got to be skewing output somehow.
I just did it again, asking it to extend the list of all the AI tags.
I expect you’d get better results by using older, less hyped NLP techniques that are designed for this sort of thing:
https://stackoverflow.com/questions/15377290/unsupervised-automatic-tagging-algorithms
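For instance, a minimal sketch with scikit-learn using plain LDA topic modeling (the parameter values are guesses you'd want to tune, and you'd feed in the actual post texts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def suggest_tag_topics(posts: list[str], n_tags: int = 50, top_words: int = 5) -> list[list[str]]:
    """Fit plain LDA over the post texts and return each topic's top words
    as a candidate tag label. All parameter values here are guesses."""
    vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
    doc_term = vectorizer.fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=n_tags, random_state=0)
    lda.fit(doc_term)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-top_words:][::-1]]
            for topic in lda.components_]
```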
The tagging work that’s already been done need not be a waste: you can essentially use it as training data for the kind of tags you’d like an automated system to discover and assign. For example, tweak the hyperparameters of a topic-modeling system until it is really good at independently rediscovering/reassigning the tags that have already been assigned manually.
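Sketched out, that tuning loop could look something like this (assuming one primary manual tag per post for simplicity; the score is just adjusted Rand agreement between the manual tags and each post's dominant topic):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_rand_score

def best_topic_count(posts: list[str], manual_tags: list[str],
                     candidates=(20, 50, 100, 200)) -> tuple[int, float]:
    """Crude hyperparameter sweep: fit LDA at several sizes and score how well
    each post's dominant topic agrees with its manually assigned tag."""
    vectorizer = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
    doc_term = vectorizer.fit_transform(posts)
    best_n, best_score = candidates[0], float("-inf")
    for n in candidates:
        lda = LatentDirichletAllocation(n_components=n, random_state=0)
        dominant_topic = lda.fit_transform(doc_term).argmax(axis=1)
        score = adjusted_rand_score(manual_tags, dominant_topic)
        if score > best_score:
            best_n, best_score = n, score
    return best_n, best_score
```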
An advantage of the automated approach is that you should be able to reapply it to other document corpora: for example, autogenerate tags for the EA Forum, for all AI-alignment-related papers/discussion off LW, or for the entire AI literature in order to help with/substitute for this job: https://intelligence.org/2017/12/12/ml-living-library/ (especially if you can get some kind of hierarchical tagging to work).
I’ve actually spent a while thinking about this sort of problem and I’m happy to video call and chat more if you want.
In this case someone just gave it a list and asked it to create more. (I do think the ideal process here would have been to feed it some posts + corresponding tag lists, and then give it a final post with a “Tags: …” prompt. But that was a bit more work and nobody has done it yet, AFAICT.)
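For reference, the prompt construction I have in mind is roughly this (the formatting details are made up for illustration):

```python
def build_tagging_prompt(examples: list[tuple[str, list[str]]], new_post: str) -> str:
    """Few-shot prompt: several (post, tag list) pairs, then the new post with
    a trailing "Tags:" left for the model to complete."""
    parts = [f"{post}\nTags: {', '.join(tags)}\n" for post, tags in examples]
    parts.append(f"{new_post}\nTags:")
    return "\n".join(parts)
```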