You can probably avoid the generation of crank works and fiction by training a new version of GPT in which every training example is labeled with <year of publication> and <subject matter>, which GPT has access to when it predicts an example. If you then write a prompt conditioned on something like <year: 2040> <subject matter: peer-reviewed physics publication>, you can easily tell GPT to avoid fiction and crank works, as well as make it model future scientific progress.
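To make the labeling idea concrete, here is a minimal sketch of how the tags might be attached, assuming they are simply prepended as plain text to each document. The function names `tag_example` and `conditioned_prompt` are made up for illustration, and nothing here reflects how any actual GPT model was trained.

```python
def tag_example(text: str, year: int, subject: str) -> str:
    """Prefix a document with its metadata tags (hypothetical plain-text format)."""
    return f"<year: {year}> <subject matter: {subject}>\n{text}"


def conditioned_prompt(year: int, subject: str, opening: str = "") -> str:
    """Build a generation prompt in the same tag format used for training examples."""
    return tag_example(opening, year, subject)


# Training time: every example carries its real metadata.
train_text = tag_example("We measure the muon lifetime...", 1962,
                         "peer-reviewed physics publication")

# Generation time: ask for a tag combination that never appeared in training.
prompt = conditioned_prompt(2040, "peer-reviewed physics publication", "Abstract:")
print(prompt)
```

Because the prompt uses the same prefix format as the training data, the model would see the tags as ordinary context to condition on. Whether it would extrapolate sensibly to an unseen year is the open question.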
Hmm. I’m having a hard time writing this clearly, but I wonder if you could get interesting results by:
Training on a wide range of notably excellent papers from “narrow-scoped” domains,
Training on a wide range of papers that explore “we found this worked in X field, and we’re now seeing if it also works in Y field” syntheses,
Then giving GPT-N prompts to synthesize narrow-scoped domains in which that hasn’t been done yet.
You’d get some nonsense, I imagine, but it would probably at least spit out plausible hypotheses for actual testing, eh?
The practical problem with that is probably that you need to manually decide which papers go in which category. GPT needs such an enormous amount of data that any curation has to be automated. Metadata like authors, subject, date, and source website is quite easy to obtain for each example, but really high-level labels like “paper is about applying the methods of field X in field Y” are really hard to assign automatically.
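As a rough sketch of the automatable part, assuming each scraped document arrives as a dict with optional metadata keys, tagging could look like the code below. The field names and the `build_tags` helper are hypothetical. The high-level "methods of X applied to Y" label is deliberately left out, since nothing in the cheap metadata supplies it.

```python
# Metadata fields assumed to be cheaply available from the scrape (hypothetical names).
CHEAP_FIELDS = ("authors", "subject", "date", "source_site")


def build_tags(record: dict) -> str:
    """Turn whichever cheap metadata fields are present into a tag prefix."""
    parts = [f"<{key}: {record[key]}>" for key in CHEAP_FIELDS if record.get(key)]
    return " ".join(parts)


record = {"subject": "physics", "date": "2019-05-01", "text": "..."}
print(build_tags(record))
```

Missing fields are simply skipped, so the same function runs over an arbitrarily messy corpus without manual intervention.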