Glad that you liked my answer! Regarding my suggestion of synthetic data usage, upon further reflection I think it is plausible that it could be either a very positive thing or a very negative thing, depending on exactly how the model generalizes out-of-distribution. It also now strikes me that synthetic data provides a wonderful opportunity to study (some of) their out-of-distribution properties even today – it is kind of hard to test out-of-distribution behavior of internet-text-trained LLMs because they've seen everything, but with a model trained on synthetic data it should be much more doable.
Or, if not purely synthetic data, how about data deliberately filtered in weird ways? Could you train a model where information about a specific scientific theory has been thoroughly censored from the training data? Then in testing you could give it the information which was available to the human originators of the now-accepted scientific theory, and see if the model can figure out the puzzle. Be concerned if it does better than you expect.
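As a minimal sketch of the censoring step: one simple (if crude) approach is to drop every training document that mentions the target theory by name or via closely associated terms. The term list, corpus format, and example theory (plate tectonics) here are all assumptions for illustration, not a worked-out filtering pipeline:

```python
import re

# Hypothetical term list for censoring one theory (plate tectonics).
# A real experiment would need a much broader list, plus checks for
# indirect leakage (paraphrases, translations, derived claims).
CENSOR_TERMS = [
    "plate tectonics",
    "continental drift",
    "seafloor spreading",
    "Wegener",
]

_pattern = re.compile(
    "|".join(re.escape(t) for t in CENSOR_TERMS), re.IGNORECASE
)

def is_clean(document: str) -> bool:
    """True if the document mentions none of the censored terms."""
    return _pattern.search(document) is None

def filter_corpus(documents):
    """Drop every document that touches the censored theory.

    Dropping whole documents (rather than redacting spans) avoids
    leaving telltale gaps in otherwise-familiar text that the model
    might learn from.
    """
    return [doc for doc in documents if is_clean(doc)]

corpus = [
    "Alfred Wegener proposed continental drift in 1912.",
    "Fossils of the same species appear on both sides of the Atlantic.",
    "Seafloor spreading was confirmed by magnetic striping.",
]
print(filter_corpus(corpus))
```

Note that the one surviving document in this toy example (the fossil-distribution observation) is exactly the kind of pre-theory evidence you would then hand the model at test time to see whether it reinvents the censored theory.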