Yeah I think there’s not a benefit to being fancy. Except maybe if you can actively sample datapoints the model is most confused about—but even then is it worth it with actual GPT to stop and update before generating new samples? I think the more parameters you have, the less doing this makes sense, because your finetuning only has to move a short distance in a very high-dimensional space.
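For concreteness, here's a minimal sketch of what "actively sample the datapoints the model is most confused about" could look like: score each candidate in the question pool by the model's own language-modeling loss and pick the highest-loss ones for the next finetuning batch. This assumes a HuggingFace-style causal LM; the model name, pool contents, and batch size are placeholders, not part of the original setup.

```python
# Sketch: uncertainty-based active sampling for the next finetuning batch.
# Rank candidates by the model's own LM loss (higher = more "confused").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in a larger model (e.g. GPT-J) if resources allow
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical self-supervision question pool.
candidate_pool = [
    "Q: ... A: ...",
    "Q: ... A: ...",
]

def sample_loss(text: str) -> float:
    # LM loss of the model on the full text; used as a confusion score.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Take the k highest-loss examples as the next batch, then finetune and repeat.
k = 8
next_batch = sorted(candidate_pool, key=sample_loss, reverse=True)[:k]
```

The open question from above is whether the extra stop-score-update loop buys anything over just finetuning on the whole pool, especially at large parameter counts.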
This raises a question I don’t have an intuition for, though, which is how big a divergence from GPT you get if you train from scratch while trying to enforce this sort of self-supervised constraint.
I would imagine that if you have a limited question pool used for self-supervision, then applying this constraint while training from scratch would result in overfitting and worse generalization (but I’m not super confident in this, and there might be decent ways to avoid it).
If the question pool is very large/generated or the constraint is generally enforced on text generation (I’m not sure this makes much sense), then this might do something interesting.
I don’t have the resources to run an experiment like this at the moment (particularly not with a very large model like GPT-J).