Imagine there was a bijection between model parameters and resulting function. (I’m aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.
AFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (conversely maximize disagreement for pairs that are different). That in mind, there are various papers (e.g.) that explore the possibility of “collapsed” solutions like the one you mentioned (where both networks are learning the same mapping, such that there’s less benefit to propagating any examples through two networks), which makes this something that we want to minimize. In practice, though, this has been found to occur rarely (c.f. [1]).
Nonetheless, since reading Paul’s statement about the problem of the instrumental model, I’ve been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, data limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to “maximize agreement” with a set of non-trivial observations/facts that are guaranteed to be more “objective” (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one often turn out to be good regularizers). So far I haven’t had any promising proposals come to light for generative LM.
I am still holding onto the thought, given the remote possibility that all of my above assumptions are correct, and also because “generative models” might reflect the ideal approach to unsupervised learning, whereas “contrastive learning” is sometimes seen as a sort of compromise since (unlike generative models) it’s amenable to limited compute [2].
It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I’m realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.