If memory serves, with BYOL you use the current encoder's representation of an input x1 to predict the representation of a related input x2, but the representation of x2 comes from an old version of the encoder (a slowly updated moving average of the current one). So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting the outputs of a non-collapsed past encoder ensures that the current encoder you learn will also be non-collapsed.
(Mostly my point is that there are specific algorithmic reasons to expect that you don't get the collapsed solutions; it isn't just a tendency of neural nets to avoid collapsed solutions.)
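In case it helps make the mechanism concrete, here's a rough PyTorch sketch of the BYOL-style setup, as I remember it. This is a minimal toy version with made-up MLP modules; I'm omitting BYOL's projection heads and augmentation pipeline, and all the names here are placeholders rather than anything from the actual codebase. The key point is that the x2 branch goes through a frozen moving-average copy of the encoder, so the online encoder is always chasing a target produced by a slightly older network:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy online encoder and predictor (placeholder architectures).
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
predictor = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

# The target encoder starts as a copy of the online encoder and is only
# ever updated as an exponential moving average, never by gradients.
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

def byol_loss(x1, x2):
    # Online branch: current encoder + predictor on view x1.
    online = F.normalize(predictor(encoder(x1)), dim=-1)
    # Target branch: the "old" (EMA) encoder on view x2, with gradients
    # stopped, so the online network is trained to match a slowly moving,
    # non-collapsed target rather than a target it can drag to a constant.
    with torch.no_grad():
        target = F.normalize(target_encoder(x2), dim=-1)
    # For unit vectors, ||online - target||^2 = 2 - 2 * cosine similarity.
    return (2 - 2 * (online * target).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(tau=0.99):
    # Slowly move the target encoder toward the current online encoder.
    for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)

# Usage: two "views" of a batch; gradients flow only into encoder + predictor.
x1, x2 = torch.randn(8, 32), torch.randn(8, 32)
loss = byol_loss(x1, x2)
loss.backward()
ema_update()
```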
But now I'm realizing that this might no longer fit the definition of contrastive learning (it's instead just ordinary self-supervised learning), so I apologize for the error/confusion in that case.
No worries, I think it’s still a relevant example for thinking about “collapsed” solutions.