I think a point of no return exists only if you restrict yourself to small LRs. If you can use any LR (or any LR schedule), then you can definitely jump out of the loss basin: you could imagine just choosing a really large LR to basically reset to a random init and then starting again.
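To make the "really large LR" point concrete, here's a minimal PyTorch sketch (the model, data, and LR value are all illustrative, not from any particular setup): a single SGD step with an absurdly large LR moves the weights so far from the pretrained solution that it's effectively destroyed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 1)                      # stand-in for a pretrained model
pretrained = model.weight.detach().clone()

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

with torch.no_grad():
    for p in model.parameters():
        p -= 1e6 * p.grad                     # one step with an absurdly large LR

# Relative distance from the pretrained weights; >> 1 means we've left
# the neighbourhood of the original solution entirely.
drift = (model.weight - pretrained).norm() / pretrained.norm()
print(f"relative weight drift: {drift:.1e}")
```

Strictly speaking this lands you somewhere along the (hugely scaled) gradient direction rather than at a literal random init, but the effect is the same: you're far outside the original basin.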
I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.