Not easily detected. As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.
Not easily detected. As in, there might be a sudden (in SGD steps) change in the internal structure of the network over training that is not easily visible in the loss or other metrics that you would normally track. If you think of the loss as an average over performance on many thousands of subtasks, a change in internal structure (e.g. a circuit appearing in a phase transition) relevant to one task may not change the loss much.