That’s within-training, by epoch/iteration, not across trained models by total size/compute. It’s not clear that they are the same sort of thing at all, because you can get spikes trivially from things like the learning rate dropping. Investigating whether there is any connection would be interesting.