I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77).
Would you mind explaining why this is? It seems to me like random sampling would display double descent too. For example, as you increase model size, at first the extra parameters let you approximate the data better; then you have too many parameters and just start memorizing the data; then, with even more parameters, there are so many functions available that fit the data that the simpler ones win out. Doesn’t this story work just as well for random sampling as it does for SGD?
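To make the comparison concrete, here’s a rough toy sketch (my own construction, not anything from the paper) of one way to check it: a random-features model where “random sampling” means drawing the readout weights from a Gaussian prior conditioned on fitting the training data, compared against the minimum-norm least-squares readout, a standard proxy for what gradient descent from a small initialization finds when only a linear readout is trained. The data, feature map, widths, and prior below are all arbitrary illustrative choices.

```python
# Toy sketch: does model-wise double descent show up under "random sampling"?
# Random ReLU features with a linear readout; "random sampling" = draw the
# readout from a Gaussian prior conditioned on fitting the training set.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-np.pi, np.pi, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + noise * rng.standard_normal(n)
    return x, y

def features(x, W, b):
    # Fixed random ReLU features; only the linear readout gets fit.
    return np.maximum(x @ W.T + b, 0.0)

x_tr, y_tr = make_data(30)
x_te, y_te = make_data(1000, noise=0.0)

for width in [5, 10, 20, 25, 30, 40, 80, 320]:
    W = rng.standard_normal((width, 1))
    b = rng.standard_normal(width)
    Phi_tr, Phi_te = features(x_tr, W, b), features(x_te, W, b)

    # (a) Minimum-norm least-squares readout: what gradient descent from a
    # zero/small init converges to when only the readout is trained.
    w_gd = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]

    # (b) "Random sampling": draw w0 ~ N(0, I) and keep only its component in
    # the null space of Phi_tr, so w_rs fits the training data exactly as well
    # as w_gd but is otherwise a random draw from the Gaussian prior.
    w0 = rng.standard_normal(width)
    w_rs = w_gd + w0 - np.linalg.pinv(Phi_tr) @ (Phi_tr @ w0)

    mse = lambda w: np.mean((Phi_te @ w - y_te) ** 2)
    print(f"width={width:4d}  test MSE: GD-style={mse(w_gd):9.3f}  "
          f"random-sample={mse(w_rs):9.3f}")
```

If the random-sample column’s test error also peaks around width ≈ number of training points and then comes back down, that would be the model-wise double descent shape appearing without any gradient descent in the loop.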
Hmm, I think you’re right. I’m not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I’d agree with him.)
Possibly I was thinking of epochal double descent, but that shouldn’t matter because we’re comparing the final outcome of SGD to random sampling, so epochal double descent doesn’t come into the picture.
OK, thanks!