This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.
However, the term “stochastic gradient descent” does not just mean “gradient descent with noise.” It refers more specifically to mini-batch gradient descent. (See e.g. Wikipedia.)
In mini-batch gradient descent, the “true” fitness[1] function is the expectation of some function L over a data distribution ρ. But you never have access to this function or its gradient. Instead, you draw a finite sample from ρ, compute the mean of ∇L over the sample, and take a step in this direction. The noise comes from the variance of the finite-sample mean as an estimator of the expectation.
The model here is quite different. There is no “data distribution,” and the true fitness function is not an expectation value which we could noisily estimate with sampling. The noise here comes not from a noisy estimate of the gradient, but from a prescribed stochastic relationship (Ps) between the true gradient and the next step.
I don’t think the model in this post behaves like mini-batch gradient descent. Consider a case where we’re doing SGD on a vector x, and two of its components xi,xj have the following properties:
The “true gradient” (the expected gradient over the data distribution) is 0 in the xi and xj directions.
The xi and xj components of the per-example gradient are perfectly (positively) correlated with one another.
If you like, you can think of the per-example gradient as sampling a single number z from a distribution with mean 0, and setting the xi and xj components to aiz and ajz respectively, for some positive constants ai,aj.
When we sample a mini-batch and average over it, these components are simply ai¯z and aj¯z, where ¯z is the average of z over the mini-batch. So the perfect correlation carries over to the mini-batch gradient, and thus to the SGD step. If SGD increases xi, it will always increase xj alongside it (etc.)
However, applying the model from this post to the same case:
Candidate steps are sampled according to Pm, which is radially symmetric. So (e.g.) a candidate step with positive xi and negative xj is just as likely as one with both positive, all else being equal.
The probability of accepting a candidate step depends only on the true gradient[2], which is 0 in the directions of interest. So, the xi and xj components of a candidate step have no effect on its probability of selection.
Thus, the the xi and xj components of the step will be uncorrelated, rather than perfectly correlated as in SGD.
Some other comments:
The descendant-generation process in this post seems very different from the familiar biological cases it’s trying to draw an analogy to.
In biology, “selection” generally involves having more or fewer descendants relative to the population average.
Here, there is always exactly one descendant. “Selection” occurs because we generate (real) descendants by first generating a ghostly “candidate descendant,” comparing it to its parent (or a clone of its parent), possibly rejecting it against the parent and drawing another candidate, etc.
This could be physically implemented in principle, I guess. (Maybe it has been, somewhere?) But I’m not convinced it’s equivalent to any familiar case of biological selection. Nor it is clear to me how close the relationship is, if it’s not equivalence.
The connection drawn here to gradient descent is not exact, even setting aside the stochastic part.
You note that we get a “gradient-dependent learning rate,” essentially because Ps can have all sorts of shapes—we only know that it’s monotonic, which gives us a monotonic relation between step size and gradient norm, but nothing more.
But notably, (S)GD does not have a gradient-dependent learning rate. To call this an equivalence, I’d want to know the conditions under which the learning rate is constant (if this is possible).
It is also is possible this model always corresponds to vanilla GD (i.e. with a constant learning rate), except instead of ascending f, we are ascending some function related to both Ps and f.
This post calls f the “fitness function,” which is not (AFAIK) how the term “fitness” is used evolutionary biology.
Fitness in biology typically means “expected number of descendants” (absolute fitness) or “expected change in population fraction” (relative fitness).
Neither of these have direct analogues here, but they are more conceptually analogous to P(xt+1|xt) than f. The fitness should directly tell you how much more or less of something you should expect in the next generation.
That is, biology-fitness is about what actually happens when we “run the whole model” forward by a timestep, rather than being an isolated component of the model.
(In cases like the replicator equation, there is model component called a “fitness function,” but the name is justified by its relationship to biology-fitness given the full model dynamics.)
Arguably this is just semantics? But if we stop calling f by a suggestive name, it’s no longer clear what importance we should attach to it, if any. We might care about the quantity whose gradient we’re ascending, or about the biology-fitness, but f is not either of those.
I’m using this term here for consistency with the post, though I call it into question later on. “Loss function” or “cost function” would be more standard in SGD.
There is no such thing as a per-example gradient in the model. I’m assuming the “true gradient” from SGD corresponds to ∇f in the model, since the intended analogy seems to be “the model steps look like ascending f plus noise, just like SGD steps look like descending the true loss function plus noise.”
As an initial aside, I wonder if there is a general factor of spherical-cow-trustingness (SCT?) which separates us. I surely have a moderate amount of SCT! This is not a stance which is easily succinctly justified, but for me comes from having seen (and conjured) many successful spherical cows over the years. Have you read MacKay’s Information Theory, Inference, and Learning Algorithms? In the chapter ‘Why have sex?’, he has an even more spherical model of natural selection than mine here, which I (and many others) consider very illuminating. But, it’s so spherical. So spherical. Still works :).
On that note, my models here of evolution seem pretty solid and obviously capturing the spirit, to me. The first one (which I call ‘annealing-style’) is very close to a classic introduction of simulated annealing, a widely-used/considered simple mutate-and-select algorithm. The second one is less rigorously-presented but captures a lot more of the spirit and detail of natural evolutions, including horizontal transfer and a larger population.
This is a much appreciated critique. You’ve clearly engaged with this post in some depth and put effort in to describe some disagreements. I’m not currently moved (beyond the existing stance of the original post, which has a bunch of caveats and future work).
On to some specific responses.
[Of the first evolution model:] the true fitness function is not an expectation value which we could noisily estimate with sampling
No, I think it just is this, the sampling being the stochastic realisation of competition between differently-fit individuals/populations/alleles. If you wanted, you could introduce something like a ‘data distribution’ (e.g. of environmental scenarios) and an expectation over success-rates on that distribution as the true-loss analogue. But you’d just end up with something of the form of my Ps i.e. there’d be some latent ‘true fitness’ or ‘true Elo-across-envs’ (the actual expectation) and sample-based estimation would yield some (locally monotonic) probability of an ‘actually fitter’ instance being measured as such in any given lineup.[1]
a prescribed stochastic relationship (Ps) between the true gradient and the next step.
Mild push-back (I’m not sure exactly what you’re trying to say here): I don’t prescribe a relationship between the gradient and the next step. All I prescribe is that fitter instances are directionally selected/preferred. The gradient comes out of the maths; that’s basically the heart of this post!
Your example of correlated directions is really nice. I agree that SGD in full generality can have this. I also think a less spherical model of mutation+selection would also have this! (In fact lots of empirical and theoretical genetics finds/predicts correlations.)
Go back to my ‘distribution of environment scenarios’ lower-level model. Certainly this could induce correlations if we wanted it to, for basically the same reasons as in your example.
On descendant-generation and population, I refer at first to my spherical-cow-trustingness. To me it obviously captures the spirit! But in any case, I think the details are closer than you seem to allow.
Here, there is always exactly one descendant.
This is only true of one semantic, for the first model. The first model also has another, population-wise semantic. ‘Instead of a parent individual and cloned and mutated offspring individuals, considering parent and offspring populations, the same reasoning and proof is immediately applicable if we assume that mutations arise sufficiently rarely to be either fixed or lost before the next mutation arises… Of course this is not true for real natural selection.’
In the second model there is a population of different individuals with different genomes, consisting of a ‘fixed’ part and a set of ‘in flight’ (not yet fixed or extinct) mutations. This is a less-spherical cow, but I think captures even more the detail and spirit of real natural selection.
On learning rates, I appreciate this remark. I called this out in the original as something for further research, but (for SCT reasons) I don’t consider it especially important/urgent.
(S)GD does not have a gradient-dependent learning rate. To call this an equivalence, I’d want to know the conditions under which the learning rate is constant (if this is possible).
As I said elsewhere, any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
Now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family. But I would also be interested in further analysis exploring the details of (various models of) natural selection from a ‘learning rate’ POV.
Of course, in practical SGD, there are also often learning-rate schedules, including decay but also more wacky things. Would be interesting (but not especially loadbearing) to learn more about how natural selection handles this (I’ve heard it suggested that mutation-rate, among other things, is meta-selected for adaptability, which seems eminently plausible).
This post calls f the “fitness function,” which is not (AFAIK) how the term “fitness” is used evolutionary biology.
True, usually. But in Darwin’s day, ‘fitness’ just meant ‘how fitted are you to the environment?’ i.e. ‘what’s your Elo rating here?’, which has later been operationalised in various ways, usually accounting for number or proportion of descendants (either at a genetic unit level or at an inclusive genetic level for organisms).
We might care about the quantity whose gradient we’re ascending, or about the biology-fitness, but f is not either of those.
f is the quantity whose gradient we’re ascending. And it’s biology-fitness in the pre-operationalisation-as-fecundity/offspring sense.
Please let me know if I’ve missed anything that seems important to you—these are longish comments.
e.g. imagine characterising the sample-based loss estimate via mean and variance. For now, assume normal and homoscedastic (more spherical cows!). Is my claim about a monotonic selection prob arising from this then basically clear? There’s some prob of ‘accidentally’ getting samples which invert the fitness order, but this prob decreases as the true fitness difference increases. (For non-normal or heteroscedastic, this still holds locally, but the coefficients will be different in different regions.)
I no longer feel like I know what claim you’re advancing.
In this post and in recent comments (1, 2), you’ve said that you have proven an “equivalence” between selection and SGD under certain conditions. You’ve used phrases like “mathematically equivalent” and “a proof of equivalence.”
When someone tells me they have a mathematical proof that two things are equivalent, I read the word “equivalent” to mean “really, totally, exactly the same (under the right conditions).” I think this is a pretty standard way to interpret this kind of language. And in my critique, I argued—I think successfully—against this kind of equivalence.
From your latest comments, it appears that you are not claiming anything this strong. But now I don’t know what you are claiming. For instance, you write here:
On the distribution of noise, I’ll happily acknowledge that I didn’t show equivalence. I half expect that one could be eked out at a stretch, but I also think this is another minor and unimportant detail.
Details that are “unimportant” in one context, or for making a particular argument, may be very important in some other context or argument[1]. To take one example:
As I said elsewhere, any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
This all depends on “the conclusions we want to draw.” I don’t know which conclusions you want to draw, and surely the difference between these optimizers is not always irrelevant.
If I am, say, training a neural net, then I am definitely not going to be indifferent between Adam and SGD—Adam is way faster! And even if you don’t care about speed, just about asymptotic performance, the two find solutions with different characteristics, cf. the literature on the “adaptive optimization gap.”
The profusion of different SGD-like optimizers is not evidence that the differences between them don’t matter. It’s the opposite: the differences matter a lot, and that’s why people keep coming up with new variants. If all such optimizers were about the same, there’d be no reason to make new ones.
Or, consider that the analogy only holds for infinitesimal step size, on both sides[2]. Yet the practical efficacy of SGD has much to do with the fact that it works fine even at large step sizes, up to some breaking point.
Until we tie down what we’re trying to do, I don’t know how to assess whether any of these distinctions are (ir)relevant or (un)important.
On another note, the fact that the gradient “comes out of the maths” does not seem very noteworthy to me. It’s inevitable from the problem setup: everything is rotationally symmetric except f, and we’re restricted to function evaluations inside an ϵ-ball, so we can’t feel the higher-order derivatives of f. As long as we’re not analytically computing any of those higher derivatives, we should expect any direction appearing in the result to be a function of the gradient—it’s the only source of directionality available, the only thing that breaks the symmetry[3].
Here the function of the gradient is just the identity, but I expect you’d still deem it “acceptable” if it were not, since it would still fall into the “class” of optimizers that includes Adam etc. (For Adam, the function involves the gradient and the buffers computed from it, which in turn require the introduction of a privileged basis.)
By these standards, what optimizer isn’t equivalent to gradient descent? Optimizers that analytically compute higher-order derivatives, that’s all I can come up with. Which is a strange place to draw a line in the sand, IMO: “all 0th and 1st order optimizers are the same, but 2nd+ order optimizers are different.” Surely there are lots of important differences between different 0th and 1st order optimizers?
The nice thing about a true equivalence result is that it’s not context-dependent like this. All the details match, all at once, so we can readily apply the result in whatever context we like without having to check whether this context cares about some particular nuance that didn’t transfer.
We need this limit because the SGD step never depends on higher-order derivatives, no matter the step size. But a guess-and-check optimizer like selection can feel higher-order derivatives at finite step sizes.
Likewise, we should expect a scalar like the step size to be some function of the gradient magnitude, since there aren’t any other scalars in the problem (again, ignoring higher-order derivatives). I guess there’s f itself, but it seems natural to “quotient out” that degree of freedom by requiring the optimizer to behave the same way for f and f+c.
This post introduces a model, and shows that it behaves sort of like a noisy version of gradient descent.
However, the term “stochastic gradient descent” does not just mean “gradient descent with noise.” It refers more specifically to mini-batch gradient descent. (See e.g. Wikipedia.)
In mini-batch gradient descent, the “true” fitness[1] function is the expectation of some function L over a data distribution ρ. But you never have access to this function or its gradient. Instead, you draw a finite sample from ρ, compute the mean of ∇L over the sample, and take a step in this direction. The noise comes from the variance of the finite-sample mean as an estimator of the expectation.
The model here is quite different. There is no “data distribution,” and the true fitness function is not an expectation value which we could noisily estimate with sampling. The noise here comes not from a noisy estimate of the gradient, but from a prescribed stochastic relationship (Ps) between the true gradient and the next step.
I don’t think the model in this post behaves like mini-batch gradient descent. Consider a case where we’re doing SGD on a vector x, and two of its components xi,xj have the following properties:
The “true gradient” (the expected gradient over the data distribution) is 0 in the xi and xj directions.
The xi and xj components of the per-example gradient are perfectly (positively) correlated with one another.
If you like, you can think of the per-example gradient as sampling a single number z from a distribution with mean 0, and setting the xi and xj components to aiz and ajz respectively, for some positive constants ai,aj.
When we sample a mini-batch and average over it, these components are simply ai¯z and aj¯z, where ¯z is the average of z over the mini-batch. So the perfect correlation carries over to the mini-batch gradient, and thus to the SGD step. If SGD increases xi, it will always increase xj alongside it (etc.)
However, applying the model from this post to the same case:
Candidate steps are sampled according to Pm, which is radially symmetric. So (e.g.) a candidate step with positive xi and negative xj is just as likely as one with both positive, all else being equal.
The probability of accepting a candidate step depends only on the true gradient[2], which is 0 in the directions of interest. So, the xi and xj components of a candidate step have no effect on its probability of selection.
Thus, the the xi and xj components of the step will be uncorrelated, rather than perfectly correlated as in SGD.
Some other comments:
The descendant-generation process in this post seems very different from the familiar biological cases it’s trying to draw an analogy to.
In biology, “selection” generally involves having more or fewer descendants relative to the population average.
Here, there is always exactly one descendant. “Selection” occurs because we generate (real) descendants by first generating a ghostly “candidate descendant,” comparing it to its parent (or a clone of its parent), possibly rejecting it against the parent and drawing another candidate, etc.
This could be physically implemented in principle, I guess. (Maybe it has been, somewhere?) But I’m not convinced it’s equivalent to any familiar case of biological selection. Nor it is clear to me how close the relationship is, if it’s not equivalence.
The connection drawn here to gradient descent is not exact, even setting aside the stochastic part.
You note that we get a “gradient-dependent learning rate,” essentially because Ps can have all sorts of shapes—we only know that it’s monotonic, which gives us a monotonic relation between step size and gradient norm, but nothing more.
But notably, (S)GD does not have a gradient-dependent learning rate. To call this an equivalence, I’d want to know the conditions under which the learning rate is constant (if this is possible).
It is also is possible this model always corresponds to vanilla GD (i.e. with a constant learning rate), except instead of ascending f, we are ascending some function related to both Ps and f.
This post calls f the “fitness function,” which is not (AFAIK) how the term “fitness” is used evolutionary biology.
Fitness in biology typically means “expected number of descendants” (absolute fitness) or “expected change in population fraction” (relative fitness).
Neither of these have direct analogues here, but they are more conceptually analogous to P(xt+1|xt) than f. The fitness should directly tell you how much more or less of something you should expect in the next generation.
That is, biology-fitness is about what actually happens when we “run the whole model” forward by a timestep, rather than being an isolated component of the model.
(In cases like the replicator equation, there is model component called a “fitness function,” but the name is justified by its relationship to biology-fitness given the full model dynamics.)
Arguably this is just semantics? But if we stop calling f by a suggestive name, it’s no longer clear what importance we should attach to it, if any. We might care about the quantity whose gradient we’re ascending, or about the biology-fitness, but f is not either of those.
I’m using this term here for consistency with the post, though I call it into question later on. “Loss function” or “cost function” would be more standard in SGD.
There is no such thing as a per-example gradient in the model. I’m assuming the “true gradient” from SGD corresponds to ∇f in the model, since the intended analogy seems to be “the model steps look like ascending f plus noise, just like SGD steps look like descending the true loss function plus noise.”
As an initial aside, I wonder if there is a general factor of spherical-cow-trustingness (SCT?) which separates us. I surely have a moderate amount of SCT! This is not a stance which is easily succinctly justified, but for me comes from having seen (and conjured) many successful spherical cows over the years. Have you read MacKay’s Information Theory, Inference, and Learning Algorithms? In the chapter ‘Why have sex?’, he has an even more spherical model of natural selection than mine here, which I (and many others) consider very illuminating. But, it’s so spherical. So spherical. Still works :).
On that note, my models here of evolution seem pretty solid and obviously capturing the spirit, to me. The first one (which I call ‘annealing-style’) is very close to a classic introduction of simulated annealing, a widely-used/considered simple mutate-and-select algorithm. The second one is less rigorously-presented but captures a lot more of the spirit and detail of natural evolutions, including horizontal transfer and a larger population.
This is a much appreciated critique. You’ve clearly engaged with this post in some depth and put effort in to describe some disagreements. I’m not currently moved (beyond the existing stance of the original post, which has a bunch of caveats and future work).
On to some specific responses.
No, I think it just is this, the sampling being the stochastic realisation of competition between differently-fit individuals/populations/alleles. If you wanted, you could introduce something like a ‘data distribution’ (e.g. of environmental scenarios) and an expectation over success-rates on that distribution as the true-loss analogue. But you’d just end up with something of the form of my Ps i.e. there’d be some latent ‘true fitness’ or ‘true Elo-across-envs’ (the actual expectation) and sample-based estimation would yield some (locally monotonic) probability of an ‘actually fitter’ instance being measured as such in any given lineup.[1]
Mild push-back (I’m not sure exactly what you’re trying to say here): I don’t prescribe a relationship between the gradient and the next step. All I prescribe is that fitter instances are directionally selected/preferred. The gradient comes out of the maths; that’s basically the heart of this post!
Your example of correlated directions is really nice. I agree that SGD in full generality can have this. I also think a less spherical model of mutation+selection would also have this! (In fact lots of empirical and theoretical genetics finds/predicts correlations.)
Go back to my ‘distribution of environment scenarios’ lower-level model. Certainly this could induce correlations if we wanted it to, for basically the same reasons as in your example.
On descendant-generation and population, I refer at first to my spherical-cow-trustingness. To me it obviously captures the spirit! But in any case, I think the details are closer than you seem to allow.
This is only true of one semantic, for the first model. The first model also has another, population-wise semantic. ‘Instead of a parent individual and cloned and mutated offspring individuals, considering parent and offspring populations, the same reasoning and proof is immediately applicable if we assume that mutations arise sufficiently rarely to be either fixed or lost before the next mutation arises… Of course this is not true for real natural selection.’
In the second model there is a population of different individuals with different genomes, consisting of a ‘fixed’ part and a set of ‘in flight’ (not yet fixed or extinct) mutations. This is a less-spherical cow, but I think captures even more the detail and spirit of real natural selection.
On learning rates, I appreciate this remark. I called this out in the original as something for further research, but (for SCT reasons) I don’t consider it especially important/urgent.
As I said elsewhere, any number of practical departures from pure SGD mess with the step size anyway (and with the gradient!) so this feels like asking for too much. Do we really think SGD vs momentum vs Adam vs … is relevant to the conclusions we want to draw? (Serious question; my best guess is ‘no’, but I hold that medium-lightly.)
Now think about the space including SGD, RMSProp, etc. The ‘depends on gradient norm’ piece which arises from my evolution model seems entirely at home in that family. But I would also be interested in further analysis exploring the details of (various models of) natural selection from a ‘learning rate’ POV.
Of course, in practical SGD, there are also often learning-rate schedules, including decay but also more wacky things. Would be interesting (but not especially loadbearing) to learn more about how natural selection handles this (I’ve heard it suggested that mutation-rate, among other things, is meta-selected for adaptability, which seems eminently plausible).
True, usually. But in Darwin’s day, ‘fitness’ just meant ‘how fitted are you to the environment?’ i.e. ‘what’s your Elo rating here?’, which has later been operationalised in various ways, usually accounting for number or proportion of descendants (either at a genetic unit level or at an inclusive genetic level for organisms).
f is the quantity whose gradient we’re ascending. And it’s biology-fitness in the pre-operationalisation-as-fecundity/offspring sense.
Please let me know if I’ve missed anything that seems important to you—these are longish comments.
e.g. imagine characterising the sample-based loss estimate via mean and variance. For now, assume normal and homoscedastic (more spherical cows!). Is my claim about a monotonic selection prob arising from this then basically clear? There’s some prob of ‘accidentally’ getting samples which invert the fitness order, but this prob decreases as the true fitness difference increases. (For non-normal or heteroscedastic, this still holds locally, but the coefficients will be different in different regions.)
I no longer feel like I know what claim you’re advancing.
In this post and in recent comments (1, 2), you’ve said that you have proven an “equivalence” between selection and SGD under certain conditions. You’ve used phrases like “mathematically equivalent” and “a proof of equivalence.”
When someone tells me they have a mathematical proof that two things are equivalent, I read the word “equivalent” to mean “really, totally, exactly the same (under the right conditions).” I think this is a pretty standard way to interpret this kind of language. And in my critique, I argued—I think successfully—against this kind of equivalence.
From your latest comments, it appears that you are not claiming anything this strong. But now I don’t know what you are claiming. For instance, you write here:
Details that are “unimportant” in one context, or for making a particular argument, may be very important in some other context or argument[1]. To take one example:
This all depends on “the conclusions we want to draw.” I don’t know which conclusions you want to draw, and surely the difference between these optimizers is not always irrelevant.
If I am, say, training a neural net, then I am definitely not going to be indifferent between Adam and SGD—Adam is way faster! And even if you don’t care about speed, just about asymptotic performance, the two find solutions with different characteristics, cf. the literature on the “adaptive optimization gap.”
The profusion of different SGD-like optimizers is not evidence that the differences between them don’t matter. It’s the opposite: the differences matter a lot, and that’s why people keep coming up with new variants. If all such optimizers were about the same, there’d be no reason to make new ones.
Or, consider that the analogy only holds for infinitesimal step size, on both sides[2]. Yet the practical efficacy of SGD has much to do with the fact that it works fine even at large step sizes, up to some breaking point.
Until we tie down what we’re trying to do, I don’t know how to assess whether any of these distinctions are (ir)relevant or (un)important.
On another note, the fact that the gradient “comes out of the maths” does not seem very noteworthy to me. It’s inevitable from the problem setup: everything is rotationally symmetric except f, and we’re restricted to function evaluations inside an ϵ-ball, so we can’t feel the higher-order derivatives of f. As long as we’re not analytically computing any of those higher derivatives, we should expect any direction appearing in the result to be a function of the gradient—it’s the only source of directionality available, the only thing that breaks the symmetry[3].
Here the function of the gradient is just the identity, but I expect you’d still deem it “acceptable” if it were not, since it would still fall into the “class” of optimizers that includes Adam etc. (For Adam, the function involves the gradient and the buffers computed from it, which in turn require the introduction of a privileged basis.)
By these standards, what optimizer isn’t equivalent to gradient descent? Optimizers that analytically compute higher-order derivatives, that’s all I can come up with. Which is a strange place to draw a line in the sand, IMO: “all 0th and 1st order optimizers are the same, but 2nd+ order optimizers are different.” Surely there are lots of important differences between different 0th and 1st order optimizers?
The nice thing about a true equivalence result is that it’s not context-dependent like this. All the details match, all at once, so we can readily apply the result in whatever context we like without having to check whether this context cares about some particular nuance that didn’t transfer.
We need this limit because the SGD step never depends on higher-order derivatives, no matter the step size. But a guess-and-check optimizer like selection can feel higher-order derivatives at finite step sizes.
Likewise, we should expect a scalar like the step size to be some function of the gradient magnitude, since there aren’t any other scalars in the problem (again, ignoring higher-order derivatives). I guess there’s f itself, but it seems natural to “quotient out” that degree of freedom by requiring the optimizer to behave the same way for f and f+c.