instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.
I don’t follow this point. If we’re talking about using SGD to update the (hyper)parameters, using a batch of images from the dataset(s) currently in use, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.
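To make the picture I have in mind explicit, here is a minimal toy sketch (my own illustration, nothing from the post) of a single SGD step whose update is computed entirely from the current batch:

```python
import numpy as np

# Toy illustration: one SGD step on a tiny linear model. The update
# direction is a function of the current batch only; no other images
# enter the computation.
rng = np.random.default_rng(0)
w = rng.normal(size=10)               # the parameters being updated
x_batch = rng.normal(size=(32, 10))   # current batch of flattened "images"
y_batch = rng.normal(size=32)         # current batch of targets

def batch_loss_grad(w, x, y):
    """Gradient of the mean squared error on this batch alone."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

lr = 0.1
w = w - lr * batch_loss_grad(w, x_batch, y_batch)   # the whole update
```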
To keep it simple, assume the hyperparameters are updated by evolutionary algorithm or some similar search-then-continue-or-stop process.
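For concreteness, here is roughly the kind of loop I mean, as a hedged sketch; train_and_evaluate is a hypothetical stand-in for however a candidate hyperparameter setting would actually be scored:

```python
import random

def train_and_evaluate(learning_rate):
    # Hypothetical placeholder fitness: in reality this would train a model
    # with the given hyperparameter and return (negative) validation error.
    return -abs(learning_rate - 0.1)

# Tiny evolutionary search over one hyperparameter (a learning rate).
population = [random.uniform(0.001, 1.0) for _ in range(8)]
for generation in range(20):
    ranked = sorted(population, key=train_and_evaluate, reverse=True)
    if train_and_evaluate(ranked[0]) > -1e-3:    # "stop" branch of continue-or-stop
        break
    survivors = ranked[:4]                       # otherwise continue with the fitter half
    children = [lr * random.uniform(0.8, 1.25)   # mutate the survivors
                for lr in survivors]
    population = survivors + children

best = max(population, key=train_and_evaluate)
```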
I want to flag that, in the case of evolutionary algorithms, we should not assume here that the fitness function is defined with respect to just the current batch of images; rather, it should be defined with respect to, say, all images seen so far (since the beginning of the entire training process). Otherwise the selection pressure is “myopic”, i.e. models that outperform others on the current batch of images have higher fitness.
(I might be over-pedantic about this topic due to previously being very confused about it.)
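To spell out the contrast I mean, here is a toy sketch; the evaluate argument and both fitness helpers are hypothetical names of my own, not anything from the post or the parent comment:

```python
def myopic_fitness(candidate, current_batch, evaluate):
    # Selection pressure comes only from the batch currently in use.
    return evaluate(candidate, current_batch)

def cumulative_fitness(candidate, batches_so_far, evaluate):
    # Selection pressure comes from the whole training history to date:
    # average the candidate's score over every batch seen since the start.
    scores = [evaluate(candidate, batch) for batch in batches_so_far]
    return sum(scores) / len(scores)
```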