In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
Sorry for taking so long to respond to this one. I don’t get the last step in your argument: why do those models outperform? I think you must be imagining a different setup, but I’m interpreting your setup as follows (a toy code sketch appears after the list):
This is a classification problem, so we’re getting feedback on the correct label Y for some input X.
It’s online, so we’re doing this in sequence and learning after each example.
We keep a population of models, which we update (perhaps only a little) after every training example; population members who predicted the label correctly get a chance to reproduce, and a few population members who didn’t are killed off.
The overall prediction made by the system is the average of all the predictions (or some other aggregation).
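Here is a minimal runnable sketch of that setup, as I understand it. Everything concrete in it (the linear model, the population size, the labeling rule, the mutation scale, the majority-vote aggregation) is an illustrative assumption of mine, not something from your description:

```python
import random

POP_SIZE = 50        # number of models kept in the population (assumed)
N_PARAMS = 10        # each model is a parameter vector theta[0..9]
KILL_PER_STEP = 5    # how many incorrect members are killed off per example
MUTATION_SCALE = 0.05

def make_model():
    return [random.uniform(-1, 1) for _ in range(N_PARAMS)]

def predict(theta, x):
    # Toy linear classifier: predict label 1 iff the weighted sum is positive.
    return 1 if sum(t * xi for t, xi in zip(theta, x)) > 0 else 0

def system_prediction(population, x):
    # "The overall prediction made by the system is the average of all the
    # predictions (or some other aggregation)" -- here, a majority vote.
    votes = sum(predict(m, x) for m in population)
    return 1 if 2 * votes > len(population) else 0

def evolve_step(population, x, y):
    # Members that predicted the label correctly get a chance to reproduce;
    # a few members that didn't are killed off.
    correct = [i for i, m in enumerate(population) if predict(m, x) == y]
    wrong = [i for i, m in enumerate(population) if predict(m, x) != y]
    if not correct or not wrong:
        return population  # nothing to select on this step
    killed = set(random.sample(wrong, min(KILL_PER_STEP, len(wrong))))
    survivors = [m for i, m in enumerate(population) if i not in killed]
    for _ in range(len(killed)):
        parent = population[random.choice(correct)]
        child = [t + random.gauss(0, MUTATION_SCALE) for t in parent]
        survivors.append(child)
    return survivors

# Online loop: one (x, y) example at a time, updating after each.
population = [make_model() for _ in range(POP_SIZE)]
for step in range(1000):
    x = [random.uniform(-1, 1) for _ in range(N_PARAMS)]
    y = 1 if x[0] > 0 else 0  # some fixed ground-truth labeling rule (assumed)
    y_hat = system_prediction(population, x)
    population = evolve_step(population, x, y)

avg_theta8 = sum(m[8] for m in population) / POP_SIZE
print(f"average theta[8] after training: {avg_theta8:.3f}")
```

Note that selection here only ever looks at the current example; nothing in `evolve_step` can credit a member for making a *future* example easier, which is the crux of what follows.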
The influence of high θ8 at one time-step is to produce predictions which make the next time-step easier.
So, if the population has an abundance of high θ8 at one time-step, the population overall does better in the next time-step, because it’s easier for everyone to predict.
So, the frequency of high θ8 will not be increased at all. Just like in gradient descent, there’s no point at which the relevant population members are specifically rewarded.
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
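One way to make the “no specific reward” point precise (this framing is my addition, not part of the original exchange) is the Price equation for the change in the population mean of θ8 over one selection step:

$$\Delta\bar{\theta}_8 \;=\; \frac{\operatorname{Cov}(w_i,\,\theta_{8,i})}{\bar{w}} \;+\; \mathbb{E}\!\left[\frac{w_i\,\Delta\theta_{8,i}}{\bar{w}}\right]$$

where $w_i$ is member $i$’s reproductive success and the second term covers changes (e.g. mutation) during reproduction. If high θ8 at one step only makes the *next* example easier for the whole population, then $w_i$ does not covary with $\theta_{8,i}$: the covariance term is zero, and selection leaves the population mean of θ8 unchanged, which is exactly the free-rider point above.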
Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).
Backtracking from this, I now realize that the core reasoning behind my argument (that evolutionary computation algorithms produce non-myopic behavior by default) was incorrect.
(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms, which seems incorrect!)