I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) “informal strategic stuff” (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.
The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version you may have responded to: I want it to be only about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning on past stock prices or RL for Atari games (when each episode is a new game).
My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolute myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).
I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.
I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I’ll use the following example to explain why I think that:
Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance); however, if parameter θ8 has a large value, then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ8 does not (directly) affect predictions.
How should we expect our learning algorithm to update the parameter θ8 at the end of iteration 2?
If our learning algorithm is gradient descent, it seems that we should NOT expect θ8 to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ8) is expected to be positive.
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
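To make the gradient-descent half of this comparison concrete, here is a minimal numerical sketch (the functional forms, sizes, and numbers are my own toy assumptions; only the dependence structure matters): because the iteration-2 prediction does not depend on θ8, the partial derivative of the iteration-2 loss with respect to θ8 is exactly zero, so gradient descent receives no signal to increase it.

```python
# Minimal sketch of the two-iteration setup above (hypothetical functional
# forms; only the dependence structure matters).
#
# Assumptions (mine, not from the original comment):
#   - theta is a small parameter vector; index 8 plays the role of "theta_8".
#   - In iteration 1 the labels are random, so the expected gradient w.r.t.
#     every parameter is zero there.
#   - In iteration 2 the prediction depends on theta[0..7] but NOT on theta[8],
#     even though theta[8] is what made iteration 2's example more predictable.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=9)          # toy parameter vector; theta[8] is "theta_8"

def predict_iter2(theta, x):
    """Iteration-2 prediction: by assumption it ignores theta[8]."""
    return np.tanh(theta[:8] @ x)   # theta[8] never enters the computation

def loss_iter2(theta, x, y):
    return (predict_iter2(theta, x) - y) ** 2

# Finite-difference estimate of d(loss_2)/d(theta_8):
x2, y2 = rng.normal(size=8), 1.0    # some iteration-2 example
eps = 1e-6
bumped = theta.copy()
bumped[8] += eps
grad_theta8 = (loss_iter2(bumped, x2, y2) - loss_iter2(theta, x2, y2)) / eps

print(grad_theta8)  # ~0.0: gradient descent gets no signal to increase theta_8
```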
Sorry for taking so long to respond to this one.
I don’t get the last step in your argument:
Why do those models outperform? I think you must be imagining a different setup, but I’m interpreting your setup as:
This is a classification problem, so, we’re getting feedback on correct labels X for some Y.
It’s online, so we’re doing this in sequence, and learning after each.
We keep a population of models, which we update (perhaps only a little) after every training example; population members who predicted the label correctly get a chance to reproduce, and a few population members who didn’t are killed off.
The overall prediction made by the system is the average of all the predictions (or some other aggregation).
A large θ8 at one time-step will cause predictions which make the next time-step easier.
So, if the population has an abundance of high θ8 at one time step, the population overall does better in the next time step, because it’s easier for everyone to predict.
So, the frequency of high θ8 will not be increased at all. Just like in gradient descent, there’s no point at which the relevant population members are specifically rewarded.
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
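To illustrate this free-riding point, here is a toy simulation of the population setup as described above (a stylized, repeated version; the population size, selection rule, and functional forms are all my own assumptions): each member's chance of being selected depends on its own skill and on a difficulty level shared by the whole population, while a member's θ8 affects only that shared difficulty at the next step. Because the benefit is shared, selection is blind to θ8, and its population average should only drift.

```python
# Toy simulation of the population setup (all specifics are my own assumptions,
# chosen only to match the dependence structure):
#   - Each member has a "skill" parameter (affects its own predictions) and a
#     theta8 parameter (does NOT affect its own predictions, but a high
#     population-average theta8 at step t makes step t+1 easier for EVERYONE).
#   - Selection: a member that predicted correctly reproduces; a member that
#     didn't is replaced.
# The point: the easier next step benefits the whole population equally, so
# selection is independent of theta8 and its average does not trend upward.

import numpy as np

rng = np.random.default_rng(0)
POP, STEPS = 200, 500

skill = rng.normal(size=POP)      # affects each member's own accuracy
theta8 = rng.normal(size=POP)     # affects only the next step's shared difficulty

mean_theta8 = [theta8.mean()]
difficulty = 1.0                  # shared difficulty of the current example

for t in range(STEPS):
    # Probability each member predicts correctly: depends on its own skill and
    # on the shared difficulty -- but not on its own theta8.
    p_correct = 1 / (1 + np.exp(-(skill - difficulty)))
    correct = rng.random(POP) < p_correct

    # Selection: replace a random incorrect member with a (noisy) copy of a
    # random correct member.
    if correct.any() and (~correct).any():
        parent = rng.choice(np.flatnonzero(correct))
        victim = rng.choice(np.flatnonzero(~correct))
        skill[victim] = skill[parent] + 0.01 * rng.normal()
        theta8[victim] = theta8[parent] + 0.01 * rng.normal()

    # A high average theta8 now makes the NEXT example easier for everyone.
    difficulty = 1.0 - 0.5 * np.tanh(theta8.mean())
    mean_theta8.append(theta8.mean())

print(f"mean theta8: start {mean_theta8[0]:+.3f}, end {mean_theta8[-1]:+.3f}")
# Expected result: no systematic increase in theta8, only random drift, because
# the benefit of high theta8 is shared by the whole population (free-riding).
```

Under this selection rule the average skill should rise, while the average θ8 shows no systematic trend, matching the point that high-θ8 carriers do not specifically benefit.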
Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).
Backtracking from this, I now realize that the core reasoning behind my argument (about evolutionary computation algorithms producing non-myopic behavior by default) was incorrect.
(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms, which seems incorrect!)