[EDIT: 2019-11-09: The argument I made here seems incorrect; see here (H/T Abram for showing me that my reasoning was wrong).]
Conjecture: It is not possible to set up a learning system which gets you full agency in the sense of eventually learning to take all the Pareto improvements.
[…]
There’s also reason to suspect the conjecture to be false. There’s a natural instrumental convergence toward dynamic consistency; a system will self-modify to greater consistency in many cases. If there’s an attractor basin around full agency, one would not expect it to be that hard to set up incentives which push things into that attractor basin.
Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model that takes a given Pareto improvement to outperform an otherwise-identical model that forgoes it, on future fitness evaluations.
Any global optimization technique can find the global optimum of a fixed evaluation function given time. This is a different problem. As I mentioned before, the assumption of simulable environments which you invoke to apply evolutionary algorithms to RL problems assumes too much; it fundamentally changes the problem from a control problem to a selection problem. This is exactly the kind of mistake which prompted me to come up with the selection/control distinction.
How would you propose to apply evolutionary algorithms to online learning? How would you propose to apply evolutionary algorithms to non-episodic environments? I’m not saying it can’t be done, but in doing so, your remark will no longer apply. For online non-episodic problems, you don’t get to think directly in terms of climbing a fitness landscape.
Taking a step back, I want to note two things about my model of the near future (if your model disagrees with those things, that disagreement might explain what’s going on in our recent exchanges):
(1) I expect many actors to be throwing a lot of money at selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.
(2) Suppose there’s some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.
Regarding the questions you raised:
How would you propose to apply evolutionary algorithms to online learning?
One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
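For concreteness, a minimal sketch of this kind of outer selection loop might look like the following. Everything here is a toy stand-in (the tasks, fitness metric, and mutation scheme are made up), and a realistic version would evolve the parameters of a learning rule rather than a static predictor; the point is only that fitness is measured offline over tasks where logged data or a simulator is available, and the selected system is then deployed, frozen, on the real online task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "many other tasks with available data or a simulator":
# each task is a short stream of inputs with a hidden linear rule.
def make_task():
    w_true = rng.normal(size=3)
    xs = rng.normal(size=(20, 3))
    return w_true, xs

TRAINING_TASKS = [make_task() for _ in range(10)]

def fitness(candidate):
    # negative mean squared prediction error across the training tasks
    errs = [np.mean((xs @ candidate - xs @ w_true) ** 2) for w_true, xs in TRAINING_TASKS]
    return -float(np.mean(errs))

POP, GENS = 40, 100
population = [rng.normal(size=3) for _ in range(POP)]
for _ in range(GENS):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: POP // 2]
    # mutation-only reproduction: copy a surviving parent and add small noise
    children = [parents[i] + rng.normal(scale=0.05, size=3)
                for i in rng.integers(0, len(parents), POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)   # this frozen system is what gets deployed online
```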
How would you propose to apply evolutionary algorithms to non-episodic environments?
I’m not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can’t be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there’s training data or for which the environment can be simulated).
(1) I expect many actors to be throwing a lot of money at selection processes (especially unsupervised learning), and I find it plausible that such efforts would produce transformative/dangerous systems.
Sure.
(2) Suppose there’s some competitive task that is financially important (e.g. algo-trading), for which actors build systems that use a huge neural network trained via gradient descent. I find it plausible that some actors will experiment with evolutionary computation methods, trying to produce a component that will outperform and replace that neural network.
Maybe, sure.
There seems to be something I’m missing here. What you said earlier:
Apart from this, it seems to me that some evolutionary computation algorithms tend to yield models that take all the Pareto improvements, given sufficiently long runtime. The idea is that at any point during training we should expect a model that takes a given Pareto improvement to outperform an otherwise-identical model that forgoes it, on future fitness evaluations.
is an essentially mathematical remark, which doesn’t have a lot to do with AI timelines and projections of which technologies will be used. I’m saying that this remark strikes me as a type error, because it confuses what I meant by “take all the Pareto improvements”—substituting the (conceptually straightforward, difficult only because of processing power limitations) selection concept for the (conceptually and technologically difficult) control concept.
I interpret you that way because your suggestion to apply evolutionary algorithms appears to be missing data. We can apply evolutionary algorithms if we can define a loss function. But the problem I’m pointing at (of full vs. partial agency) has to do with difficulties of defining a loss function.
How would you propose to apply evolutionary algorithms to online learning?
One can use a selection process—say, some evolutionary computation algorithm—to produce a system that performs well in an online learning task. The fitness metric would be based on the performance in many (other) online learning tasks for which training data is available (e.g. past stock prices) or for which the environment can be simulated (e.g. Atari games, robotic arm + boxes).
So, what is the argument that you’d tend to get full agency out of this? I think the situation is not very different from applying gradient descent in a similar way.
Using data from past stock prices, say, creates an implicit model that the agent’s trades can never influence the stock price. This is of course a mostly fine model for today’s ML systems, but it’s also an example of what I’m talking about—training procedures tend to create partial agency rather than full agency.
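As a toy illustration of that implicit model (hypothetical numbers and agent): a backtest over logged prices replays the same price series no matter what the agent does, so the training signal can never reflect any market impact of the agent’s trades.

```python
# The logged price series is replayed unchanged regardless of the agent's
# actions, so "my trades move the market" is a hypothesis this training
# signal can never reward or punish.

historical_prices = [101.0, 102.5, 101.8, 103.2]   # made-up logged data

def backtest(agent, prices):
    pnl, position = 0.0, 0
    for today, tomorrow in zip(prices, prices[1:]):
        position = agent(today, position)       # the agent trades...
        pnl += position * (tomorrow - today)    # ...but tomorrow's price was fixed in advance
    return pnl

# toy agent that always holds one unit (purely illustrative)
print(backtest(lambda price, position: 1, historical_prices))   # ≈ 2.2
```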
If you train the system on many online learning tasks, there will be no incentive to optimize across tasks—the training procedure implicitly assumes that the different tasks are independent. This is significant because you really need a whole lot of data in order to learn effective online learning tactics; it seems likely you’d end up splitting larger scenarios into a lot of tiny episodes, creating myopia.
I’m not saying I’d be happily confident that such a procedure would produce partial agents (therefore avoiding AI risk). And indeed, there are differences between doing this with gradient descent and evolutionary algorithms. One of the things I focused on in the post, time-discounting, becomes less relevant—but only because it’s more natural to split things into episodes in the case of evolutionary algorithms, which still creates myopia as a side effect.
What I’m saying is that there’s a real credit assignment problem here—you’re trying to pick between different policies (i.e., the code which the evolutionary algorithm is selecting between) based on which policy has performed better in the past. But you’ve taken a lot of actions in the past. And you’ve gotten a lot of individual pieces of feedback. You don’t know how to ascribe success/failure credit—that is, you don’t know how to match individual pieces of feedback to individual decisions you made (and hence to individual pieces of code).
So you solve the problem in a basically naive way: you assume that the feedback on “instance n” was related to the code you were running at that time. This is a myopic assumption!
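A minimal sketch of that naive scheme (the log format and policy names are hypothetical) might look like this: every piece of feedback is credited entirely to whichever policy variant was active at that instant, which bakes in the assumption that feedback at instance n has nothing to do with actions taken under earlier variants.

```python
from collections import defaultdict

training_log = [
    # (instance, active_policy, reward) -- hypothetical training log
    (0, "variant_A", 1.0),
    (1, "variant_A", 0.0),
    (2, "variant_B", 2.0),   # this reward may owe a lot to groundwork laid under variant_A,
    (3, "variant_B", 1.5),   # but it is credited to variant_B alone
]

credit = defaultdict(float)
for _, active_policy, reward in training_log:
    credit[active_policy] += reward   # all credit goes to the code running "at that time"

print(dict(credit))   # {'variant_A': 1.0, 'variant_B': 3.5}
```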
How would you propose to apply evolutionary algorithms to non-episodic environments?
I’m not sure whether this refers to non-episodic tasks (the issue being slower/sparser feedback?) or environments that can’t be simulated (in which case the idea above seems to apply: one can use a selection process, using other tasks for which there’s training data or for which the environment can be simulated).
The big thing with environments that can’t be simulated is that you don’t have a reset button, so you can’t back up and try again; so, episodic and simulable are pretty related.
Sparse feedback is related to what I’m talking about, but feels like a selection-oriented way of understanding the difficulty of control; “sparse feedback” still applies to very episodic problems such as chess. The difficulty with control is that arbitrarily long historical contexts can sometimes matter, and you have to learn anyway. But I agree that it’s much easier for this to present real difficulty if the rewards are sparse.
I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) “informal strategic stuff” (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.
The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version that perhaps you responded to: I want it to only be about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning for past stock prices or RL for Atari games (when each episode is a new game).
My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolute myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).
I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.
I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I’ll use the following example to explain why I think that:
Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can’t perform better than chance); however, if parameter θ8 has a large value, then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ8 does not (directly) affect predictions.
How should we expect our learning algorithm to update the parameter θ8 at the end of iteration 2?
If our learning algorithm is gradient descent, it seems that we should NOT expect θ8 to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ8) is expected to be positive.
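For concreteness, here is a minimal numerical sketch of this gradient-descent half of the example (all numbers are made up; the second parameter plays the role of θ8): because θ8 does not enter the iteration-2 prediction, and the iteration-2 input is treated as fixed data rather than as a consequence of iteration-1 behavior, the θ8 component of the iteration-2 gradient is exactly zero.

```python
import numpy as np

theta = np.array([0.3, 2.0])     # theta[1] plays the role of θ8
x2, y2 = 1.0, 1.0                # iteration-2 example, treated as fixed data
eps = 1e-6

def loss_iter2(theta):
    pred2 = theta[0] * x2        # by assumption, θ8 (theta[1]) is not used here
    return (pred2 - y2) ** 2

# finite-difference gradient of the iteration-2 loss
grad = np.array([
    (loss_iter2(theta + eps * np.eye(2)[i]) - loss_iter2(theta - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
print(grad)   # ≈ [-1.4, 0.0] -- no gradient signal ever pushes θ8 up
```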
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
Sorry for taking so long to respond to this one.
I don’t get the last step in your argument:
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models, in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
Why do those models outperform? I think you must be imagining a different setup, but I’m interpreting your setup as:
This is a classification problem, so we’re getting feedback on the correct label for each input.
It’s online, so we’re doing this in sequence, learning after each example.
We keep a population of models, which we update (perhaps only a little) after every training example; population members who predicted the label correctly get a chance to reproduce, and a few population members who didn’t are killed off.
The overall prediction made by the system is the average of all the predictions (or some other aggregation).
Large θ8 influences at one time-step will cause predictions which make the next time-step easier.
So, if the population has an abundance of high θ8 at one time step, the population overall does better in the next time step, because it’s easier for everyone to predict.
So, the frequency of high θ8 will not be increased at all. Just like in gradient descent, there’s no point at which the relevant population members are specifically rewarded.
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
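For concreteness, a toy simulation of the setup sketched above (the population size, fitness model, and update rule are all made-up assumptions, not anyone’s actual algorithm) illustrates this: a member’s own θ8 never affects its own selection, so θ8 only drifts, while the genes that actually drive each member’s predictions are the ones selection pushes on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each member carries a "theta8" gene and a "skill" gene. A high population-
# average theta8 makes the current example easier for EVERYONE, but each
# member is selected on its own prediction accuracy, which does not depend
# on its own theta8.
pop_size, steps = 200, 500
theta8 = rng.uniform(0, 1, pop_size)   # the gene of interest
skill = rng.uniform(0, 1, pop_size)    # genes that actually drive each member's predictions

theta8_start, skill_start = theta8.mean(), skill.mean()
for _ in range(steps):
    easiness = theta8.mean()                                 # shared benefit from the whole population
    p_correct = np.clip(0.3 * skill + 0.6 * easiness, 0, 1)  # individual accuracy; own theta8 plays no role
    correct = rng.random(pop_size) < p_correct
    winners, losers = np.flatnonzero(correct), np.flatnonzero(~correct)
    if len(winners) == 0 or len(losers) == 0:
        continue
    parents = rng.choice(winners, size=len(losers))          # losers replaced by mutated copies of winners
    theta8[losers] = theta8[parents] + rng.normal(0, 0.01, len(losers))
    skill[losers] = skill[parents] + rng.normal(0, 0.01, len(losers))

print(f"theta8 mean: {theta8_start:.2f} -> {theta8.mean():.2f}  (no selection pressure, only drift)")
print(f"skill mean:  {skill_start:.2f} -> {skill.mean():.2f}  (what selection actually rewards)")
```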
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).
Backtracking from this, I now realize that the core reasoning behind my argument—about evolutionary computation algorithms producing non-myopic behavior by default—was incorrect.
(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms, which seems incorrect!)