I suspect I made our recent discussions unnecessarily messy by simultaneously talking about: (1) “informal strategic stuff” (e.g. the argument that selection processes are strategically important, which I now understand is not contradictory to your model of the future); and (2) my (somewhat less informal) mathematical argument about evolutionary computation algorithms.
The rest of this comment involves only the mathematical argument. I want to make that argument narrower than the version you may have responded to: I want it to be only about absolute myopia, rather than more general concepts of myopia or full agency. Also, I (now) think my argument applies only to learning setups in which the behavior of the model/agent can affect what the model encounters in future iterations/episodes. Therefore, my argument does not apply to setups such as unsupervised learning on past stock prices or RL for Atari games (when each episode is a new game).
My argument is (now) only the following: Suppose we have a learning setup in which the behavior of the model at a particular moment may affect the future inputs/environments that the model will be trained on. I argue that evolutionary computation algorithms seem less likely to yield an absolute myopic model, relative to gradient descent. If you already think that, you might want to skip the rest of this comment (in which I try to support this argument).
I think the following property might make a learning algorithm more likely to yield models that are NOT absolute myopic:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future iterations/episodes.
I think that this property tends to apply to evolutionary computation algorithms more than it applies to gradient descent. I’ll use the following example to explain why I think that:
Suppose we have some online supervised learning setup. Suppose that during iteration 1 the model needs to predict random labels (and thus can't perform better than chance); however, if parameter θ8 has a large value, then the model makes predictions that cause the examples in iteration 2 to be more predictable. By assumption, during iteration 2 the value of θ8 does not (directly) affect predictions.
How should we expect our learning algorithm to update the parameter θ8 at the end of iteration 2?
If our learning algorithm is gradient descent, it seems that we should NOT expect θ8 to increase, because there is no iteration in which the relevant component of the gradient (i.e. the partial derivative of the objective with respect to θ8) is expected to be positive.
In contrast, if our learning algorithm is some evolutionary computation algorithm, the models (in the population) in which θ8 happens to be larger are expected to outperform the other models in iteration 2. Therefore, we should expect iteration 2 to increase the average value of θ8 (over the model population).
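To make the gradient-descent half of this comparison concrete, here is a minimal numerical sketch (the functional forms, sizes, and numbers are my own toy assumptions; only the dependence structure matters): because the iteration-2 prediction does not depend on θ8, the partial derivative of the iteration-2 loss with respect to θ8 is exactly zero, so gradient descent receives no signal to increase it.

```python
# Minimal sketch of the two-iteration setup above (hypothetical functional
# forms; only the dependence structure matters).
#
# Assumptions (mine, not from the original comment):
#   - theta is a small parameter vector; index 8 plays the role of "theta_8".
#   - In iteration 1 the labels are random, so the expected gradient w.r.t.
#     every parameter is zero there.
#   - In iteration 2 the prediction depends on theta[0..7] but NOT on theta[8],
#     even though theta[8] is what made iteration 2's example more predictable.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=9)          # toy parameter vector; theta[8] is "theta_8"

def predict_iter2(theta, x):
    """Iteration-2 prediction: by assumption it ignores theta[8]."""
    return np.tanh(theta[:8] @ x)   # theta[8] never enters the computation

def loss_iter2(theta, x, y):
    return (predict_iter2(theta, x) - y) ** 2

# Finite-difference estimate of d(loss_2)/d(theta_8):
x2, y2 = rng.normal(size=8), 1.0    # some iteration-2 example
eps = 1e-6
bumped = theta.copy()
bumped[8] += eps
grad_theta8 = (loss_iter2(bumped, x2, y2) - loss_iter2(theta, x2, y2)) / eps

print(grad_theta8)  # ~0.0: gradient descent gets no signal to increase theta_8
```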
Sorry for taking so long to respond to this one.
I don’t get the last step in your argument:
Why do those models outperform? I think you must be imagining a different setup, but I’m interpreting your setup as:
This is a classification problem, so, we’re getting feedback on correct labels X for some Y.
It’s online, so we’re doing this in sequence, and learning after each.
We keep a population of models, which we update (perhaps only a little) after every training example; population members who predicted the label correctly get a chance to reproduce, and a few population members who didn’t are killed off.
The overall prediction made by the system is the average of all the predictions (or some other aggregation).
A large θ8 at one time-step will cause predictions which make the next time-step easier.
So, if the population has an abundance of high θ8 at one time step, the population overall does better in the next time step, because it’s easier for everyone to predict.
So, the frequency of high θ8 will not be increased at all. Just like in gradient descent, there’s no point at which the relevant population members are specifically rewarded.
In other words, many members of the population can swoop in and reap the benefits caused by high-θ8 members. So high-θ8 carriers do not specifically benefit.
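To illustrate this free-riding point, here is a toy simulation of the population setup as described above (a stylized, repeated version; the population size, selection rule, and functional forms are all my own assumptions): each member's chance of being selected depends on its own skill and on a difficulty level shared by the whole population, while a member's θ8 affects only that shared difficulty at the next step. Because the benefit is shared, selection is blind to θ8, and its population average should only drift.

```python
# Toy simulation of the population setup (all specifics are my own assumptions,
# chosen only to match the dependence structure):
#   - Each member has a "skill" parameter (affects its own predictions) and a
#     theta8 parameter (does NOT affect its own predictions, but a high
#     population-average theta8 at step t makes step t+1 easier for EVERYONE).
#   - Selection: a member that predicted correctly reproduces; a member that
#     didn't is replaced.
# The point: the easier next step benefits the whole population equally, so
# selection is independent of theta8 and its average does not trend upward.

import numpy as np

rng = np.random.default_rng(0)
POP, STEPS = 200, 500

skill = rng.normal(size=POP)      # affects each member's own accuracy
theta8 = rng.normal(size=POP)     # affects only the next step's shared difficulty

mean_theta8 = [theta8.mean()]
difficulty = 1.0                  # shared difficulty of the current example

for t in range(STEPS):
    # Probability each member predicts correctly: depends on its own skill and
    # on the shared difficulty -- but not on its own theta8.
    p_correct = 1 / (1 + np.exp(-(skill - difficulty)))
    correct = rng.random(POP) < p_correct

    # Selection: replace a random incorrect member with a (noisy) copy of a
    # random correct member.
    if correct.any() and (~correct).any():
        parent = rng.choice(np.flatnonzero(correct))
        victim = rng.choice(np.flatnonzero(~correct))
        skill[victim] = skill[parent] + 0.01 * rng.normal()
        theta8[victim] = theta8[parent] + 0.01 * rng.normal()

    # A high average theta8 now makes the NEXT example easier for everyone.
    difficulty = 1.0 - 0.5 * np.tanh(theta8.mean())
    mean_theta8.append(theta8.mean())

print(f"mean theta8: start {mean_theta8[0]:+.3f}, end {mean_theta8[-1]:+.3f}")
# Expected result: no systematic increase in theta8, only random drift, because
# the benefit of high theta8 is shared by the whole population (free-riding).
```

Under this selection rule the average skill should rise, while the average θ8 shows no systematic trend, matching the point that high-θ8 carriers do not specifically benefit.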
Oops! I agree. My above example is utterly confused and incorrect (I think your description of my setup matches what I was imagining).
Backtracking from this, I now realize that the core reasoning behind my argument (about evolutionary computation algorithms producing non-myopic behavior by default) was incorrect.
(My confusion stemmed from thinking that an argument I made about a theoretical training algorithm also applies to many evolutionary computation algorithms, which seems incorrect!)