Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to point away from parameter settings that increase the loss on the current episode, even if those settings would reduce it on future episodes.
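To spell out how I’m picturing it, here is a minimal sketch (made-up names and a made-up per-episode loss, not any real training setup): each update uses only the current episode’s gradient, so there is no term that credits a parameter for what it would do to future episodes.

```python
import numpy as np

def episode_grad(params, episode_data):
    # Hypothetical per-episode loss: squared error of a linear predictor.
    x, y = episode_data
    return 2.0 * (params @ x - y) * x

def online_gd(params, episode_stream, lr=0.01):
    for episode_data in episode_stream:
        # The gradient is computed from this episode's loss alone; a parameter
        # change that hurt this episode but helped later ones would still be
        # pushed against here.
        params = params - lr * episode_grad(params, episode_data)
    return params

# Purely illustrative usage with random data:
rng = np.random.default_rng(0)
params = online_gd(rng.standard_normal(3),
                   [(rng.standard_normal(3), 1.0) for _ in range(10)])
```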
I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
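To gesture at why, here is a toy sketch I made up (the environment, its “investment” dynamics, and all the names are hypothetical, not any real setup): fitness is the total return over several consecutive episodes, so a parameter value that sacrifices current-episode reward to raise later episodes’ reward can outcompete a myopic one.

```python
import numpy as np

class ToyEnv:
    """Hypothetical toy environment: the single parameter p in [0, 1] is the
    fraction of this episode's reward the agent 'invests'. Investing lowers
    the current episode's return but raises the next episode's budget."""
    def __init__(self):
        self.budget = 1.0
    def run_episode(self, p):
        p = float(np.clip(p, 0.0, 1.0))
        reward = self.budget * (1.0 - p)           # consume what isn't invested
        self.budget = 1.0 + 2.0 * self.budget * p  # investment pays off later
        return reward

def total_return(p, n_episodes=5):
    env = ToyEnv()
    return sum(env.run_episode(p) for _ in range(n_episodes))

def evolve(n_generations=50, pop_size=20, sigma=0.05, rng=np.random.default_rng(0)):
    pop = list(rng.uniform(0.0, 1.0, size=pop_size))
    for _ in range(n_generations):
        fitness = np.array([total_return(p) for p in pop])
        # Keep the top half by cross-episode fitness and refill with mutated
        # copies: parameters persist according to how they affect future
        # episodes too, not just the current one.
        survivors = [pop[i] for i in np.argsort(fitness)[-pop_size // 2:]]
        children = [s + sigma * rng.standard_normal() for s in survivors]
        pop = survivors + children
    return pop
```

In this toy setup the surviving population drifts toward p > 0, i.e. behavior that sacrifices current-episode reward for future episodes, even though a per-episode gradient on reward would push p toward 0.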
How can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs. cross-episodic optimizers?
Maybe learning algorithms that have the following property are more likely to yield models with “cross-episodic behavior”:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
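As a crude way to pin down what I mean by “persist” (hypothetical names, just to state the distinction):

```python
def persists_myopic(delta_loss_current_episode):
    # Per-episode rule (gradient-descent-like): a change sticks only if it
    # doesn't hurt the current episode.
    return delta_loss_current_episode <= 0

def persists_cross_episode(delta_loss_current_episode, delta_loss_future_episodes):
    # Selection-like rule: a change sticks if it helps the combined loss over
    # current and future episodes, so behavior that only pays off in later
    # episodes can still end up in the final model.
    return delta_loss_current_episode + delta_loss_future_episodes <= 0
```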
Also, what name would you suggest for this problem, if not “inner alignment”?
Maybe “non-myopia”, as Evan suggested.