Here’s a simple toy model that illustrates the difference between 2 and 3 (that doesn’t talk about attention layers, etc.).
Say you have a bunch of triplets (x, z1, z2). You want to train a model that predicts z1 from x and z2 from (x, z1).
Your model consists of three components: f, g1, and g2. It makes predictions as follows:
$y = f(x)$
$z_1 = g^1(y)$
$z_2 = g^2(y, z_1)$
(Why have such a model? Why not have two completely separate models, one for predicting z1 and one for predicting z2? Because it might be more efficient to use a single f both for predicting z1 and for predicting z2, given that both predictions presumably require “interpreting” x.)
So, intuitively, it first builds an “inner representation” (embedding) of x. Then it sequentially makes predictions based on that inner representation.
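To make the setup concrete, here is a minimal PyTorch sketch of such a model. The comment doesn't specify any architecture, so the module choices, the dimensions, and the concatenation inside g2 are all illustrative assumptions, not part of the original toy model.

```python
import torch
import torch.nn as nn

D_X, D_Y, D_Z = 8, 16, 1  # assumed toy dimensions (not from the comment)

# f: the "interpreter" that builds the inner representation y from x
f = nn.Sequential(nn.Linear(D_X, D_Y), nn.ReLU(), nn.Linear(D_Y, D_Y))

# g1: predicts z1 from y alone
g1 = nn.Linear(D_Y, D_Z)

# g2: predicts z2 from y and z1 (here simply by concatenating its two inputs)
class G2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(D_Y + D_Z, D_Z)

    def forward(self, y, z1):
        return self.net(torch.cat([y, z1], dim=-1))

g2 = G2()

x = torch.randn(32, D_X)   # a batch of toy inputs
y = f(x)                   # y  = f(x)
z1_hat = g1(y)             # z1 = g1(y)
z2_hat = g2(y, z1_hat)     # z2 = g2(y, z1); at test time g2 sees the model's own z1
```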
Now you train f and g1 to minimize the prediction loss on the (x, z1) parts of the triplets. Simultaneously, you train f and g2 to minimize the prediction loss on the full (x, z1, z2) triplets. For example, you update f and g1 with the gradients
$\nabla_{\theta_0, \theta_1}\, \ell\bigl(z_1,\; g^1_{\theta_1}(f_{\theta_0}(x))\bigr)$
and you update f and g2 with the gradients
$\nabla_{\theta_0, \theta_2}\, \ell\bigl(z_2,\; g^2_{\theta_2}(f_{\theta_0}(x), z_1)\bigr)$.
(The z1 here is the “true” z1, not one generated by the model itself.)
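Here is what one training step could look like, continuing the sketch above (the MSE loss and the single shared optimizer opt are assumptions for illustration). Because g2 is fed the true z1, the z2 loss has no gradient path through g1 at all, so back-propagating the summed losses reproduces exactly the two gradients above.

```python
import torch
import torch.nn.functional as F

def train_step(f, g1, g2, opt, x, z1_true, z2_true):
    y = f(x)                                     # shared inner representation f_{theta_0}(x)
    loss1 = F.mse_loss(g1(y), z1_true)           # reaches theta_0 and theta_1
    loss2 = F.mse_loss(g2(y, z1_true), z2_true)  # reaches theta_0 and theta_2; g2 sees the
                                                 # *true* z1, so no gradient flows into g1
    opt.zero_grad()
    (loss1 + loss2).backward()                   # decomposes into the two gradients in the text
    opt.step()
    return loss1.item(), loss2.item()
```

Here opt would be something like `torch.optim.SGD(list(f.parameters()) + list(g1.parameters()) + list(g2.parameters()), lr=1e-2)`.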
This training pressures g1 to be myopic in the second and third sense described in the post. In fact, even if we were to train θ0,θ2 with the z1 predicted by g1 rather than the true z1, g1 would still be pressured to be myopic.
Type 3 myopia: Training doesn’t pressure g1 to output something that makes z2 follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data z1=0 implies z2=0, while under z1=1, z2 follows some distribution that depends in complicated ways on x. Then g1 will not try to predict z1=0 more often than the true distribution of z1 warrants.
Type 2 myopia: g1 won’t try to provide useful information to g2 in its output, even if it could. For example, imagine that the z1s are strings representing real numbers. Imagine that x is always a natural number, that z1 is the x-th Fibonacci number, and that z2 is the x+1-th Fibonacci number. Imagine further that the model representing g1 is large enough to compute the x-th Fibonacci number, while the model representing g2 is not. Then one might think that low predictive loss could be achieved by having g1 output the x-th Fibonacci number and additionally encode, say, the x−1-th Fibonacci number in the decimal digits. (E.g., g1(10)=55.0000000000034.) g2 could then compute the x+1-th Fibonacci number by adding the x-th Fibonacci number to the x−1-th one recovered from those decimals. But the above training will not give rise to this strategy, because g2 gets the true z1 as input, not the one produced by g1. Further, even if we were to change this, there would still be pressure against this strategy, because g1 (θ1) is not optimized to give useful information to g2. (The gradient used to update θ1 doesn’t consider the loss on predicting z2.) If g1 ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information, since doing so gives higher prediction accuracy on z1.
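To make that parenthetical concrete: even in the changed setup where g2 is fed g1's own prediction, taking the gradient only with respect to θ0,θ2 corresponds to putting a stop-gradient on that prediction. A hedged sketch, reusing the assumed modules and losses from above (detach is one way to implement the stop-gradient; it also cuts the path from the z2 loss into f via g1, which is discussed further below):

```python
import torch
import torch.nn.functional as F

def train_step_predicted_z1(f, g1, g2, opt, x, z1_true, z2_true):
    y = f(x)
    z1_hat = g1(y)
    loss1 = F.mse_loss(z1_hat, z1_true)            # the only loss that updates theta_1
    # Feed g1's own prediction to g2, but detach it (stop-gradient): the z2 loss
    # then only produces gradients for theta_0 and theta_2, never for theta_1.
    # So g1 gets no credit for smuggling extra information into its output; any
    # such decimal digits only hurt loss1 and get trained away.
    loss2 = F.mse_loss(g2(y, z1_hat.detach()), z2_true)
    opt.zero_grad()
    (loss1 + loss2).backward()
    opt.step()
```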
Of course, g1 still won’t be pressured to be type-1-myopic. If predicting z1 requires predicting z2, then g1 will be trained to predict (“plan”) z2.
(Obviously, g2 is pressured to be myopic in this simple model.)
Now what about f? Well, f is optimized both to enable predicting z1 from f(x) and predicting z2 from (f(x), z1). Therefore, if resources are relevantly constrained in some way (e.g., the model computing f is small, or the output of f is forced to be small), f will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for f (and thus in some sense the overall model) can and will sacrifice accuracy on z1 to achieve better accuracy on z2. In particular, we should expect trained models to find an efficient tradeoff between accuracy on z1 and accuracy on z2. When z1 is relatively easy to predict, f will spend most of its computation budget on predicting z2.
So, f is not “Type 2” myopic. Or perhaps put differently: The calculations going into predicting z1 aren’t optimized purely for predicting z1; they are also shaped to help predict z2.
However, f is still “Type 3” myopic. Because the prediction made by g1 isn’t fed (in training) as an input to g2 or the loss, there’s no pressure towards making f influence the output of g1 in a way that has anything to do with z2. (In contrast to the myopia of g1, this really does hinge on not using g2(f(x),g1(f(x))) in training. If g2(f(x),g1(f(x))) mattered in training, then there would be pressure for f to trick g1 into performing calculations that are useful for predicting z2. Unless you use stop-gradients...)
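A hedged sketch of that last variant, again with the assumed modules from above. With stop_grad=False, the z2 loss back-propagates through g1's output into θ1 and, via g1, into f, which is the pressure on f described here; with stop_grad=True that path is cut, and the z2 loss reaches f only through the y that g2 receives directly.

```python
import torch
import torch.nn.functional as F

def train_step_end_to_end(f, g1, g2, opt, x, z1_true, z2_true, stop_grad=True):
    y = f(x)
    z1_hat = g1(y)
    loss1 = F.mse_loss(z1_hat, z1_true)
    # Now g2(f(x), g1(f(x))) matters in training.
    z1_for_g2 = z1_hat.detach() if stop_grad else z1_hat
    loss2 = F.mse_loss(g2(y, z1_for_g2), z2_true)
    # stop_grad=False: loss2 flows back through g1's output into theta_1 and,
    #   through g1, into theta_0 -- pressure for f (and g1) to shape z1 so that
    #   it helps predict z2.
    # stop_grad=True:  that path is blocked; loss2 reaches f only via the y fed
    #   directly to g2.
    opt.zero_grad()
    (loss1 + loss2).backward()
    opt.step()
```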
* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.