Here’s a simple toy model that illustrates the difference between 2 and 3 (that doesn’t talk about attention layers, etc.).
Say you have a bunch of triplets (x, z1, z2). You want to train a model that predicts z1 from x and z2 from (x, z1).
Your model consists of three components: f, g1, and g2. It makes predictions as follows:
$y = f(x)$
$z_1 = g^1(y)$
$z_2 = g^2(y, z_1)$
(Why have such a model? Why not have two completely separate models, one for predicting z1 and one for predicting z2? Because it might be more efficient to use a single f both for predicting z1 and for predicting z2, given that both predictions presumably require “interpreting” x.)
So, intuitively, it first builds an “inner representation” (embedding) of x. Then it sequentially makes predictions based on that inner representation.
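To make the setup concrete, here is a minimal PyTorch sketch of such a model. The comment doesn't specify any architecture, so the module choices, the dimensions, and the concatenation inside g2 are all illustrative assumptions, not part of the original toy model.

```python
import torch
import torch.nn as nn

D_X, D_Y, D_Z = 8, 16, 1  # assumed toy dimensions (not from the comment)

# f: the "interpreter" that builds the inner representation y from x
f = nn.Sequential(nn.Linear(D_X, D_Y), nn.ReLU(), nn.Linear(D_Y, D_Y))

# g1: predicts z1 from y alone
g1 = nn.Linear(D_Y, D_Z)

# g2: predicts z2 from y and z1 (here simply by concatenating its two inputs)
class G2(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(D_Y + D_Z, D_Z)

    def forward(self, y, z1):
        return self.net(torch.cat([y, z1], dim=-1))

g2 = G2()

x = torch.randn(32, D_X)   # a batch of toy inputs
y = f(x)                   # y  = f(x)
z1_hat = g1(y)             # z1 = g1(y)
z2_hat = g2(y, z1_hat)     # z2 = g2(y, z1); at test time g2 sees the model's own z1
```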
Now you train f and g1 to minimize the prediction loss on the (x, z1) parts of the triplets. Simultaneously, you train f and g2 to minimize the prediction loss on the full (x, z1, z2) triplets. For example, you update f and g1 with the gradients
$\nabla_{\theta_0, \theta_1}\, \ell\bigl(z_1,\; g^1_{\theta_1}(f_{\theta_0}(x))\bigr)$
and you update f and g2 with the gradients
$\nabla_{\theta_0, \theta_2}\, \ell\bigl(z_2,\; g^2_{\theta_2}(f_{\theta_0}(x), z_1)\bigr)$.
(The z1 here is the “true” z1, not one generated by the model itself.)
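Here is what one training step could look like, continuing the sketch above (the MSE loss and the single shared optimizer opt are assumptions for illustration). Because g2 is fed the true z1, the z2 loss has no gradient path through g1 at all, so back-propagating the summed losses reproduces exactly the two gradients above.

```python
import torch
import torch.nn.functional as F

def train_step(f, g1, g2, opt, x, z1_true, z2_true):
    y = f(x)                                     # shared inner representation f_{theta_0}(x)
    loss1 = F.mse_loss(g1(y), z1_true)           # reaches theta_0 and theta_1
    loss2 = F.mse_loss(g2(y, z1_true), z2_true)  # reaches theta_0 and theta_2; g2 sees the
                                                 # *true* z1, so no gradient flows into g1
    opt.zero_grad()
    (loss1 + loss2).backward()                   # decomposes into the two gradients in the text
    opt.step()
    return loss1.item(), loss2.item()
```

Here opt would be something like `torch.optim.SGD(list(f.parameters()) + list(g1.parameters()) + list(g2.parameters()), lr=1e-2)`.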
This training pressures g1 to be myopic in the second and third sense described in the post. In fact, even if we were to train θ0,θ2 with the z1 predicted by g1 rather than the true z1, g1 would still be pressured to be myopic.
Type 3 myopia: Training doesn’t pressure g1 to output something that makes z2 follow an easier-to-predict (computationally or information-theoretically) distribution. For example, imagine that on the training data z1=0 implies z2=0, while under z1=1, z2 follows some distribution that depends in complicated ways on x. Then g1 will not try to predict z1=0 more often than the true distribution of z1 warrants.
Type 2 myopia: g1 won’t try to provide useful information to g2 in its output, even if it could. For example, imagine that the z1s are strings representing real numbers. Imagine that x is always a natural number, that z1 is the x-th Fibonacci number, and that z2 is the x+1-th Fibonacci number. Imagine further that the model representing g1 is large enough to compute the x-th Fibonacci number, while the model representing g2 is not. Then one might think that low predictive loss could be achieved by having g1 output the x-th Fibonacci number and additionally encode, say, the x−1-th Fibonacci number in the decimal digits. (E.g., g1(10)=55.0000000000034.) g2 could then compute the x+1-th Fibonacci number by adding the x-th Fibonacci number to the x−1-th one recovered from those decimals. But the above training will not give rise to this strategy, because g2 gets the true z1 as input, not the one produced by g1. Further, even if we were to change this, there would still be pressure against this strategy, because g1 (θ1) is not optimized to give useful information to g2. (The gradient used to update θ1 doesn’t consider the loss on predicting z2.) If g1 ever follows the policy of encoding information in the decimal digits, it will quickly learn to remove that information, since doing so gives higher prediction accuracy on z1.
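To make that parenthetical concrete: even in the changed setup where g2 is fed g1's own prediction, taking the gradient only with respect to θ0,θ2 corresponds to putting a stop-gradient on that prediction. A hedged sketch, reusing the assumed modules and losses from above (detach is one way to implement the stop-gradient; it also cuts the path from the z2 loss into f via g1, which is discussed further below):

```python
import torch
import torch.nn.functional as F

def train_step_predicted_z1(f, g1, g2, opt, x, z1_true, z2_true):
    y = f(x)
    z1_hat = g1(y)
    loss1 = F.mse_loss(z1_hat, z1_true)            # the only loss that updates theta_1
    # Feed g1's own prediction to g2, but detach it (stop-gradient): the z2 loss
    # then only produces gradients for theta_0 and theta_2, never for theta_1.
    # So g1 gets no credit for smuggling extra information into its output; any
    # such decimal digits only hurt loss1 and get trained away.
    loss2 = F.mse_loss(g2(y, z1_hat.detach()), z2_true)
    opt.zero_grad()
    (loss1 + loss2).backward()
    opt.step()
```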
Of course, g1 still won’t be pressured to be type-1-myopic. If predicting z1 requires predicting z2, then g1 will be trained to predict (“plan”) z2.
(Obviously, g2 is pressured to be myopic in this simple model.)
Now what about f? Well, f is optimized both to enable predicting z1 from f(x) and predicting z2 from (f(x), z1). Therefore, if resources are relevantly constrained in some way (e.g., the model computing f is small, or the output of f is forced to be small), f will sometimes sacrifice performance on one to improve performance on the other. So, adapting a paragraph from the post: The trained model for f (and thus in some sense the overall model) can and will sacrifice accuracy on z1 to achieve better accuracy on z2. In particular, we should expect trained models to find an efficient tradeoff between accuracy on z1 and accuracy on z2. When z1 is relatively easy to predict, f will spend most of its computation budget on predicting z2.
So, f is not “Type 2” myopic. Or perhaps put differently: The calculations going into predicting z1 aren’t optimized purely for predicting z1; they are also shaped to help predict z2.
However, f is still “Type 3” myopic. Because the prediction made by g1 isn’t fed (in training) as an input to g2 or the loss, there’s no pressure towards making f influence the output of g1 in a way that has anything to do with z2. (In contrast to the myopia of g1, this really does hinge on not using g2(f(x),g1(f(x))) in training. If g2(f(x),g1(f(x))) mattered in training, then there would be pressure for f to trick g1 into performing calculations that are useful for predicting z2. Unless you use stop-gradients...)
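A hedged sketch of that last variant, again with the assumed modules from above. With stop_grad=False, the z2 loss back-propagates through g1's output into θ1 and, via g1, into f, which is the pressure on f described here; with stop_grad=True that path is cut, and the z2 loss reaches f only through the y that g2 receives directly.

```python
import torch
import torch.nn.functional as F

def train_step_end_to_end(f, g1, g2, opt, x, z1_true, z2_true, stop_grad=True):
    y = f(x)
    z1_hat = g1(y)
    loss1 = F.mse_loss(z1_hat, z1_true)
    # Now g2(f(x), g1(f(x))) matters in training.
    z1_for_g2 = z1_hat.detach() if stop_grad else z1_hat
    loss2 = F.mse_loss(g2(y, z1_for_g2), z2_true)
    # stop_grad=False: loss2 flows back through g1's output into theta_1 and,
    #   through g1, into theta_0 -- pressure for f (and g1) to shape z1 so that
    #   it helps predict z2.
    # stop_grad=True:  that path is blocked; loss2 reaches f only via the y fed
    #   directly to g2.
    opt.zero_grad()
    (loss1 + loss2).backward()
    opt.step()
```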
* This comes with all the usual caveats of course. In principle, the inductive bias may favor a situationally aware model that is extremely non-myopic in some sense.