How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior? The loss will depend on the prediction of ” a” or ” an”; it will also depend on the prediction of whatever follows “The object is an” or “The object is a”, depending on which appears in the current training sample. AFAICT the model will just optimize next-token predictions in both cases...?
How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior?
As I just finished explaining, the claim of myopia is that the model optimized for next-token prediction is modeling only the next token, and nothing else, because “it is just trained to predict the next token conditional on its input”. The claim of non-myopia is that a model will be modeling additional future tokens in addition to the next token, a capability induced by attempting to model the next token better. If myopia were true, GPT-3 would not be attempting to infer ‘the next token is ‘a’/‘an’, but then what is the token after that: is it talking about “tiger” or “orangutan”?’ and then backwards-chaining to determine ‘a’/‘an’, because the next token could not itself be “tiger” or “orangutan” (that would be ungrammatical). The two claims are not the same thing, and I have given a concrete example both of what it would mean to model ‘a’/‘an’ myopically (predicting it solely from the base rates of ‘a’ vs ‘an’) and of GPT-3 not doing so: it adjusts its prediction based on a single specific later token (‘tiger’ vs ‘orangutan’)*.
If the idea that GPT-3 would be myopic strikes you as absurd, and you cannot believe anyone would believe anything as stupid as ‘GPT-3 would just predict the next token without attempting to predict relevant later tokens’ because natural language is so obviously saturated with all sorts of long-range or reverse dependencies which a myopic model would ignore (and so predict the next token badly), then good! The ‘a’/‘an’ example works, and so there’s no need to bring in more elaborate hypothetical examples like analyzing hapax legomena, or imagining encoding a text maze into a prompt and asking GPT-3 for the first step (which could only be done accurately by planning through the maze, finding the optimal trajectory, and then emitting the first step while throwing away the rest), where someone could reasonably wonder if that’s even possible, much less whether GPT-3 would actually have learned any such thing.
* My example here is not perfect because I had to change the wording a lot between the vowel-initial and consonant-initial versions, which muddies the waters a bit (maybe you could argue that phrases like ‘colored orange’ lead to an ‘a’ bias without anything recognizable as “inference of ‘tiger’” involved, and vice-versa for “clever great ape”/“orangutan”, as a sheer brute-force function of low-order English statistics); preferably you’d do something like instruction-following, where the model is told that whether the final word starts with a vowel will switch based on a single artificial token at the beginning of the prompt, so there could be no such shortcut cheating. But in my defense, the Playground was almost unusable when I was trying to write my comment and I had to retry each completion >5 times to get a working one, so I got what I got.
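For anyone who wants to poke at this kind of probe themselves, here is a minimal sketch using an open model (GPT-2 via HuggingFace transformers) as a stand-in for the GPT-3 Playground setup described above; the two prompts are illustrative inventions, not the exact wording used in the experiment:

```python
# Rough sketch of the 'a'/'an' probe described above, using GPT-2 via
# HuggingFace transformers as a stand-in for the GPT-3 Playground.
# The prompts are illustrative, not the original experiment's wording.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompts = {
    "tiger-flavored": "It is a big striped cat, colored orange and black. The object is",
    "orangutan-flavored": "It is a clever great ape with shaggy red fur. The object is",
}

# " a" and " an" are each single tokens in the GPT-2 vocabulary.
a_id = tokenizer.encode(" a")[0]
an_id = tokenizer.encode(" an")[0]

for name, prompt in prompts.items():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]  # logits for the next token
    probs = torch.softmax(next_token_logits, dim=-1)
    print(f"{name}: P(' a') = {probs[a_id].item():.3f}, "
          f"P(' an') = {probs[an_id].item():.3f}")
```

If the model is doing anything like the backwards chaining described above, the ‘tiger’-flavored prompt should shift mass toward ” a” and the ‘orangutan’-flavored one toward ” an”; a purely base-rate (myopic, in gwern’s sense) predictor would give roughly the same ratio for both.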
As I just finished explaining, the claim of myopia is that the model optimized for next-token prediction is modeling only the next token, and nothing else, because “it is just trained to predict the next token conditional on its input”. The claim of non-myopia is that a model will be modeling additional future tokens in addition to the next token, a capability induced by attempting to model the next token better.
These definitions are not equivalent to the ones we gave (and as far as I’m aware the definitions we use are much closer to commonly used definitions of myopia and non-myopia than the ones you give here).
Arthur is also entirely correct that your examples are not evidence of non-myopia by the definitions we use.
The definition of myopia that we use is that the model minimises loss on the next token and the next token alone; this is not the same as requiring that the model only ‘models’, or ‘considers information directly relevant to’, the next token and the next token alone.
A model exhibiting myopic behaviour can still be great at the kinds of tasks you describe as requiring ‘modelling of future tokens’. The claim that some model was displaying myopic behaviour here would simply be that all of this ‘future modelling’ (or any other internal processing) is done entirely in service of minimising loss on just the next token. This is in contrast to the kinds of non-myopic models we are considering in this post, where minimising loss over a multi-token completion encourages the model, in certain situations, to sacrifice some loss on early tokens.
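To spell out the contrast being drawn here (the notation below is introduced for illustration and is not taken from the post): a myopic training signal scores each position’s prediction on its own, while a non-myopic one scores a whole sampled completion,

$$\mathcal{L}_{\text{myopic}}(\theta)=\mathbb{E}_{x\sim\mathcal{D}}\left[-\log p_\theta(x_{t+1}\mid x_{\le t})\right]$$

$$\mathcal{L}_{\text{non-myopic}}(\theta)=\mathbb{E}_{x_{\le t}\sim\mathcal{D},\;x_{t+1:t+k}\sim p_\theta}\left[L(x_{t+1:t+k})\right]$$

where L scores the entire k-token completion. Under the first objective the choice at position t+1 is judged only by its own log-loss; under the second it is also credited or blamed for what it does to positions t+2 through t+k, which is what can make sacrificing some loss on early tokens worthwhile.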
Some Twitter discussion: https://twitter.com/saprmarks/status/1715100934936854691
Recent papers demonstrating that LLMs are not myopic and that you can extract predictions of tokens beyond the next token:
“Eliciting Latent Predictions from Transformers with the Tuned Lens”, Belrose et al 2023
“Jump to Conclusions: Short-Cutting Transformers With Linear Transformations”, Din et al 2023
“Future Lens: Anticipating Subsequent Tokens from a Single Hidden State”, Pal et al 2023
At the risk of being too vague to be understood… you can always factorise a probability distribution as P(X1)P(X2|X1) etc, so plain next-token prediction should be able to do the job. But maybe there’s a more natural “causal” factorisation that goes like P(subject)P(verb|subject) etc, which is not ordered the same way as the tokens but from which the token probabilities can be derived, and maybe that’s easier to learn than the raw next-token probabilities.
I’ve no idea if this is what gwern meant.
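One way to make the factorisation point concrete with the ‘a’/‘an’ example from earlier in the thread (again, notation introduced here purely for illustration): the chain rule always lets you write the joint as a product of next-token conditionals, but a model is free to compute any one of those conditionals by marginalising over a later, more “natural” variable, such as the noun w that the article will introduce,

$$p(\text{an}\mid\text{prefix})=\sum_{w}p(w\mid\text{prefix})\,p(\text{an}\mid\text{prefix},w)\;\approx\sum_{w\ \text{vowel-initial}}p(w\mid\text{prefix}).$$

Nothing about left-to-right next-token training rules out organising the internal computation around w (‘tiger’ vs ‘orangutan’) rather than around the article itself; that is one reading of both this comment and gwern’s “backwards chaining” above.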