Another way to go about testing for non-myopia in plain LLMs might be to look for tokens that are rare in the training distribution, but when they do occur are followed by text that’s very easy to predict.
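As a very rough sketch of how that test could be operationalised (GPT-2 and a placeholder corpus stand in for the real model and training distribution; the ‘rare’ threshold, the window size, and every name below are illustrative assumptions, not anything from the comment above):

```python
# Sketch: find tokens that are rare in a corpus but whose continuations the
# model predicts very confidently. GPT-2 and `corpus` are stand-ins; the
# thresholds are arbitrary illustrations.
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

corpus = ["..."]  # placeholder: held-out text resembling the training data

# 1. How rare is each token in the corpus?
token_counts = Counter()
encoded_docs = []
for doc in corpus:
    ids = tokenizer(doc, return_tensors="pt", truncation=True, max_length=1024).input_ids[0]
    encoded_docs.append(ids)
    token_counts.update(ids.tolist())

# 2. For each occurrence of a rare token, how confidently does the model
#    predict the next few tokens that follow it?
def mean_logprob_after(ids, pos, window=5):
    """Average log-prob the model assigns to the `window` tokens after position `pos`."""
    end = min(pos + 1 + window, len(ids))
    with torch.no_grad():
        logits = model(ids[:end].unsqueeze(0)).logits[0]
    logprobs = torch.log_softmax(logits, dim=-1)
    # logits[t] is the prediction for ids[t+1]
    per_token = [logprobs[t, ids[t + 1]].item() for t in range(pos, end - 1)]
    return sum(per_token) / len(per_token)

candidates = []
for ids in encoded_docs:
    for pos, tok in enumerate(ids.tolist()[:-1]):
        if token_counts[tok] <= 2:  # "rare" threshold, arbitrary
            candidates.append((tokenizer.decode([tok]), mean_logprob_after(ids, pos)))

# Rare tokens followed by very predictable text sort to the top.
candidates.sort(key=lambda x: -x[1])
print(candidates[:10])
```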
I think there are simpler ways to make this point. This came up back in the original agency discussions in 2020, IIRC, but an LM ought to be modeling tokens ‘beyond’ the immediate next token due to grammar, and due to the fact that text is generated by agents with long-range-correlation-inducing things like ‘plans’ or ‘desires’, which lead to planning and backwards chaining. If GPT-3 were truly not doing anything at all to infer future tokens, I’d expect its generated text to look much more incoherent than it does, painting itself into corners and sometimes being unable to find even a grammatical way out.
English may not be quite as infamous as German for requiring planning upfront to produce a sensible sentence, but there are still plenty of simple cases where you need to know what you are going to say before you can say it, such as with indefinite articles.
For example, consider the prompt “[prompt context omitted] This object is ”: presumably the continuation is ‘a X’ or ‘an X’. This is a purely syntactic, mechanical decision: the article token depends on, and is entirely determined by, the pronunciation of the next future word, and nothing else - so which is it? Well, that will depend on what X is more likely to be: a word starting with a vowel sound or not. If you do not know what X is, you simply cannot do better than a unigram-level prediction from the base frequencies of ‘a’/‘an’; only by knowing X can you ensure the article is consistent with it. Given the very high quality of GPT-3 text, it seems unlikely that GPT-3 is ignoring the prompt context and simply picking between ‘a’/‘an’ using the base-rate frequency in English; the log-probs (and unigram vs bigram comparisons) should reflect this.
...I was going to try some examples to show that ‘a’/‘an’ is being determined by the tokens after it, showing that GPT-3 must in some sense be non-myopically planning in order to keep itself consistent and maximize overall likelihood to some degree, but the OA Playground is erroring out repeatedly due to overload from ChatGPT tonight. Oy vey… Anyway, an example of what I am suggesting is: “The next exhibit in the zoo is a fierce predator from India, colored orange. The animal in the cage is ”; the answer is ‘a tiger’, and GPT-3 prefers ‘a’ to ‘an’ - even if you force it to ‘an’ (which it agilely dodges by identifying the animal instead as an ‘Indian tiger’), the logprobs remain unhappy about ‘an’ specifically. Conversely, we could ask for a vowel animal, and I tried “The next exhibit in the zoo is a clever great ape from Indonesia, colored orange. The animal in the cage is ”; this surprised me: GPT-3 was almost evenly split 55:45 between ‘a’/‘an’ (instead of either being 95:5 on base rates, or 5:95 because it correctly predicted the future token would be ‘orangutan’), but it completes with ‘orangutan’ either way! What’s going on? Apparently lots of people are uncertain whether you say ‘a orangutan’ or ‘an orangutan’, and while the latter seems to be correct, Google still pulls up plenty of hits for the former, including authorities like National Geographic, WWF, and Wikipedia, which would be overweighted in GPT-3’s training.
I find it difficult to tell any story about my tests here that excludes GPT-3 inferring the animal’s name (i.e. predicting tokens further in the future) in order to better predict the indefinite article it must emit immediately. Nothing in the training would encourage such myopia, and such myopia would obviously damage the training objective by making the model repeatedly screw up predictions of indefinite articles which a model doing non-myopic modeling would be able to predict easily. It is easy to improve on the base-rate prediction of ‘a’/‘an’ by thinking forward to the word that follows it; so, the model will.
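A minimal sketch of this kind of check, for anyone who wants to poke at it themselves; GPT-2 via HuggingFace transformers is assumed here as a freely available stand-in for GPT-3 (the prompts are the ones from the comment, but the exact numbers will of course differ by model):

```python
# Sketch: does the model's choice of ' a' vs ' an' shift with the (not yet
# written) animal name implied by the prompt? GPT-2 stands in for GPT-3.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompts = {
    "tiger hint": ("The next exhibit in the zoo is a fierce predator from India, "
                   "colored orange. The animal in the cage is"),
    "orangutan hint": ("The next exhibit in the zoo is a clever great ape from Indonesia, "
                       "colored orange. The animal in the cage is"),
    "no hint": "The animal in the cage is",   # rough stand-in for the base rate
}

# ' a' and ' an' are each single tokens in the GPT-2 BPE vocabulary.
a_id = tokenizer.encode(" a")[0]
an_id = tokenizer.encode(" an")[0]

for name, prompt in prompts.items():
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # distribution over the next token
    probs = torch.softmax(logits, dim=-1)
    p_a, p_an = probs[a_id].item(), probs[an_id].item()
    print(f"{name}: P(' a') = {p_a:.3f}, P(' an') = {p_an:.3f}, "
          f"ratio a:an = {p_a / p_an:.1f}")
```

Comparing against the hint-free prompt approximates the ‘base rate’ the comment contrasts with: if the a:an ratio barely moves when the tiger/orangutan hints are added, the model is plausibly ignoring the implied noun; if it swings, it is using it.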
How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior? Loss will depend on the prediction of ” a” or ” an”. It will also depend on the completion of “The object is an” or “The object is a”, depending on which appears in the current training sample. AFAICT the model will just optimize next token predictions, in both cases...?
How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior?
As I just finished explaining, the claim of myopia is that a model optimized for next-token prediction is only modeling the next token, and nothing else, because “it is just trained to predict the next token conditional on its input”. The claim of non-myopia is that the model will also be modeling additional future tokens beyond the next one, a capability induced by attempting to model the next token better. If myopia were true, GPT-3 would not be reasoning ‘the next token is “a”/“an”, but what is the token after that - is it talking about “tiger” or “orangutan”? - and then backwards-chaining to decide between “a”/“an”’, because the next token itself could not be “tiger” or “orangutan” (that would be ungrammatical). They are not the same thing, and I have given a concrete example both of what it would mean to model ‘a’/‘an’ myopically (predicting it based solely on the base rates of ‘a’ vs ‘an’) and shown that GPT-3 does not do so, but instead adjusts its prediction based on a single specific later token (‘tiger’ vs ‘orangutan’).*
If the idea that GPT-3 would be myopic strikes you as absurd, and you cannot believe anyone would believe anything as stupid as ‘GPT-3 just predicts the next token without attempting to predict relevant later tokens’ because natural language is so obviously saturated with long-range and reverse dependencies which myopia would ignore, and so predict the next token badly - then good! The ‘a’/‘an’ example works, and there is no need to bring in more elaborate hypothetical examples, like analyzing hapax legomena, or imagining encoding a text maze into a prompt and asking GPT-3 for the first step (which could only be done accurately by planning through the maze, finding the optimal trajectory, and then emitting the first step while throwing away the rest), where someone could reasonably wonder whether that is even possible, much less whether GPT-3 has actually learned any such thing.
* My example here is not perfect, because I had to change the wording a lot between the vowel and non-vowel versions, which muddies the waters a bit (maybe you could argue that phrases like ‘colored orange’ lead to an ‘a’ bias without anything recognizable as “inference of ‘tiger’” involved, and vice-versa for “clever great ape”/“orangutan”, as a sheer brute-force function of low-order English statistics); preferably you’d do something like an instruction-following setup, where the model is told that the vowel/non-vowel status of the final word switches based on a single artificial token at the beginning of the prompt, so there could be no such shortcut cheating. But in my defense, the Playground was almost unusable when I was trying to write this comment and I had to retry >5 times for each working completion, so I got what I got.
As I just finished explaining, the claim of myopia is that a model optimized for next-token prediction is only modeling the next token, and nothing else, because “it is just trained to predict the next token conditional on its input”. The claim of non-myopia is that the model will also be modeling additional future tokens beyond the next one, a capability induced by attempting to model the next token better.
These definitions are not equivalent to the ones we gave (and as far as I’m aware the definitions we use are much closer to commonly used definitions of myopia and non-myopia than the ones you give here).
Arthur is also entirely correct that your examples are not evidence of non-myopia by the definitions we use.
The definition of myopia that we use is that the model minimises loss on the next token and the next token alone; this is not the same as requiring that the model only ‘models’ / ‘considers information only directly relevant to’ the next token and the next token alone.
A model exhibiting myopic behaviour can still be great at the kinds of tasks you describe as requiring ‘modelling of future tokens’. The claim that some model was displaying myopic behaviour here would simply be that all of this ‘future modelling’ (or any other internal processing) is done entirely in service of minimising loss on just the next token. This is in contrast to the kinds of non-myopic models we are considering in this post—where the minimisation of loss over a multi-token completion encourages sacrificing some loss when generating early tokens in certain situations.
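To put that contrast in symbols (an informal sketch in my own notation, not the post’s; R is a hypothetical score over a whole sampled completion, e.g. a reward model):

```latex
% Myopic training: each term scores only the immediate next token,
% conditioned on the ground-truth prefix, so there is nothing to trade off.
\mathcal{L}_{\text{myopic}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}} \Big[ -\sum_t \log p_\theta(x_t \mid x_{<t}) \Big]

% One concrete non-myopic objective: optimise a score R over whole sampled
% completions y given a prompt c. A low-probability early token can now be
% worth emitting if it leads to a better-scoring completion overall.
\mathcal{L}_{\text{non-myopic}}(\theta)
  = \mathbb{E}_{\,c \sim \mathcal{D},\; y \sim p_\theta(\cdot \mid c)} \big[ -R(c, y) \big]
```

In the first objective every term is conditioned on the ground-truth prefix, so there is nothing to trade off between positions; in the second, sacrificing loss on early tokens can pay off over the completion as a whole.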
At the risk of being too vague to be understood… you can always factorise a probability distribution as P(X1)P(X2|X1) etc., so plain next-token prediction should be able to do the job; but maybe there’s a more natural “causal” factorisation that goes like P(subject)P(verb|subject) etc., which is not ordered the same way as the tokens but from which the token probabilities can be derived, and maybe that’s easier to learn than the raw next-token probabilities. I’ve no idea if this is what gwern meant.
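For the ‘a’/‘an’ case specifically, one way to make that alternative factorisation concrete (a sketch in my own notation, where c is the prompt context and X is the noun the writer is about to use):

```latex
% "Causal" factorisation: pick the noun first; the article is then (almost)
% deterministic given the noun's pronunciation.
P(\text{``an''} \mid c)
  = \sum_{X} P(X \mid c)\, P(\text{``an''} \mid X, c)
  \approx \sum_{X\ \text{starting with a vowel sound}} P(X \mid c)
```

A model that carries a good posterior P(X | c) (‘tiger’ vs ‘orangutan’) gets the article almost for free; a model that ignores X can only fall back on the base rate of ‘a’ vs ‘an’.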
I find your examples of base GPT-3 predicting indefinite articles for words like ‘tiger’ and ‘orangutan’ pretty interesting. I think I agree that these are evidence that the model is doing some modelling/inference of future tokens beyond the next immediate token.
However, this sort of future-token modelling still seems consistent with a safety-relevant notion of next-token myopia, because any inference that GPT-3 is doing of future tokens here still appears to be in the service of minimising loss on the immediate next token. Inferring ‘orangutan’ helps the model to better predict ‘an’, rather than indicating any kind of tendency to try and sacrifice loss on ‘an’ in order to somehow score better on ‘orangutan’. [1]
The former still leaves us with a model that is at least plausibly exempt from instrumental convergence. [2] Whereas the latter would seem to come from a model (or more likely a similarly-trained, scaled-up version of the model) that is at risk of developing instrumentally convergent tendencies, including perhaps deceptive alignment. So that’s why I am not too worried about the kind of future-token inference you are describing and still consider a model which does this kind of thing ‘myopic’ in the important sense of the word.
--
[1]: As I write this, I am questioning whether the explanation of myopia we gave in the “What is myopia?” section is totally consistent with what I am saying here. I should take another look at that section and see if it warrants a revision. (Update: No revision needed; the definitions we gave in the “What is myopia?” section are consistent with what I’m saying in this comment.)
[2]: However, the model could still be at risk of simulating an agent that has instrumentally convergent tendencies. But that seems like a different kind of risk to manage than the base model itself being instrumentally convergent.
Some Twitter discussion: https://twitter.com/saprmarks/status/1715100934936854691
Recent papers demonstrating that LLMs are not myopic and that you can extract predictions of tokens beyond the next token (a rough sketch of the underlying idea follows the list):
“Eliciting Latent Predictions from Transformers with the Tuned Lens”, Belrose et al 2023
“Jump to Conclusions: Short-Cutting Transformers With Linear Transformations”, Din et al 2023
“Future Lens: Anticipating Subsequent Tokens from a Single Hidden State”, Pal et al 2023
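As a rough flavour of what these ‘lens’-style methods are getting at, here is a sketch of the plain logit-lens trick on GPT-2 (decode each intermediate hidden state with the final unembedding); this is only the simple read-out that, roughly speaking, the cited papers build on and improve - Belrose et al and Din et al train translators/linear maps, and Pal et al predict tokens beyond the immediate next one from a single hidden state:

```python
# Sketch of the plain "logit lens": decode intermediate hidden states with the
# final layer norm + unembedding to see what the model is already predicting
# mid-network. Not the cited papers' trained methods; GPT-2 is assumed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The animal in the cage is an"
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

ln_f, lm_head = model.transformer.ln_f, model.lm_head
for layer, h in enumerate(out.hidden_states):   # embeddings + each block's output
    logits = lm_head(ln_f(h[0, -1]))             # decode the last position's state
    top = tokenizer.decode([logits.argmax().item()])
    print(f"layer {layer:2d}: top next-token guess = {top!r}")
```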