I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct?
Yes. For a base model. A tuned/RLHFed model, however, is doing something much closer to putting all its prediction mass on one token (‘flattened logits’), and this plays a large role in the particular weirdnesses of those models, especially as compared to the original base models (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the argmax token rather than spreading mass out proportionately, so if that one token isn’t 100% right, the whole process fails).
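To make the spread-mass-vs-argmax contrast concrete, here is a toy sketch; the logits are made up, and the low softmax temperature is only a stand-in for mode collapse, not a claim about what tuning literally does to the internals:

```python
# Toy sketch (numbers made up; a low softmax temperature is just a stand-in
# for mode collapse, not a description of what RLHF actually does to the logits).
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.5, 1.2, 0.3, -1.0])   # hypothetical next-token logits

base      = softmax(logits)                   # mass spread roughly in proportion
collapsed = softmax(logits, temperature=0.1)  # nearly all mass on the argmax token

print(np.round(base, 3))        # roughly [0.44, 0.27, 0.20, 0.08, 0.02]
print(np.round(collapsed, 3))   # roughly [0.99, 0.01, 0.00, 0.00, 0.00]

# If a 20-step plan/search needs the single argmax choice to be exactly right
# at every step, and it is only right 90% of the time, the chain usually dies:
print(0.9 ** 20)                # ~0.12
```

The last line is the compounding-failure point: a greedy process that needs its single top choice to be exactly right at every one of 20 steps rarely survives, whereas spreading mass keeps the ‘almost right’ continuations alive.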
Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition)
Hm, I don’t think base models would necessarily do that, no. I can see the tuning incentives pushing the tuned models to do so (eg. the characteristic waffling, non-commitment, and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they’re incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from ‘the future’); unless the real authors are themselves actively avoiding commitment, there’s no incentive to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over “author identity”, as the LLM is trying to infer whether ‘gwern’ or ‘eggsyntax’ wrote this LW comment. At each token, it maintains a latent encoding its certainty about the author identity, because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees, the more confident it becomes that the answer is ‘gwern’. But when I’m actually writing this, I have no uncertainty—I know perfectly well that ‘gwern’ is writing this, and not ‘eggsyntax’. I am not in any way trying to ‘avoid committing to one possible [author]’: the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I’m suffering from mental illness or psychedelics as I write this and might suddenly claim to be eggsyntax, as if the text were deliberately ambiguous because at any moment I might swerve from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is, to reduce ambiguity as much as possible, and expecting them to write in a normal, non-ambiguous fashion given whichever author it actually is.)
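To put that map/territory point in toy code: below is a sketch of the ‘latent author’ inference as a simple Bayesian filter over a single made-up stylistic feature (all probabilities hypothetical, and obviously not what a transformer literally computes). The author is fixed from the first token; only the reader’s posterior moves, and a predictor that infers-and-commits scores better than one that insists on keeping the authorship ‘in superposition’.

```python
# Toy sketch (my illustration; the feature probabilities are made up, and this
# is Bayesian filtering over one binary "style tell", not a real transformer).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chance per token that some stylistic tell shows up, per author.
p_feature = {"gwern": 0.6, "eggsyntax": 0.4}

true_author = "gwern"                          # the territory: fixed from the first token
posterior = {"gwern": 0.5, "eggsyntax": 0.5}   # the map: the reader starts uncertain

loss_infer = 0.0      # log-loss of a predictor that infers the author and commits
loss_ambiguous = 0.0  # log-loss of a predictor that insists on a fixed 50/50 mixture

for _ in range(200):
    # The real author just writes in their own style; nothing in the generation
    # process tries to "maintain superposition".
    observed = rng.random() < p_feature[true_author]

    # Each predictor's probability that the tell appears on this token.
    p_infer = sum(posterior[a] * p_feature[a] for a in posterior)
    p_ambig = 0.5 * p_feature["gwern"] + 0.5 * p_feature["eggsyntax"]

    loss_infer     += -np.log(p_infer if observed else 1 - p_infer)
    loss_ambiguous += -np.log(p_ambig if observed else 1 - p_ambig)

    # Bayesian update of the latent "who is writing this?".
    for a in posterior:
        posterior[a] *= p_feature[a] if observed else 1 - p_feature[a]
    z = sum(posterior.values())
    posterior = {a: p / z for a, p in posterior.items()}

print(posterior)                   # P(gwern) climbs toward 1 as evidence accumulates
print(loss_infer, loss_ambiguous)  # inferring-and-committing should score (slightly) better
```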