Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not “produce a response”: they produce a prediction of ‘the next response’. The logits will, of course, try to express the posterior, averaging across all of the possibilities. This is what the mixture is: there are many different meanings which are still possible, and you’re not sure which one is ‘true’, but each of them has its own posterior probability by this point, and you hedge your bets as to the exact next token, as incentivized by a proper scoring rule, which encourages you to report the posterior probability because that is the output which minimizes your loss. (A hypothetical agent may be trying to produce a response, but so too do all of the other hypothetical agents which are live hypotheses at that point.) Or it might be clearer to say: it produces predictions of all of the good-sounding responses, but never produces any single response.
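(A toy sketch of the proper-scoring-rule point, with made-up numbers: under log loss, reporting the full posterior over next tokens strictly beats piling mass onto the single most likely one.)

```python
import numpy as np

# Made-up 'true' posterior over 4 candidate next tokens.
true_posterior = np.array([0.5, 0.3, 0.15, 0.05])

def expected_log_loss(reported, true=true_posterior):
    """Expected cross-entropy when tokens are actually drawn from `true`
    but the predictor reports `reported`."""
    return -np.sum(true * np.log(reported))

hedged = expected_log_loss(true_posterior)                      # report the posterior itself
all_in = expected_log_loss(np.array([0.97, 0.01, 0.01, 0.01]))  # go all-in on the mode

assert hedged < all_in  # log loss (a proper scoring rule) rewards honest hedging
```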
Everything after that prediction, like picking a single, discrete, specific logit and ‘sampling’ it to fake ‘the next token’, is outside the LLM’s purview except insofar as it’s been trained on outputs from such a sampling process and has now learned that’s one of the meanings mixed in. (When Llama-3-405b is predicting the mixture of meanings of ‘the next token’, it knows ChatGPT or Claude could be the LLM writing it and predicts accordingly, but it doesn’t have anything really corresponding to “I, Llama-3-405b, am producing the next token by Boltzmann temperature sampling at x temperature”. It has a hazy idea what ‘temperature’ is from the existing corpus, and it can recognize when a base model—itself—has been sampled from and produced the current text, but it lacks the direct intuitive understanding implied by “produce a response”.) Hence all of the potential weirdness when you hardwire the next token repeatedly and feed it back in, and it becomes ever more ‘certain’ of what the meaning ‘really’ is, or it starts observing that the current text looks produced-by-a-specific-sampling-process rather than produced-by-a-specific-human, etc.
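(To make the division of labor concrete, here is a minimal sketch of the harness around a model; the ‘model’ here is just a stand-in function returning fixed made-up logits, but the point is that the temperature-sampling step lives entirely outside it.)

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_pass(context):
    """Stand-in for the LLM proper: context in, logits over the vocab out.
    (Fixed made-up logits; a real model's job ends here.)"""
    return np.array([2.0, 1.0, 0.5, -1.0])

def sample_next_token(logits, temperature=1.0):
    """Everything below is the outer harness, not the model: softmax the
    logits at some temperature and draw one 'next token' from the mixture."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = forward_pass("some prompt")    # the model's real output: a distribution
token = sample_next_token(logits, 0.8)  # the discrete 'response' is faked out here
```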
Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not “produce a response”: they produce a prediction of ‘the next response’.
Absolutely! In the comment you’re responding to I nearly included a link to ‘Role-Play with Large Language Models’; the section there on playing 20 questions with a model makes that distinction really clear and intuitive in my opinion.
there are many different meanings which are still possible, and you’re not sure which one is ‘true’, but each of them has its own posterior probability by this point, and you hedge your bets as to the exact next token
Just for clarification, I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct? Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition), and I thought I remembered seeing evidence that they don’t do that.
I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct?
Yes. For a base model. A tuned/RLHFed model, however, is doing something much closer to that (‘flattened logits’), and this plays a large role in the particular weirdnesses of those models, especially as compared to the originals (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the arg-max token rather than trying to spread mass out proportionately, and so if that one token isn’t 100% right, the process will fail).
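(A back-of-the-envelope illustration of why collapsing onto the arg-max hurts multi-step processes; the numbers are made up.)

```python
# If the single most likely token is only right 90% of the time at each
# step, a process that always commits to it completes a 20-step chain
# correctly only ~12% of the time; keeping mass spread across the
# alternatives (e.g. sampling several continuations) is what preserves
# any chance of recovering the other ~88% of paths.
p_argmax_correct = 0.9
steps = 20
p_greedy_chain_ok = p_argmax_correct ** steps
print(round(p_greedy_chain_ok, 3))  # 0.122
```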
Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition)
Hm, I don’t think base models would necessarily do that, no. I can see the tuned models having the incentives to train them to do so (eg. the characteristic waffle and non-commitment and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they’re incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from ‘the future’); unless real agents are actively avoiding commitment, there’s no incentive there to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over “author identity”, as the LLM is trying to infer whether ‘gwern’ or ‘eggsyntax’ wrote this LW comment. At each token, it maintains a latent about its certainty of the author identity, because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees for the prediction, the more confident it becomes that the answer is ‘gwern’. But when I’m actually writing this, I have no uncertainty: I know perfectly well ‘gwern’ is writing this, and not ‘eggsyntax’. I am not in any way trying to ‘avoid committing to one possible [author]’: the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I’m suffering from mental illness or psychedelics as I write this and thus might suddenly spontaneously claim to be eggsyntax, or that this text is deliberately ambiguous because at any moment I might be swerving from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is, to reduce ambiguity as much as possible, and expecting them to write in a normal non-ambiguous fashion given whichever author it actually is.)
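(The reader-side inference can be sketched as a running Bayes update. The per-token likelihoods below are made up, but they show how the reader’s posterior over the author sharpens token by token, while the author-side ‘distribution’ was degenerate all along.)

```python
import numpy as np

authors = ["gwern", "eggsyntax"]
reader_posterior = np.array([0.5, 0.5])  # the reader starts out genuinely unsure
likelihoods = np.array([0.6, 0.4])       # made-up P(token | author), per observed token

for _ in range(10):                      # read ten tokens, updating each time
    reader_posterior = reader_posterior * likelihoods  # unnormalized Bayes update
    reader_posterior /= reader_posterior.sum()

# The author never had any uncertainty to resolve: it was gwern the whole time.
author_posterior = np.array([1.0, 0.0])

print(dict(zip(authors, reader_posterior.round(3))))
```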