I think a key idea related to this topic, and not yet mentioned in the comments (maybe because it is elementary?), is the probabilistic chain rule. This basic “theorem” of probability shows that, in our case, the procedure of always sampling the next word conditioned on the previous words is mathematically equivalent to sampling from the joint probability distribution of complete human texts. To me this almost fully explains why LLMs’ outputs seem to have been generated with global information in mind. What is missing is to see why our intuition about “merely” generating the next token differs from sampling from the joint distribution. My guess is that humans instinctively (but incorrectly) attach directional causality to conditional probability, and because of this it surprises us when we see dependencies running in the opposite direction in the generated text.
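To make the chain rule concrete: it says p(x_1, …, x_n) = ∏_t p(x_t | x_1, …, x_{t-1}), so sampling each token from its conditional given the prefix is the same as drawing a whole sequence from the joint. Here is a minimal toy sketch of that equivalence; the “texts” and probabilities are made up purely for illustration and have nothing to do with any actual model.

```python
import random
from collections import Counter, defaultdict

# Toy joint distribution over complete "texts" (tuples of tokens).
# These texts and probabilities are invented for illustration only.
joint = {
    ("the", "cat", "sat"): 0.4,
    ("the", "cat", "ran"): 0.2,
    ("the", "dog", "ran"): 0.3,
    ("a", "dog", "sat"): 0.1,
}

def sample_joint():
    """Sample a complete text directly from the joint distribution."""
    texts, probs = zip(*joint.items())
    return random.choices(texts, weights=probs)[0]

def sample_chain():
    """Sample token by token from p(x_t | x_<t), derived from the same joint."""
    prefix = ()
    for t in range(3):
        # Conditional distribution over the next token given the prefix:
        # sum the joint probability of every text consistent with the prefix.
        cond = defaultdict(float)
        for text, p in joint.items():
            if text[:t] == prefix:
                cond[text[t]] += p
        tokens, weights = zip(*cond.items())
        prefix += (random.choices(tokens, weights=weights)[0],)
    return prefix

# Empirically, both samplers induce the same distribution over full texts.
n = 100_000
print(Counter(sample_joint() for _ in range(n)))
print(Counter(sample_chain() for _ in range(n)))
```

Running this, the two frequency counts match up to sampling noise, which is the chain rule in action: next-token sampling never “loses” the global structure encoded in the joint.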
EDIT: My comment concerns transformer architectures; I don’t yet know how RLHF works.
Yeah, but all sorts of elementary things elude me. So thanks for the info.