[Question] Newbie questions about information theory and transformers

I have some very naive (and unfortunately sort of stream-of-consciousness) questions brought up by this post:


https://www.lesswrong.com/posts/vAAneYowLkaHnihCg/textbooks-are-all-you-need

I definitely know that my ideas are poorly formed and I have no idea what I’m talking about. I’m happy to be told that my whole picture here is fundamentally flawed, or that there’s some better question that I should be asking, or that a combination of words I’ve thrown together is gibberish. But I’m trying to avoid worrying about asking dumb things when I’m curious, so I’m just going to put them out there.

Is there much I can read on the relationship between information theory and transformers/LLMs?

I keep trying to picture the flow of information in transformers. Is the information being “distilled”?

I’m imagining the weights and biases of the model to be some kind of distilled and encoded version of the training data. Does that make any sense? Since it can’t store all of it, what is it doing? Incorporating more information from the tokens in the training data that have the highest entropy (in the sense of how much seeing each one reduces the average uncertainty about the correct next token)?
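To make the “highest entropy” part concrete, here’s a tiny toy sketch of what I think I mean (my own illustration, not anything from the linked post): the standard language-modelling loss is just the average surprisal, -log p(correct next token), so the tokens the model currently finds most surprising are the ones carrying the most new information for it.

```python
import numpy as np

def surprisal(predicted_probs: np.ndarray, correct_token: int) -> float:
    """Bits of surprise the model assigns to the actual next token."""
    return float(-np.log2(predicted_probs[correct_token]))

# Hypothetical 4-token vocabulary and two different model predictions.
probs_confident = np.array([0.90, 0.05, 0.03, 0.02])
probs_uncertain = np.array([0.25, 0.25, 0.25, 0.25])

print(surprisal(probs_confident, 0))  # ~0.15 bits: the token tells the model little it didn't already know
print(surprisal(probs_uncertain, 0))  # 2.0 bits: much more information for the model to absorb
```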

If so, is this driven by gradient descent? Does minimizing the cost function end up encoding more information into the weights from the training-set tokens that carry the most entropy?
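Here’s another toy sketch of the gradient-descent side of the question (again just my own illustration, assuming the usual softmax + cross-entropy setup): the gradient of the loss with respect to the output logits is (predicted probabilities minus a one-hot vector on the correct token), so the tokens the model is most surprised by are the ones that push the weights around hardest on each update.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logit_gradient(logits: np.ndarray, correct_token: int) -> np.ndarray:
    """d(cross-entropy)/d(logits) = predicted_probs - one_hot(correct_token)."""
    grad = softmax(logits)
    grad[correct_token] -= 1.0
    return grad

logits = np.array([2.0, 0.5, 0.1, -1.0])
print(logit_gradient(logits, 0))  # model was already right: small gradient
print(logit_gradient(logits, 3))  # model was surprised: large gradient, bigger weight update
```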

Does that information get further distilled into the tokens it generates from a prompt? Is there some way in which having the model generate a bunch of new training data is “unencoding” the information about the original training set stored in the weights, and in doing so producing a new, smaller training set that represents the highest entropy information from the original?
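The crudest toy version of the “decompress the weights into a smaller training set” picture I can come up with looks like this (everything here is invented for illustration): treat a fixed probability distribution as the “teacher” standing in for the trained model, sample a small synthetic corpus from it, and let a “student” estimate that distribution just by counting. The KL divergence between teacher and student then gives one rough measure of how much of the teacher’s information made it into the synthetic set.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 8
teacher = rng.dirichlet(np.ones(vocab_size))  # stand-in for the trained model's distribution

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Bits of information lost when q is used to approximate p."""
    return float(np.sum(p * np.log2(p / q)))

for n_synthetic_tokens in (100, 1_000, 10_000):
    samples = rng.choice(vocab_size, size=n_synthetic_tokens, p=teacher)
    counts = np.bincount(samples, minlength=vocab_size) + 1  # add-one smoothing
    student = counts / counts.sum()
    print(n_synthetic_tokens, round(kl_divergence(teacher, student), 4))
```

In this toy picture the synthetic corpus really is a lossy “unencoding” of the teacher, and the divergence shrinks as the corpus grows, which is roughly the trade-off I’m asking about.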

Does this improve quality because it’s letting the network try “reconnecting” itself when it’s re-encoded into a new model trained on that synthetic data? Could this help fix things that were stuck in a local minimum in the original model?

When the LLM selects tokens from the large set, does that choice of tokens encode information from that LLM’s training in addition to the information from the selected tokens themselves?

If we start caring a lot about the ability of an LLM to select the best tokens from a huge dataset and generate a complementary set of synthetic data on top of its general functionality, would this create some kind of selective pressure under which models could evolve? Like… the reward function being based on the improvement in loss from the LLM to its “offspring”, and the cost function being based on how big the chosen training set was or how many FLOPs went into training the offspring.
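To pin down the reward/cost trade-off I’m imagining, here’s a purely hypothetical sketch (the function name, coefficients, and numbers are all made up): reward the drop in loss from parent to offspring, penalized by how many tokens and FLOPs the offspring’s training consumed.

```python
def selection_reward(parent_loss: float,
                     offspring_loss: float,
                     offspring_train_tokens: float,
                     offspring_train_flops: float,
                     token_cost: float = 1e-11,
                     flop_cost: float = 1e-21) -> float:
    """Improvement from parent to offspring, minus a penalty for training cost."""
    improvement = parent_loss - offspring_loss
    cost = token_cost * offspring_train_tokens + flop_cost * offspring_train_flops
    return improvement - cost

# e.g. parent loss 2.10 -> offspring loss 2.02, offspring trained on 5e8 tokens and 1e19 FLOPs
print(selection_reward(2.10, 2.02, 5e8, 1e19))  # positive: this "offspring" would be selected for
```

The specific cost coefficients obviously matter a lot here; I only mean it as a way of stating the question, not a proposal for actual values.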