This is awesome! So far, I’m not seeing much engagement (in the comments) with most of the new ideas in this post, but I suspect this is due to its length and sprawling nature rather than a lack of interest. This post is a solid start on creating a common vocabulary and framework for thinking about LLMs.
I like the work you did on formalizing LLMs as a stochastic process, but I suspect that some of the exploration of the consequences is more distracting than helpful in an overview like this. In particular: 4.B, 4.C, 4.D, 4.E, 5.B, and 5.C. These results are mostly an enumeration of basic properties of finite-state Markov chains, rather than results that help with the analysis of LLMs in particular.
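For concreteness, here's roughly the Markov-chain view I have in mind, as a toy sketch (the vocabulary is made up, and the uniform distribution is just a stand-in for the model's conditional distribution): with a context window of K tokens, next-token sampling is a Markov chain whose states are the length-K token windows, which is where those basic properties come from.

```python
import random

# Toy sketch (not from the post): with a context window of K tokens, next-token
# sampling is a Markov chain whose states are the length-K token windows.
VOCAB = ["a", "b", "c"]
K = 2

def next_token_dist(state):
    # Stand-in for an LLM's conditional distribution p(token | state);
    # uniform here, purely for illustration.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def step(state):
    # One Markov transition: sample a next token, then slide the window.
    dist = next_token_dist(state)
    tokens, weights = zip(*dist.items())
    token = random.choices(tokens, weights=weights)[0]
    return state[1:] + (token,)

state = tuple(random.choices(VOCAB, k=K))
for _ in range(5):
    state = step(state)
    print(state)
```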
I am very excited to read your thoughts on the Preferred Decomposition Problem. Do you have thoughts on preferred decompositions of a premise into simulacra? There should likely be a distinction between μ-decomposition and s-decomposition (where, if I’m understanding correctly, s∈S ranges over premises, not simulacra, which is a bit confusing).
I suspect that, pragmatically, the choice of μ-decomposition should favor premises that neatly factor into simulacra, and that the different premises in a particular μ-decomposition should share simulacra. You mention something similar in 10.C, but in the context of human experts rather than simulacra.
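To make the question concrete, here's a hedged sketch of the kind of decomposition I have in mind (the component names, weights, and mixture rule are all mine, not the post's): μ as a weighted mixture over premises, where each premise is itself a mixture over simulacra that it shares with other premises.

```python
# Hedged sketch of what I mean by a decomposition (the component names, weights,
# and mixture rule are mine, not the post's): mu as a weighted mixture over
# premises, where each premise is itself a mixture over simulacra.

def mixture(components, weights):
    """Combine next-token distributions as a weighted mixture."""
    out = {}
    for dist, w in zip(components, weights):
        for token, p in dist.items():
            out[token] = out.get(token, 0.0) + w * p
    return out

# Hypothetical simulacra: each is a next-token distribution over a toy vocabulary.
helpful_assistant = {"yes": 0.7, "no": 0.3}
contrarian = {"yes": 0.2, "no": 0.8}

# Premises reuse the same simulacra with different weights, which is the
# kind of sharing I'm suggesting a preferred decomposition should favor.
premise_forum_thread = mixture([helpful_assistant, contrarian], [0.5, 0.5])
premise_faq_page = mixture([helpful_assistant, contrarian], [0.9, 0.1])

# A mu-decomposition: mu itself as a mixture over premises.
mu = mixture([premise_forum_thread, premise_faq_page], [0.4, 0.6])
print(mu)
```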
On a separate note, I think that μ∞ is confusing notation, for two reasons:
1. At first glance, I would guess that it refers to μ with an infinite context window length.
2. The notation doesn’t include a reference to the underlying data set. If I’m reading it right, μ∞ isn’t a universally optimal LLM; it is an optimal LLM w.r.t. a particular corpus C.
Thanks for writing this up. I think that you’ll see a lot more discussion on smaller posts.
I think my definition of μ∞ is correct. It’s designed to abstract away all the messy implementation details of the ML architecture and ML training process.
Now, you can easily amend the definition to allow an infinite context window length k. In fact, if you let k > N, that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like when k = ∞: when the context window is infinite (or very large), the internet corpus consists of a single datapoint.
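To illustrate that last point, here's a toy sketch (the corpus, tokenization, and the use of an empirical k-gram conditional as a stand-in for μ∞ are mine, simplified for illustration): once k is on the order of the corpus length N, each context occurs at most once, so there's only a single datapoint for "optimal inference" to work with.

```python
from collections import Counter, defaultdict

# Toy sketch: an empirical k-gram conditional distribution as a simplified
# stand-in for mu_infinity; the corpus and tokenization are made up.
corpus = "the cat sat on the mat".split()
N = len(corpus)

def empirical_conditional(k):
    """Empirical next-token counts given the preceding k tokens."""
    counts = defaultdict(Counter)
    for i in range(k, N):
        context = tuple(corpus[i - k:i])
        counts[context][corpus[i]] += 1
    return counts

# Small k: contexts repeat, so the conditionals carry usable statistics.
print(len(empirical_conditional(1)))      # 4 distinct one-token contexts

# k close to N: each context occurs at most once, i.e. the corpus collapses
# to a single datapoint and it's unclear what optimal inference means.
print(len(empirical_conditional(N - 1)))  # exactly 1 context of length N-1
```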