This is awesome! So far, I’m not seeing much engagement (in the comments) with most of the new ideas in this post, but I suspect this is due to its length and sprawling nature rather than a lack of interest. This post is a solid start on creating a common vocabulary and framework for thinking about LLMs.
I like the work you did on formalizing LLMs as a stochastic process, but I suspect that some of the exploration of the consequences is more distracting than helpful in an overview like this. In particular: 4.B, 4.C, 4.D, 4.E, 5.B, and 5.C. These results are mostly an enumeration of basic properties of finite-state Markov chains, rather than results that help with the analysis of LLMs in particular.
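For concreteness, here's roughly the Markov-chain view I have in mind, as a toy sketch (the vocabulary is made up, and the uniform distribution is just a stand-in for the model's conditional distribution): with a context window of K tokens, next-token sampling is a Markov chain whose states are the length-K token windows, which is where those basic properties come from.

```python
import random

# Toy sketch (not from the post): with a context window of K tokens, next-token
# sampling is a Markov chain whose states are the length-K token windows.
VOCAB = ["a", "b", "c"]
K = 2

def next_token_dist(state):
    # Stand-in for an LLM's conditional distribution p(token | state);
    # uniform here, purely for illustration.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def step(state):
    # One Markov transition: sample a next token, then slide the window.
    dist = next_token_dist(state)
    tokens, weights = zip(*dist.items())
    token = random.choices(tokens, weights=weights)[0]
    return state[1:] + (token,)

state = tuple(random.choices(VOCAB, k=K))
for _ in range(5):
    state = step(state)
    print(state)
```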
I am very excited to read your thoughts on the Preferred Decomposition Problem. Do you have thoughts on preferred decompositions of a premise into simulacra? There should likely be a distinction between μ-decomposition and s-decomposition (where, if I’m understanding correctly, s∈S ranges over premises, not simulacra, which is a bit confusing).
I suspect that, pragmatically, the choice of μ-decomposition should favor premises that neatly factor into simulacra, and that the different premises in a particular μ-decomposition should share simulacra. You mention something similar in 10.C, but in the context of human experts rather than simulacra.
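To make the question concrete, here's a hedged sketch of the kind of decomposition I have in mind (the component names, weights, and mixture rule are all mine, not the post's): μ as a weighted mixture over premises, where each premise is itself a mixture over simulacra that it shares with other premises.

```python
# Hedged sketch of what I mean by a decomposition (the component names, weights,
# and mixture rule are mine, not the post's): mu as a weighted mixture over
# premises, where each premise is itself a mixture over simulacra.

def mixture(components, weights):
    """Combine next-token distributions as a weighted mixture."""
    out = {}
    for dist, w in zip(components, weights):
        for token, p in dist.items():
            out[token] = out.get(token, 0.0) + w * p
    return out

# Hypothetical simulacra: each is a next-token distribution over a toy vocabulary.
helpful_assistant = {"yes": 0.7, "no": 0.3}
contrarian = {"yes": 0.2, "no": 0.8}

# Premises reuse the same simulacra with different weights, which is the
# kind of sharing I'm suggesting a preferred decomposition should favor.
premise_forum_thread = mixture([helpful_assistant, contrarian], [0.5, 0.5])
premise_faq_page = mixture([helpful_assistant, contrarian], [0.9, 0.1])

# A mu-decomposition: mu itself as a mixture over premises.
mu = mixture([premise_forum_thread, premise_faq_page], [0.4, 0.6])
print(mu)
```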
On a separate note, I think that μ∞ is confusing notation, for two reasons:
1. At first glance, I would guess that it refers to μ with an infinite context window length.
2. The notation doesn’t include a reference to the underlying data set. If I’m reading it right, μ∞ isn’t a universally optimal LLM; it is an optimal LLM w.r.t. a particular corpus C.
Thanks for writing this up. I think that you’ll see a lot more discussion on smaller posts.
I think my definition of μ∞ is correct. It’s designed to abstract away all the messy implementation details of the ML architecture and ML training process.
Now, you can easily amend the definition to allow an infinite context window length k. In fact, if you let k > N, that's essentially an infinite context window. But it's unclear what optimal inference is supposed to look like when k = ∞: when the context window is infinite (or very large), the internet corpus consists of a single datapoint.
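To illustrate that last point, here's a toy sketch (the corpus, tokenization, and the use of an empirical k-gram conditional as a stand-in for μ∞ are mine, simplified for illustration): once k is on the order of the corpus length N, each context occurs at most once, so there's only a single datapoint for "optimal inference" to work with.

```python
from collections import Counter, defaultdict

# Toy sketch: an empirical k-gram conditional distribution as a simplified
# stand-in for mu_infinity; the corpus and tokenization are made up.
corpus = "the cat sat on the mat".split()
N = len(corpus)

def empirical_conditional(k):
    """Empirical next-token counts given the preceding k tokens."""
    counts = defaultdict(Counter)
    for i in range(k, N):
        context = tuple(corpus[i - k:i])
        counts[context][corpus[i]] += 1
    return counts

# Small k: contexts repeat, so the conditionals carry usable statistics.
print(len(empirical_conditional(1)))      # 4 distinct one-token contexts

# k close to N: each context occurs at most once, i.e. the corpus collapses
# to a single datapoint and it's unclear what optimal inference means.
print(len(empirical_conditional(N - 1)))  # exactly 1 context of length N-1
```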