Thanks, I found the post quite stimulating. Some questions and thoughts:
Is LLM dynamics ergodic? That is, is the time average $P_\infty$ equal to $\lim_{N\to\infty}\frac{1}{N}\sum_n^N \pi_n(0)$, the average page vector?
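(For concreteness, here is a minimal numerical sketch of what I mean, using a made-up 3-state "page-level" transition matrix rather than anything derived from an actual LLM: for an ergodic chain, the time average of the page vectors converges to the stationary distribution.)

```python
import numpy as np

# Toy 3-state "page-level" transition matrix (made up for illustration).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
pi0 = np.array([1.0, 0.0, 0.0])   # initial page vector

# Time average (1/N) * sum_n pi_n, where pi_{n+1} = pi_n P.
N = 10_000
pi_n = pi0.copy()
time_avg = np.zeros_like(pi0)
for _ in range(N):
    pi_n = pi_n @ P
    time_avg += pi_n / N

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()

print(time_avg.round(4))     # for an ergodic chain, these two agree
print(stationary.round(4))
```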
One potential issue with this formalisation is that you always assume a prompt of size k (so you need to introduce artificial “null tokens” if the prompt is shorter), and you don’t give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable-length, token-level Markov chains (terminating at <|endoftext|>). While a fixed block size is actually used during training, the LLM is incentivised to disregard anything before <|endoftext|>. So these two prompts should induce the same distribution:
Document about cats.<|endoftext|>My name is;
Document about dogs.<|endoftext|>My name is.
Your formalisation doesn’t account for this symmetry.
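As a toy illustration of the symmetry (the truncation helper below is hypothetical, not how any particular model actually works): if the model conditions only on the suffix after the last <|endoftext|>, both prompts reduce to the same effective context.

```python
EOT = "<|endoftext|>"

def effective_context(prompt: str) -> str:
    """Hypothetical helper: keep only the text after the last <|endoftext|>."""
    return prompt.rsplit(EOT, 1)[-1]

p1 = "Document about cats.<|endoftext|>My name is"
p2 = "Document about dogs.<|endoftext|>My name is"

# Both prompts reduce to the same effective context, so under the symmetry
# argued above they should induce the same next-token distribution.
assert effective_context(p1) == effective_context(p2) == "My name is"
```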
Dennett is spelled with “tt”.
Note that a softmax-based LLM will always put non-zero probability on every token. So there are no strictly absorbing states. You’re careful enough to define absorbing states as “once you enter, you are unlikely to ever leave”, but then your toy Waluigi model is implausible. A Waluigi can always switch back to a Luigi.
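To make the first point concrete, here is a small sketch (made-up logits and a standalone softmax, not any particular model's output head) showing that softmax assigns strictly positive probability to every token at any finite temperature:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                 # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([10.0, -5.0, -20.0])   # made-up logits strongly favouring token 0
probs = softmax(logits)
print(probs)                     # every entry is > 0, however tiny
assert (probs > 0).all()         # no token ever gets exactly zero probability
```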
Almost certainly ergodic in the limit. But it’s highly periodic due to English grammar.
Yep, just for convenience.
Yep.
Temp = 0 would give exactly absorbing states.
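A quick sketch of that limit (made-up logits): as the temperature goes to 0, the temperature-scaled softmax concentrates all mass on the argmax, i.e. greedy decoding, so the state genuinely absorbs.

```python
import numpy as np

def softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]                  # made-up logits
for T in [1.0, 0.1, 0.01]:
    print(T, softmax(logits, T).round(4))
# As T -> 0 the distribution approaches the one-hot vector [1, 0, 0]
# (greedy / argmax decoding), so an "absorbing" state really absorbs.
```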