The autoregressive language model μ: Tᵏ → Δ(T), which maps a prompt x ∈ Tᵏ to a distribution over tokens μ(⋅|x) ∈ Δ(T).
Tᵏ should actually be T∗; I think you mean "the set of all strings constructed from the alphabet of tokens" and not "the set of all length-k strings constructed from the alphabet of tokens"?
You used the former meaning earlier, in Remark 1:
Let T be the set of possible tokens in our vocabulary. A language model (LLM) is given by a stochastic function μ: T∗ → Δ(T) mapping a prompt (t₁ … tₖ) to a predicted token tₖ₊₁.
I guess in this formalism you'd need to treat the empty string (or a similar null token) as a valid token, so the prompt/completion is prefixed/suffixed with empty strings to pad to the size of the context window.
Otherwise, you'd need to define the domain as the union of the sets of all strings with token length ≤ the context window, i.e. ⋃_{n ≤ k} Tⁿ.
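To make the two options concrete, here's a minimal illustrative sketch (all names and the window size are mine, not from the draft): option 1 keeps the domain as Tᵏ by padding short prompts with a null token; option 2 instead admits any prompt of length ≤ k.

```python
# Illustrative sketch (hypothetical names): two ways to handle prompts
# shorter than the context window k.

NULL = ""  # a designated null/padding token, assumed to be an element of T
K = 8      # context window size (arbitrary, for illustration only)

def pad_to_window(prompt, k=K, null=NULL):
    """Option 1: keep the domain as T^k by left-padding with null tokens."""
    if len(prompt) > k:
        raise ValueError("prompt exceeds context window")
    return [null] * (k - len(prompt)) + list(prompt)

def in_union_domain(prompt, k=K):
    """Option 2: take the domain to be the union of T^0, T^1, ..., T^k."""
    return len(prompt) <= k

tokens = ["the", "cat", "sat"]
padded = pad_to_window(tokens)
assert len(padded) == K and padded[-3:] == tokens
assert in_union_domain(tokens)
```

Either choice gives μ a well-defined domain; padding just trades the union over lengths for a single fixed-length product.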
Finite context window.
Realised later on, thanks.