Somewhat of an oversimplification below, but
At each position in a vision model, you are trying to transform points in a continuous 3-dimensional space (RGB) to and from the model representation. That is, to embed a pixel you go $c^3 \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to c^3$, where $c \in \mathbb{R}$ and $0 \le c < 2^{\text{color\_depth\_in\_bits}}$.
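Concretely, something like this (a minimal PyTorch sketch; `d_model` and the layer names are made up, and real vision models usually embed patches rather than single pixels):

```python
import torch
import torch.nn as nn

d_model = 512  # hypothetical model width

# Per-pixel embed: R^3 -> R^d_model. A 1x1 conv is just a per-position linear map.
embed = nn.Conv2d(in_channels=3, out_channels=d_model, kernel_size=1)
# Per-pixel unembed: R^d_model -> R^3.
unembed = nn.Conv2d(in_channels=d_model, out_channels=3, kernel_size=1)

pixels = torch.rand(1, 3, 32, 32)   # a batch of RGB images, channel values in [0, 1)
hidden = embed(pixels)              # shape (1, d_model, 32, 32)
reconstructed = unembed(hidden)     # shape (1, 3, 32, 32)
```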
In a language model, you are trying to transform 100,000-dimensional categorical data to and from the model representation. That is, to embed a token you go $t \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{d_\text{vocab}}$, where $t \in \mathbb{Z}$ and $0 \le t < d_\text{vocab}$. For embedding, you can think of it as a 1-hot encoding $t \to \mathbb{R}^{d_\text{vocab}}$ followed by a linear map $\mathbb{R}^{d_\text{vocab}} \to \mathbb{R}^{d_\text{model}}$, though in practice you just index into a tensor of shape `(d_vocab, d_model)`, because 1-hot encoding and then multiplying is a waste of memory and compute. So you can think of a language model as having 100,000 “channels”, which encode “the token is `the`” / “the token is `Bob`” / “the token is `|`”.
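A minimal sketch of that equivalence (PyTorch; the sizes and variable names here are just placeholders):

```python
import torch
import torch.nn.functional as F

d_vocab, d_model = 100_000, 512            # hypothetical sizes
W_embed = torch.randn(d_vocab, d_model)    # embedding matrix of shape (d_vocab, d_model)

tokens = torch.tensor([17, 42, 99_999])    # token ids, each 0 <= t < d_vocab

# "Conceptual" version: 1-hot encode, then multiply.
one_hot = F.one_hot(tokens, num_classes=d_vocab).float()  # (3, d_vocab)
embedded_slow = one_hot @ W_embed                          # (3, d_model)

# What is actually done: index into the embedding matrix.
embedded_fast = W_embed[tokens]                            # (3, d_model)

assert torch.allclose(embedded_slow, embedded_fast)

# Unembed: R^d_model -> R^d_vocab, i.e. one logit per "channel"/token.
W_unembed = torch.randn(d_model, d_vocab)
logits = embedded_fast @ W_unembed                         # (3, d_vocab)
```

The rows of `W_embed` are exactly the per-token embeddings, so indexing and the 1-hot matmul give identical results.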
Yeah, I was playing around with using a VAE to compress the logits output from a language transformer. I did indeed settle on treating the vocab size (e.g. 100,000) as the ‘channels’.
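Roughly the shape of what I mean (a hypothetical sketch with made-up sizes, not the actual code): the logits take the place of an image, with the vocab dimension as the channel dimension, and the encoder convolves over the sequence dimension.

```python
import torch
import torch.nn as nn

d_vocab, seq_len, d_latent = 100_000, 128, 64    # made-up sizes

# Treat the vocab dimension of the logits as the "channel" dimension,
# analogous to RGB channels in an image, and compress along the sequence.
logits = torch.randn(1, d_vocab, seq_len)        # (batch, channels=d_vocab, seq)

# In practice you would want something cheaper than a dense 100k-channel conv;
# this is only to illustrate the vocab-as-channels framing.
encoder = nn.Sequential(
    nn.Conv1d(in_channels=d_vocab, out_channels=512, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(512, 2 * d_latent, kernel_size=3, padding=1),  # means and log-variances
)

mu, logvar = encoder(logits).chunk(2, dim=1)          # each (1, d_latent, seq_len)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
```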