Somewhat of an oversimplification below, but
At each position in a vision model, you are trying to transform points in a continuous 3-dimensional space (RGB) to and from the model representation. That is, to embed a pixel you go $c^3 \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to c^3$, where $c \in \mathbb{R}$ and $0 \le c < 2^{\text{color\_depth\_in\_bits}}$.
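Concretely, something like this (a minimal PyTorch sketch; `d_model` and the layer names are made up, and real vision models usually embed patches rather than single pixels):

```python
import torch
import torch.nn as nn

d_model = 512  # hypothetical model width

# Per-pixel embed: R^3 -> R^d_model. A 1x1 conv is just a per-position linear map.
embed = nn.Conv2d(in_channels=3, out_channels=d_model, kernel_size=1)
# Per-pixel unembed: R^d_model -> R^3.
unembed = nn.Conv2d(in_channels=d_model, out_channels=3, kernel_size=1)

pixels = torch.rand(1, 3, 32, 32)   # a batch of RGB images, channel values in [0, 1)
hidden = embed(pixels)              # shape (1, d_model, 32, 32)
reconstructed = unembed(hidden)     # shape (1, 3, 32, 32)
```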
In a language model, you are trying to transform 100,000-dimensional categorical data to and from the model representation. That is, to embed a token you go $t \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{d_\text{vocab}}$, where $t \in \mathbb{Z}$ and $0 \le t < d_\text{vocab}$. For embedding, you can think of it as a 1-hot encoding $t \to \mathbb{R}^{d_\text{vocab}}$ followed by a linear map $\mathbb{R}^{d_\text{vocab}} \to \mathbb{R}^{d_\text{model}}$, though in practice you just index into a tensor of shape `(d_vocab, d_model)`, because 1-hot encoding and then multiplying is a waste of memory and compute. So you can think of a language model as having 100,000 “channels”, which encode “the token is `the`” / “the token is `Bob`” / “the token is `|`”.
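A minimal sketch of that equivalence (PyTorch; the sizes and variable names here are just placeholders):

```python
import torch
import torch.nn.functional as F

d_vocab, d_model = 100_000, 512            # hypothetical sizes
W_embed = torch.randn(d_vocab, d_model)    # embedding matrix of shape (d_vocab, d_model)

tokens = torch.tensor([17, 42, 99_999])    # token ids, each 0 <= t < d_vocab

# "Conceptual" version: 1-hot encode, then multiply.
one_hot = F.one_hot(tokens, num_classes=d_vocab).float()  # (3, d_vocab)
embedded_slow = one_hot @ W_embed                          # (3, d_model)

# What is actually done: index into the embedding matrix.
embedded_fast = W_embed[tokens]                            # (3, d_model)

assert torch.allclose(embedded_slow, embedded_fast)

# Unembed: R^d_model -> R^d_vocab, i.e. one logit per "channel"/token.
W_unembed = torch.randn(d_model, d_vocab)
logits = embedded_fast @ W_unembed                         # (3, d_vocab)
```

The rows of `W_embed` are exactly the per-token embeddings, so indexing and the 1-hot matmul give identical results.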
Yeah, I was playing around with using a VAE to compress the logits output from a language transformer. I did indeed settle on treating the vocab size (e.g. 100,000) as the ‘channels’.
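Roughly the shape of what I mean (a hypothetical sketch with made-up sizes, not the actual code): the logits take the place of an image, with the vocab dimension as the channel dimension, and the encoder convolves over the sequence dimension.

```python
import torch
import torch.nn as nn

d_vocab, seq_len, d_latent = 100_000, 128, 64    # made-up sizes

# Treat the vocab dimension of the logits as the "channel" dimension,
# analogous to RGB channels in an image, and compress along the sequence.
logits = torch.randn(1, d_vocab, seq_len)        # (batch, channels=d_vocab, seq)

# In practice you would want something cheaper than a dense 100k-channel conv;
# this is only to illustrate the vocab-as-channels framing.
encoder = nn.Sequential(
    nn.Conv1d(in_channels=d_vocab, out_channels=512, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(512, 2 * d_latent, kernel_size=3, padding=1),  # means and log-variances
)

mu, logvar = encoder(logits).chunk(2, dim=1)          # each (1, d_latent, seq_len)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
```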