People sometimes think of transformers as specifying joint probability distributions over token sequences up to a given context length. However, I think this is sometimes not the best abstraction. From another POV, a transformer takes as input a sequence of embedding vectors, each of dimension d_model, and maps it to a sequence of output logit vectors, each of dimension equal to the vocabulary size.
This framing is important in that it emphasizes the actual implementation of the transformer. It helps avoid possible missteps, like thinking of transformers as “trying to model text” or of various forms of finetuning as basically “conditioning” the model, and it also accounts for the success of e.g. soft prompts, which don’t correspond to any token sequence.
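A minimal PyTorch sketch of this view (hypothetical sizes; positional encodings and causal masking omitted for brevity): the transformer body is just a function on d_model-dimensional vectors, token IDs only enter through the embedding lookup, and a soft prompt is a set of learned vectors in embedding space that need not equal any row of the embedding matrix.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, d_model, n_ctx = 50_000, 768, 128

embed = nn.Embedding(vocab_size, d_model)      # token IDs -> embedding vectors
body = nn.TransformerEncoder(                  # stand-in for the transformer stack
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True),
    num_layers=2,
)
unembed = nn.Linear(d_model, vocab_size)       # embedding vectors -> logits

def forward_from_embeddings(x: torch.Tensor) -> torch.Tensor:
    """Map (batch, seq, d_model) embedding vectors to (batch, seq, vocab_size) logits."""
    return unembed(body(x))

# Ordinary use: a token sequence is first turned into embedding vectors.
tokens = torch.randint(vocab_size, (1, n_ctx))
logits_from_tokens = forward_from_embeddings(embed(tokens))

# Soft prompt: arbitrary learned vectors in embedding space, prepended to the
# token embeddings. They don't correspond to any token sequence, yet the model
# processes them exactly like token embeddings.
soft_prompt = nn.Parameter(torch.randn(1, 8, d_model))
logits_with_soft_prompt = forward_from_embeddings(
    torch.cat([soft_prompt, embed(tokens)], dim=1)
)
```

The point of the sketch is that nothing in `forward_from_embeddings` knows about tokens at all: the "distribution over token sequences" picture only appears once you restrict the inputs to rows of the embedding matrix.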